What’s Your Story: Emre Kiciman

In the Microsoft Research Podcast series What’s Your Story, Johannes Gehrke explores the who behind the technical and scientific advancements helping to reshape the world. A systems expert whose 10 years with Microsoft spans research and product, Gehrke talks to members of the company’s research community about what motivates their work and how they got where they are today. 

In this episode, Gehrke is joined by Senior Principal Research Manager Emre Kiciman. Kiciman’s work in causal machine learning has resulted in tools for finding meaning in data, including the DoWhy library for modeling and testing causal assumptions, and his study of AI is focused on advancing toward systems that not only are more secure but are as positive in their impact as possible. In this episode, Kiciman shares how a side business pursued by his dad opened the door to computing; why his PhD adviser strongly recommended not using the words “artificial intelligence” in his thesis; and the moments that precipitated his moves from systems and networking to computational social science and now causal analysis and large-scale AI applications.

Emre Kiciman - panel of three photos from childhood

Learn more:

Emre Kiciman at Microsoft Research 

AI Controller Interface: Generative AI with a lightweight, LLM-integrated VM 
Microsoft Research blog, February 2024 

AICI: Prompts as (Wasm) Programs 
GitHub repo 

AI Frontiers: The future of causal reasoning with Emre Kiciman and Amit Sharma 
Microsoft Research Podcast, June 2023 

Modeling the Data-Generating Process is Necessary for Out-of-Distribution Generalization 
Publication, January 2023

An Open Source Ecosystem for Causal Machine Learning 
PyWhy.org 

U Rank Demo Screencast 
September 2008 

Transcript

[TEASER]     

[MUSIC PLAYS UNDER DIALOGUE]

EMRE KICIMAN: I think it’s really important for people to find passion and joy in the work that they do. At some point, do the work for the work’s sake. I think this will drive you through the challenges that you’ll inevitably face with any sort of project and give you the persistence that you need to really have the impact that you want to have. 

[TEASER ENDS]  

JOHANNES GEHRKE: Microsoft Research works at the cutting edge. But how much do we know about the people behind the science and technology that we create? This is What’s Your Story, and I’m Johannes Gehrke. In my 10 years with Microsoft, across product and research, I’ve been continuously excited and inspired by the people I work with, and I’m curious about how they became the talented and passionate people they are today. So I sat down with some of them. Now, I’m sharing their stories with you. In this podcast series, you’ll hear from them about how they grew up, the critical choices that shaped their lives, and their advice to others looking to carve a similar path.   

[MUSIC FADES] 


In this episode, I’m talking with Emre Kiciman, the senior principal research manager leading the AI for Industry research team at Microsoft Research Redmond. After completing a PhD in systems and networking in 2005, Emre began his career with Microsoft Research in the same area, studying reliability in large-scale internet services. Exposure to social data inspired him to refocus his research pursuits: his recent work in causal analysis—including DoWhy, a Python library for causal inference—is helping to connect the whats and whys in the abundance of data that exists. Meanwhile, his work with large language models is geared toward making AI systems more secure and maximizing their benefit to society. Here’s my conversation with Emre, beginning with some of his work at Microsoft Research and how he landed in computer science. 

GEHRKE: Welcome to What’s Your Story. So can you just tell us a little bit about what you do at MSR [Microsoft Research]?

KICIMAN: Sure. I work primarily on two areas at the moment, I guess. One is causal analysis, where we work on trying to answer cause-and-effect questions from data in a wide variety of domains, kind of, building that horizontal platform. And I work a lot recently, especially with this large language model focus, on the security of AI-driven systems: how do we make sure that these AI systems that we’re building are not opening up new vulnerabilities to attackers? 

GEHRKE: Super interesting. And maybe we can start out even before we go more in depth into that by, you know, how did you actually end up in computer science? I learned that you grew up in Berkeley. 

KICIMAN: Yeah, on average, I like to say.  

GEHRKE: On average? [LAUGHTER] 

KICIMAN: So I moved to the US with my parents when I was 2 years old, and we lived in El Cerrito, a small town just north of Berkeley. And then around middle school age, we moved to Piedmont, just south of Berkeley. So on average, yes, I grew up in Berkeley, and I did end up going there for college. And you asked about how I got into computer science. When I was probably around third or fourth grade, my dad, who was a civil engineer, decided that he wanted to start a business on the side, and he loved software engineering and wanted to build software to help automate a lot of the more cumbersome design tasks in the design of steel connections, and so he wrote … he bought a PC and brought it home and started working on his work. But then that was also my opportunity to learn what a computer was. 

GEHRKE: So that was your first computer? Was it an x86? 

KICIMAN: Yes, it was an IBM PC, the first x86, the one before the 286. And—it wasn’t the very original PC. It did have a CGA—color graphics adapter—so we could have four colors at once.  

GEHRKE: Nice. 

KICIMAN: And, yeah, that’s … it came with—luckily for me, I guess—it came with a BASIC manual. So reading that manual is how I learned how to program. 

GEHRKE: And this is the typical IBM white box with a monitor on top of it and a floppy drive, or how should I picture it? 

KICIMAN: Yeah, two floppy drives …  

GEHRKE: Two floppy drives? OK …  

KICIMAN: Two floppy drives, yeah, so you could copy from one to the other.  

GEHRKE: Five and a quarter or three and a half? 

KICIMAN: Five and a quarter, yeah, yeah. The loud, clickety-clack keyboard and, yeah, a nice monitor. So not the green and black; the one that could display the colors. And, yeah, had a lot of fun with programming. 

GEHRKE: So what were some of the first things that you wrote? 

KICIMAN: A lot of the first ones were just the examples from the book, the for loops, for example. But then after that, I started getting into some of the, you know, building, like, little mini painting tools. You know, you could move a cursor around the screen, click a button and paint to fill in a region, and then save the commands that you did to make graphics. Eventually, that actually turned into, like, a friend and I really enjoyed playing computer games, so we had in our mind we’re going to build a computer game. 

GEHRKE: Who doesn’t think that.  

KICIMAN: Of course, right? 

GEHRKE: Of course … 

KICIMAN: And so we had, like, a “choose your own adventure”–style program. I think we had maybe even four or five screens you could step through, right. And he was able to get some boxes, and we printed some manuals even. We had big plans, but then we didn’t know what to do, how to finish the game, how to get it out there, so … but we had a lot of fun.  

GEHRKE: Wow, that sounds amazing. 

KICIMAN: Really fond memories, yeah. 

GEHRKE: That sounds amazing. And then you went to Berkeley afterwards? Is that how you realized your passion, or how do you decide to study computer science?

KICIMAN: Yeah … so from that age, I was set on computing. I think my parents were a bit of a devil’s advocate. They wanted me to consider my options. So I did consider, like, mechanical engineering or industrial engineering in, like, maybe junior year of high school, but it never felt right. I went into computing, had a very smooth transition into Berkeley. They have a local program where students from the local high school can start to take college classes early. So I’d even started taking some computer classes and then just went right into my freshman year. 

GEHRKE: Sounds like a very smooth transition. Anything bumpy? Anything bumpy on the ride out there, or …?  

KICIMAN: Nothing really, nothing really bumpy. I had one general engineering class that somehow got on my schedule at 8 AM freshman year. 

GEHRKE: [LAUGHS] That’s a tough one.  

KICIMAN: That’s a tough one, yeah. And so there were a few weeks I didn’t attend class, and I knew there was a midterm coming up, so I show up. Because, you know, next week, there’s a midterm. I better figure out what they’re, what they’re learning. And I come in a couple minutes late because it’s, even though I’m intending to go, it’s still an 8 AM class. I show up a few minutes late, and everyone is heads down writing on pieces of paper. The whole room is quiet. And the TA gives me a packet and says, you might as well start now. “Oh no.” And I’m like freaking out. Like this is, this is a bad dream. [LAUGHS] And I’m flipping through … not only do I not know how to answer the questions; I don’t understand the questions, like the vocabulary. It’s only been three weeks. How did they learn so much? And then I noticed that it’s an open-book exam and I don’t have my book on top of it, like … but what I didn’t notice and what became apparent in about 20 minutes … the TA clapped his hands, and said, “All right, everyone, put it down. We’ll go over the answers now.” It was a practice. 

GEHRKE: Oh, lucky you. 

KICIMAN: Oh, my god, yes. So I did nothing but study for that exam for the next week and did fine on it. 

GEHRKE: So you didn’t have to drop the class or anything like that? 

KICIMAN: No, no, no. I studied enough that I did reasonably, you know, reasonably well.  

GEHRKE: At what point in time was it clear to you that you wanted to do a PhD or that you wanted to continue your studies? 

KICIMAN: I tried to explore a lot during my undergrad, so I did go off to industry for a summer internship. Super fun.  

GEHRKE: Where did you, where did you work?  

KICIMAN: It was Netscape. 

GEHRKE: Oh Netscape. 

KICIMAN: And it was a joint project with IBM. 

GEHRKE: Which year was that in? 

KICIMAN: This would have been ’90, around ’93.[1]

GEHRKE: ’93 … OK, so the very early days of Netscape, actually. 

KICIMAN: Yeah, yeah. They were building Netscape Navigator 4, and the project I was on was Netscape Navigator for OS/2.  

GEHRKE: OK.

KICIMAN: IBM’s OS/2 had come out and was doing poorly against NT, and they wanted to raise its profile. And this team of 20 people were really just focused on getting this out there. And so I always thought of, you know—and I was an OS/2 user already, which is how I got onto that project. 

GEHRKE: OK … And how was the culture there, or …?  

KICIMAN: The culture, it’s what you would think of as a startup culture. You know, they gave out all their meals. There was lots of fun events. You know, dentists came into the parking lot like once a month or something like that. 

GEHRKE: Dentist?  

KICIMAN: There was, like, a yeah, it was, yeah, you know, everyone’s working too much at the office, so the company wanted to make things easy.  

GEHRKE: That sounds great. 

KICIMAN: But the next summer then, I did a research internship, a research assistantship, at Berkeley. I worked with Randy Katz and Eric Brewer and got into, you know, trying to understand cellphone networks and what they were thinking about, you know, cloud infrastructure for new cellular technologies. 

GEHRKE: And Eric Brewer, was he, at that point in time, already running Inktomi, or … ? 

KICIMAN: He was already running Inktomi. Yeah, yeah, he’d already started it. I don’t think it was public yet at the time, but maybe getting there.  

GEHRKE: OK. Well, this was right at the beginning when, like, all the, you know, cloud infrastructure was defined and, you know, a lot of the basics were set. So you did this internship then in your, after your junior year, the second one?  

KICIMAN: Yeah, after my junior year. It was then senior year, and it was time to apply for, you know, what’s going to come after college. And I knew it … after that assistantship at Berkeley, I knew I was going to go do a PhD. 

GEHRKE: So what is the thing about the internship that made you want to stay in research? 

KICIMAN: Oh, it’s just the … it gave a vision of the future. Like, we were playing with, like, you know, there were people in the lab playing with video over the internet and, you know, teleconferencing, and just seeing that, it felt like you were seeing into the future and diving deep technically across the stack in a way that the industry internship hadn’t done. And so that part of it and obviously lots of particulars. You know, lots of internships do go very deep in industry, as well, but that’s what struck me, is that, kind of, wanting to learn was the big driver.  

GEHRKE: And what excited you about systems as compared to something that’s more applications-oriented or more touching the user? I feel like systems you always have to have this, kind of, drive for infrastructure and for scale and for, you know, building the foundation as compared to, like, directly impacting the user. 

KICIMAN: I think the way I think about systems today—and I can’t remember what it was about systems then. I’d always done operating … like, operating systems was one of my first upper-division courses at Berkeley and everything. So, like, I certainly enjoyed it a lot. But the way I think about systems now—and I think I do bring systems thinking to a lot of the work I do, even in AI and responsible AI—is the way you structure software, it feels like you should be making a statement about what the underlying problem is, what is the component you should be building from an elegance or first-principles perspective. But really, it’s about the people who are going to be using and building and maintaining that system. You want to componentize it so that the teams who are going to be building the bigger thing can work independently, revise and update their software without having to coordinate every little thing. I think that’s where that systems thinking comes in for me, is what’s the right abstraction that’s going to decouple folks from each other. 

GEHRKE: That’s a really great analogy because the way it was once told to me was that systems is really about discovering the beauty in large software. Because once you touch the user, you, sort of, have to do whatever is necessary to, you know, make the user happy. But in the foundations, you should have simplicity; you should have ease; you should have elegance. Is that how you think about it? 

KICIMAN: I do think about those aspects, but it’s for a purpose. You know, you want the elegance and the simplicity so that you can have, you know, one team working on Layer 1 of the stack, another team working on Layer 2 of the stack, and you don’t want them to have to talk to each other every 10 minutes when they’re making any change to any line of code, right. And so thinking about, what is the more fundamental layer of abstraction that lets these people work on separate problems? That’s what’s important to me. And, of course, like, that then interplays with people’s interests and expertise. And as people’s expertise evolves, that might mean that that has implications for the design of your system.  

GEHRKE: And so you’re, OK, you’re an undergrad. You have done this research experience; you now apply. So now you go to grad school. Do you do anything fun between your undergrad and grad school? 

KICIMAN: No, I went straight in. 

GEHRKE: Right straight in?  

KICIMAN:  Right straight in. I did my PhD at Stanford. So I went, you know, a little way to school. 

GEHRKE: To a rival school, isn’t it? Isn’t it a big rival school? 

KICIMAN: To a rival school. Well, the undergrad school wins. I think that’s the general rule of thumb. But I did continue working with folks at Berkeley. So my adviser was also from Berkeley and so …  

GEHRKE: Who was your adviser? 

KICIMAN: My adviser was Armando Fox, …  

GEHRKE: OK, yeah. Mm-hmm.  

KICIMAN: and we had a … 

GEHRKE: Recovery-oriented computing? 

KICIMAN: Yes, exactly. Recovery-oriented computing. And the other person on the recovery-oriented computing project …  

GEHRKE: Dave Patterson …  

KICIMAN: … was Dave Patterson, yeah. 

GEHRKE: So it was really a true, sort of, Stanford-Berkeley joint project in a way?

KICIMAN: Yes, yeah. And that was my PhD. The work I did then was the first work to apply machine learning to the problem of fault detection and diagnosis in large-scale systems. I worked with two large companies—one of them was Amazon; one of them was anonymous—to test out these ideas in more realistic settings. And then I did a lot of open-source work with J2EE to demonstrate how you can trace the behavior of a system and build up models of its behavior and detect anomalies. Funnily enough, I know this is going to sound a little alien to us now maybe in today’s world: Dave and Armando would not let me use the phrase “artificial intelligence” anywhere in my thesis because they were worried I would not be able to get a job. 

GEHRKE: I see. Because that was, sort of, one of … I mean, AI goes through these hype cycles and then, you know, the winters again, and so this was one of the winter times? 

KICIMAN: This was definitely a wintertime. I was able to use the phrase “machine learning” in the body of the thesis, but I had to make up something about statistical monitoring for the title. 

GEHRKE: So what is the actual final title of your thesis, if you remember it? 

KICIMAN: “Statistical monitoring for fault detection and diagnosis in large-scale internet services” or something like that. 

GEHRKE: Makes sense. 

KICIMAN: Yeah. 

GEHRKE: So you replaced AI with statistical modeling and then everything [turned out all right]? 

KICIMAN: Yes, yeah. Everything … then it didn’t sound too hype-y. 

GEHRKE: And then after your PhD, you went straight to MSR, is that right? 

KICIMAN: Yeah. I mean, so here I’m coming out of my PhD with a focus on academic-style research for large-scale systems. Kind of boxed myself in a little bit. No university has a large-scale internet service, and most large-scale internet service companies don’t have research arms. So Microsoft Research was actually the perfect fit for this work. And when I got here, I started diving in and actually expanding a little bit and thinking about what are the end-to-end reliability issues with our services. So assume that the back end is running well. What else could go wrong that’s going to get in the way of the user? So I had one project going on, wide area network reliability with David Maltz, and one project …  

GEHRKE: Who is now CVP in Azure.  

KICIMAN: Who’s now, yeah, leading Azure network—the head of Azure networking. And one project on how we can monitor the behavior of our JavaScript applications that were just starting to become big. Like around then is when, you know, the first 10,000-line, 100,000-line-of-code JavaScript applications [were] appearing, and we had no idea whether they were actually running correctly, right? They’re running on someone else’s browser and someone else’s operating system. We didn’t know.  

GEHRKE: A big one at that point in time, I think was Gmail, right? This was, sort of, a really big one. But did we have any big ones in Microsoft? 

KICIMAN: Gmail was the first big one in the industry. 

GEHRKE: Hotmail, was it also Java, based in JavaScript? 

KICIMAN: Hotmail was not initially JavaScript based. The biggest one at that time was our maps. Not Bing maps, but whatever we called it.  

GEHRKE: MSN maps, or …  

KICIMAN: Probably something like that, yeah, yeah.  

GEHRKE: I see. And so you applied your techniques to that code base and tried to find a lot of bugs? 

KICIMAN: Yeah, this project was—and this was about data gathering, right, so I’m still thinking about it from the perspective of how do I analyze data to tell me what’s going on. We had data for the wide area network, but these web applications, we didn’t have any. So I’m, like, I’m going to build this infrastructure, collect the data, so that in a couple years, I can analyze it. And so what I wrote was a proxy that sat on the side of the IIS server and just dynamically instrumented all the JavaScript that got shipped out. And the idea was that no one user was going to pay the cost of the instrumentation, but everyone would pay a little small percentage, and then you could collect it in the back end to get the full complete picture.  

GEHRKE: Right. It’s so interesting because, I mean, in those days, right, you still thought maybe in terms of years and so on, right. I mean, you’ve said, well, I instrumented, then maybe in a year, I have some data. And today it happens that I instrument, and tomorrow I have enough data to make a decision on an A/B test and so on, right. It was a very different time, right. And also, it was probably a defining time for Microsoft because we moved into online services, right. We moved into large-scale internet services. So it must have been exciting to be in the middle of all of this. 

KICIMAN: It really was. I mean, there was a lot of change happening both inside Microsoft and outside Microsoft. That’s when … soon after this is when social networking started to become big, right. You started seeing Facebook and Twitter show up, and search became a bigger deal for Microsoft when we started investing in Windows Live and then Bing, and that’s actually … my manager, Yi-Min Wang, actually joined up with Harry Shum to create the Internet Services Research Center with the specific focus of helping Bing. And so that also shifted my focus a little bit and so had me looking more at some of the social data that would, kind of, take my trajectory on a little bit further.

GEHRKE: Right. I mean, so you’re unique in that, you know, people very often, they come in here and, you know, they’re specialists in systems, and they branch out within systems a little bit and, you know, of course, move with time. Maybe now they do, you know, AI infrastructure. But you have really moved quite a bit, right. I mean, you did your PhD on systems … I mean, systems and AI really, the way I understand it. Then you worked here a little bit more on systems in wide area and large-scale systems. But then, you know, you really became also an expert in causality and looked at, sort of, the social side. And now you, of course, have started to move very deeply into LLMs. So rather than talking about the topics itself, how do you decide? How do you make these decisions? How do you … you know, you’re a world expert on x, and how do you, in some sense, throw it all away and go to y? Do you decide one day, “I’m interested in y“? Do you, sort of, shift over time a little bit? How do you do it? 

KICIMAN: I’ve done it, I think, two or maybe three times, depending on if you count now, and some transitions have gone better than others. I think my transition from systems to social data and computational social science, it was driven by a project that we did for search at the time. Shuo Chen, another researcher here at Microsoft Research, built a web application that lets you give very concrete feedback back to Windows Live. You could drag and drop the results around and say, this is what I wanted it to look like. And this made, you know, feedback much more actionable and helped really understand DSATs and where they’re coming from. DSAT being dissatisfactions. And I looked at that and I was like, I want to be able to move search results around and share with my friends. And I, kind of, poked at Shuo, you know, asked him if he would build this, and he said no. He said he’s busy. So eventually, I—because I knew something about JavaScript applications—decided to just drop things and spend six months building out this application. So I built out this social search application where you could drag and drop search results around, share it with your friends, and we put it out, actually. We got it deployed as an external service. We had maybe 10,000 people kick the tires.  

GEHRKE: Within Microsoft or …?  

KICIMAN: No, externally.  

GEHRKE: OK.  

KICIMAN: Yeah. There was a great headline that, like, Google then fast followed with a similar feature, and the headline was like, Google fast follows, basically, on Microsoft. Our PR folks were very excited about that. I say this all … I mean, it’s all history now. But certainly, it was fun at the time. But now we’re … I’m giving this demo, this talk, about this prototype that we built and what we’re learning about, you know, what’s in people’s way, what’s friction, what do they like and not like, etc. And I’m standing up and, you know, giving this presentation, this demo, and someone says, hey could you, could you go back to, you know, go back in the browser? On the bottom right corner, it says Mike did something on this search page; he edited some search results. Could you click on that? I want to know what he did. I’m like, OK, yeah, sure. I click on it. And [it’s like], OK, that’s great. That’s, that’s really interesting. And this happened multiple times. Like, in a formal presentation, for someone to interrupt you and ask a personal question just out of their own curiosity, that’s what showed me … that’s what got me really thinking deeply about the value of this social data and, like, why is it locked up in a very specific interface. What else could you do with this data if it’s so engaging, so fascinating, that people are willing to interrupt a speaker for some totally irrelevant, basically, question? And that’s when I switched to really trying to figure out what to do with social data. 

GEHRKE: I see. So it was this, kind of, really personal experience of people being so excited about that social interaction on the demos that you’re giving. 

KICIMAN: Exactly. They cared about their friends and what their friends did, and that was super clear.

GEHRKE: So, so coming back, let’s go there in a second, but coming back to the story that you told, you said you had 10,000 external users. 

KICIMAN: Yeah.

GEHRKE: So I’m still, you know, also always trying to learn what we can do better because we sometimes have prototypes that are incredibly valuable. They’re prototypes that have fans; they’re prototypes that, you know, the fans even want to contribute. But then somehow, we get stuck in the middle; and they don’t scale, and they don’t become a business. What happened with that?

KICIMAN: Yeah. 

GEHRKE: Also in [retrospect], … 

KICIMAN: In retrospect … 

GEHRKE: … what, what … should we have done something different, or did it live up to its potential? 

KICIMAN: I think we learned something. I think that there were a couple of things we learned. One was that, you know, every extra click that people wanted to do, you know, took the number of interactions down by, you know, an order of magnitude. So starring something and bringing it to the top, that was very popular. Dragging and dropping? Little bit less so. Dragging and dropping from one search to a different search? So maybe I’ll search for, you know, “Johannes,” find your homepage, and then drag and drop it to, like, people’s, you know, publications list to, like, keep an eye on or something. Like that, almost never. And people were very wary about editing the page. Like, what if I make a mistake? What if it’s just, just me, like, who wants this, and I’m messing up search for the rest of the world? And it’s like, no, no, it’s just your friends, like just you and your friends who are going to see this. And so we learned a lot about people’s mental models and, like, what stood in the way of, you know, interactions on the web. There were lots of challenges to doing this at scale. I mean, we needed, for example, a way of tracking users. We needed a way of very quickly, within 100 milliseconds, getting information about a user’s past edits to search pages into, you know, into memory if we were going to do this for real on Windows Live. And we just didn’t have the infrastructure.

GEHRKE: I see. And those problems were hard in those days. 

KICIMAN: Yeah. A prototype is fine. People, you know, will handle a little bit of latency if it’s a research prototype, but for everyday use, you need something more. 

GEHRKE: And there was no push to try it, to land it somehow, or what … ?  

KICIMAN: There were big pushes, but the infrastructure, it was really … 

GEHRKE: I see. It was really an infrastructure problem, then? 

KICIMAN: Yeah, yeah. 

GEHRKE: OK. Interesting because it sounds to me like, wow, there’s an exciting research problem there; now you need the infrastructure to try to make all of these things really, really fast. It’s always fascinating to see, you know, where things get stuck and how they, how they proceed. 

KICIMAN: Yeah, I think it’d be a lot easier to build that—from an infrastructure point of view—today. But, of course, then there’s lots of other questions, like is this really what, you know, the best thing to do. Like I mentioned, Google had this fast follow feature. They also removed it afterwards, as well.  

GEHRKE: OK. Yeah, hindsight is always, you know, twenty-twenty. So, OK, so you’re now starting to move into social computing, right, and trying to understand more about social interactions between users. How did you end up in causality, and then how did you make the switch to LLMs? And maybe even more about this; I mean, I understand here this was, sort of, this personal story that you really saw that, you know, the audience was really asking you about what’s happening here and that, sort of, motivated you. Was it always this personal drive, or was it always others who pulled you? And how did you make these switches? 

KICIMAN: I think the switch from systems into social, it was about trying to get closer to problems that really mattered to people. I really enjoy working on systems problems, but oftentimes, they feel like they’re in the back end. And so I wanted something where, you know, even if I’m not the domain expert working on something, I can feel like I’m making a contribution to that problem. The transition with social data then into causality and, um, and LLMs, that was a bit smoother. So working with social data, trying to understand what it meant and what it said about the world in aggregate, was super-fascinating problems. So much information is embedded in the digital traces that people leave behind. But it was really difficult for people to come to solid conclusions. So there was one conference I went to where almost every presentation that day gave some fascinating insight. This is how people make friendships. This is how, you know, we’re seeing, like, signs of disease spread in, you know, through real-world interactions as they’re in social data. Here’s how people spend their time. And then people would, and then people would close; their conclusion slide every time was, “And, of course, correlation is not causation, so anything could actually be happening.” Like, that is such, that is such a bummer. Like, beautiful theory, great understanding. You spent so much time. I feel like I got some insight. And then you pull the rug out and say, but maybe not. And I’d heard about this work on … that there was work on causal analysis and that there were certain conditions and ways to get actual learned causal relationships from data. So that’s the day I decided I’m going to go figure out what that is and how to apply it to social data for these types of questions. And I went out, and the first work there was a collaboration with Munmun De Choudhury, faculty at Georgia Tech, looking at online traces related to mental health and suicidal ideation and trying to understand what some of the factors were in a more, in a more solid and causal fashion. And so this really became, like, this was … this interest in computational social science really ended up branching out into two areas. One, obviously, I’m caring about, what can we learn about the world? Part of this is, of course, thinking deeply about the implications of AI on society, like what is it going to mean that we have this data for all of these, you know, societal challenges? And then causality. So the AI and its implications on society is what led towards the work on the security of AI systems and now security of AI as it relates to large language models. And then causality was the other branch that split off from there. Both of them really stemming from this desire to see that we have a positive impact with AI.

GEHRKE: So you mentioned that, you know, you were sitting in these talks and people are talking about the correlation, and now you finally have this new tool, which is causation. So what are some of the examples where, you know, with correlation you came out with answer A, but now causation gave you some better, some real deep insights? 

KICIMAN: I haven’t gone looking to refute studies, so … 

GEHRKE: I see. OK.  

KICIMAN: … but there are many well-known studies in the past where people have made mistakes because they didn’t account for the right confounding variables. Ronny Kohavi has a great list of these on one of his websites. But a fun one is a study that came out in the late ’90s on the influence of night lights on myopia in children. So this was a big splash. I think it made it to like Newsweek or 60 Minutes and stuff, that if you have night lights in the house, your kids are more likely to need glasses. And this was wrong. 

GEHRKE: My parents told me all the time, don’t read in bed, you know, with your flashlight because your eyes are going to get bad. 

KICIMAN: Yes.  

GEHRKE: That’s the story basically, right? 

KICIMAN: This was, yeah, the night lights that plug in the wall.  

GEHRKE: But that’s the …  

KICIMAN: That’s the idea, the same thing. 

GEHRKE: The same thing, right. 

KICIMAN: And so these people analyzed a bunch of data, and they found that there was a correlation, and they said that, you know, it’s a cause; you know, this is a cause. And the problem was that they didn’t account for the parents’ myopia. Apparently, parents who had myopia were more likely to install night lights. And then you have the genetic factor then actually causing the myopia. Very simple. But, you know, people have to replicate this study to, you know, to realize it was a mistake. Others were things like correlations, I think, around vitamin C have been reported repeatedly and then refuted in randomized control trials. But there’s many of these. Medicine, in particular, has a long history of false correlations leading people astray. 

GEHRKE: Do you have a story where here at Microsoft your work in causation had a really big impact? 

KICIMAN: You know, the one—it’s still ongoing—but one of the ones that I’m really excited about now, and thinking also from the broader societal impact lens, is a collaboration with Ranveer Chandra and his group. So with a close collaborator at MSR India, Amit Sharma, we’ve developed a connection between representation learning and underlying causal representation of the data-generating process that’s driving something. So if you imagine, like, we want to learn a classifier on an object, on an image, and we want that classifier to generalize to other settings, there’s lots of reasons why this can go wrong. You know, you have, you know, like a classic example is the question of, is this picture showing you a camel, or is it showing you a cow? The classifier is much more likely to look at the background, and if it’s green grass, it’s probably a cow. If it’s sandy desert, it’s probably a camel. But then you fail if you look at a camel in the zoo or a cow on a beach, right. So how do you make sure that you’re looking at the real features? People have developed algorithms for these. But no algorithm actually is robust across all the different kinds of distribution shifts that people see in the real world. Some algorithms work on these kinds of distribution shifts. Some algorithms work on those kinds of distribution shifts. And it was a bit of an interesting, I think, puzzle as to why. And so we realized that these distribution shifts, if you look at them from a causal perspective, you can see that the algorithms are actually imposing different statistical independence constraints. And you can read those statistical independence constraints off of a causal graph. And the reason that some algorithms worked well in some settings was that the underlying causal graph implied a different set of statistical independence constraints in that setting. And so that algorithm was the right one for that setting. If you have a different causal graph with different statistical independence constraints, the other algorithm was better. And so now you can see that no one algorithm is going to work well across all of them. So we built an adaptive algorithm that looks at the causal graph, picks the right statistical independencies, and applies them, and now what we’re doing with this algorithm is we’re applying it to satellite imagery to help us build a more generalizable, more robust model of carbon in farm fields so we can remotely sense and predict what the carbon level is in a field. And so, the early results …  

GEHRKE: And that’s important for what?

KICIMAN: And so this is important because soil is seen as a very promising method for sequestering carbon from a climate change perspective. And it’s also the more carbon there is … the higher your soil carbon, usually the healthier the soil is, as well. It’s able to absorb more water, so less flooding; your crops are more productive because of the microbial growth that’s happening. And so people want to adopt policies and methods that increase the soil carbon in the fields for all of these reasons. But measuring soil carbon is really intensive. You have to go sample it, take it off to a lab, and it’s too expensive for people to do regularly. And so if we can develop remote-sensing methods that are able to take a satellite image and, you know, really robustly predict what the real soil carbon measurement would be, that’s really game changing. That’s something that, you know, will help us evaluate policies and whether they’re working; help us evaluate, you know, what the right practices should be for a particular field. So I’m really excited about that.  

GEHRKE: That’s really exciting. You’d mentioned when we talked before that you’d benefited in your career from several good mentors. How do you think about mentoring, and what are the ways that you benefited from it? And how do you, you know, live that now in your daily life as you’re a mentor now to the next generation? 

KICIMAN: Yeah, the way I look at all the people—and there’s so many—who have, you know, given me a hand and advice and, you know, along the way, I often find I pick up on some attributes of my mentors, of a particular mentor, and find that it’s something that I want to emulate. So recognizing, you know, everyone is complicated and no one is perfect, but, you know, there’s so many ways that, you know, individuals get things right and trying to understand what it is that they’re doing right and how I can try and repeat that for, like, you said, the next generation, I think, is really, really important. It’s like one story, for example, around 2008, while I was still working on large-scale internet services, I was going around the company to, kind of, get a sense of, you know, what’s the current state of the reliability of our services and how we architect them and run them. And so I was talking to developers and architects and Ops folks around the company, and James Hamilton was a great mentor at that moment, helping me to connect, helping suggest questions that I might ask. 

GEHRKE: So he was working on SQL Server reliability, right, at that point in time or on Windows reliability? 

KICIMAN: He was already starting to move over into datacenter reliability. I think at the time, right before he moved over to the research side of things, I think he was one of the heads of the, of our enterprise email businesses, and then he came over to research to focus on, I think, datacenters in general. And, yeah, and he just donated so much of his time. He was so generous with, you know, reviewing this large report that I was writing and just helping me out with insights. That struck me as, like … he’s a very busy person. He’s doing all this stuff, and he’s spending, you know, I sent him an email with, you know, 15 pages, and he responds with feedback within a couple of hours every morning. That was astonishing to me, especially in hindsight, and so … but that kind of generosity of time and trying to help direct people’s work in a way that’s going to be most impactful for what they want to achieve, that’s something I try and emulate today. 

GEHRKE: So, so, you know, you’ve benefited from a lot of great mentors and you said you’re now also a mentor to others. Do you have any last piece of advice for any of our listeners? 

KICIMAN: I think it’s really important for people to find passion and joy in the work that they do and, at some point, do the work for the work’s sake. I think this will drive you through the challenges that you’ll inevitably face with any sort of project and give you the persistence that you need to really have the impact that you want to have. 

GEHRKE: Well, thanks for that advice. And thanks for being in What’s Your Story, Emre. 

KICIMAN: Thanks very much, Johannes. Great to be here.  

[MUSIC] 

To learn more about Emre or to see photos of Emre as a child in California, visit aka.ms/ResearcherStories. 

[MUSIC FADES] 


[1] Kiciman later noted the year he interned at Netscape was 1997. 

Research Focus: Week of July 29, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Scalable Differentiable Causal Discovery in the Presence of Latent Confounders with Skeleton Posterior

Differentiable causal discovery has made significant advancements in the learning of directed acyclic graphs. However, its application to real-world datasets remains restricted due to the ubiquity of latent confounders and the requirement to learn maximal ancestral graphs (MAGs). Previous differentiable MAG learning algorithms have been limited to small datasets and failed to scale to larger ones (e.g., with more than 50 variables).

In a recent paper: Scalable Differentiable Causal Discovery in the Presence of Latent Confounders with Skeleton Posterior, researchers from Microsoft and external colleagues explore the potential of the causal skeleton, the undirected version of the causal graph, to improve accuracy and reduce the search space of the optimization procedure, thereby enhancing the performance of differentiable causal discovery. They propose SPOT (Skeleton Posterior-guided OpTimization), a two-phase framework that harnesses the skeleton posterior for differentiable causal discovery in the presence of latent confounders.

Extensive experiments on various datasets show that SPOT substantially outperforms state-of-the-art methods for MAG learning. SPOT also estimates the skeleton posterior more accurately than non-parametric bootstrap-based methods and more recent variational inference-based methods. The adoption of the skeleton posterior shows strong promise across a variety of causal discovery tasks.


Evaluating the Feasibility of Visual Imagery for an EEG-Based Brain–Computer Interface

Brain signals recorded via non-invasive electroencephalography (EEG) could help patients with severe neuromuscular disorders communicate with and control the world around them. Brain-computer interface (BCI) technology could use visual imagery, or the mental simulation of visual information from memory, as an effective control paradigm, directly conveying the user’s intention.

Initial investigations have been unable to fully evaluate the capabilities of true spontaneous visual mental imagery. One major limitation is that the target image is typically displayed immediately preceding the imagery period. This paradigm does not capture spontaneous mental imagery, as would be necessary in an actual BCI application, but something more akin to short-term retention in visual working memory.

In a recent paper: Evaluating the Feasibility of Visual Imagery for an EEG-Based Brain–Computer Interface, researchers from Microsoft and external colleagues show that short-term visual imagery following the presentation of a specific target image provides a stronger, more easily classifiable neural signature in EEG than spontaneous visual imagery from long-term memory following an auditory cue for the image. This research, published in IEEE Transactions on Neural Systems and Rehabilitation Engineering, provides the first direct comparison of short-term and long-term visual imagery tasks and provides greater insight into the feasibility of using visual imagery as a BCI control strategy.


Evolving Roles and Workflows of Creative Practitioners in the Age of Generative AI

Many creative practitioners – designers, software developers, and architects, for example – are using generative AI models to produce text, images, and other assets. While human-computer interaction (HCI) research explores specific generative AI models and creativity support tools, little is known about practitioners’ evolving roles and workflows with models across a project’s stages. This knowledge could help guide the development of the next generation of creativity support tools.

In a recent paper: Evolving Roles and Workflows of Creative Practitioners in the Age of Generative AI, researchers from Microsoft and the University of California San Diego contribute to this knowledge by employing a triangulated method to capture information from interviews, videos, and survey responses of creative practitioners reflecting on projects they completed with generative AI. Their observations help uncover a set of factors that capture practitioners’ perceived roles, challenges, benefits, and interaction patterns when creating with generative AI. From these factors, the researchers offer insights and propose design opportunities and priorities that encourage reflection from the wider community of creativity support tool and generative AI stakeholders, such as systems creators, researchers, and educators, on how to develop systems that meet the needs of creatives in human-centered ways.


“It’s like a rubber duck that talks back”: Understanding Generative AI-Assisted Data Analysis Workflows through a Participatory Prompting Study

End-user tools based on generative AI can help people complete many tasks. One such task is data analysis, which is notoriously challenging for non-experts but also holds much potential for AI. To understand how data analysis workflows can be assisted or impaired by generative AI, researchers from Microsoft conducted a study using Bing Chat via participatory prompting, a newer methodology in which users and researchers reflect together on tasks through co-engagement with generative AI. The recent paper: “It’s like a rubber duck that talks back”: Understanding Generative AI-Assisted Data Analysis Workflows through a Participatory Prompting Study demonstrates the value of the participatory prompting method. The researchers found that generative AI benefits the information foraging and sensemaking loops of data analysis in specific ways, but it also introduces its own barriers and challenges, arising from the difficulties of query formulation, specifying context, and verifying results. Based on these findings, the paper presents several implications for future AI research and the design of new generative AI interactions.

Tracing the path to self-adapting AI agents

The games industry has long been a frontier of innovation for AI. In the early 2000s, programmers hand-coded neural networks to breathe life into virtual worlds, creating engaging AI characters that interact with players. Fast forward two decades, and neural networks have grown from their humble beginnings to colossal architectures with billions of parameters, powering real-world applications like ChatGPT and Microsoft Copilots. The catalyst for this seismic shift in AI scale and capability is the advent of automatic optimization. AutoDiff frameworks like PyTorch and TensorFlow have democratized scalable, gradient-based, end-to-end optimization. This breakthrough has been instrumental in the development of Large Foundation Models (LFMs) that now sit at the core of AI.

Today, the AI systems we interact with are more than just neural network models. They contain intricate workflows that seamlessly integrate customized machine learning models, orchestration code, retrieval modules, and various tools and functions. These components work in concert to create the sophisticated AI experiences that have become an integral part of our digital lives. Nonetheless, until now we have not had tools to automatically train these extra components. They are handcrafted through extensive engineering, just as neural networks were engineered in the early 2000s.

End-to-end automatic optimization of AI systems

The latest research from Microsoft and Stanford University introduces Trace (opens in new tab), a groundbreaking framework poised to revolutionize the automatic optimization of AI systems. Here are three highlights of the transformative potential of Trace:

  • End-to-end optimization: Trace treats AI systems as computational graphs, akin to neural networks, and optimizes them end-to-end through a generalized back-propagation approach.
  • Dynamic adaptation: It handles the dynamic nature of AI systems, where the graph can change with varying inputs and parameters and needs to adapt to various kinds of feedback.
  • Versatile applications: Trace can optimize heterogeneous parameters (such as prompts and code) in AI systems. Empirical studies showcase Trace’s ability to optimize diverse problems, including hyperparameter tuning, large language model (LLM) agents, and robot control, often outperforming specialized optimizers.

In a nutshell, Trace is a new AutoDiff-like tool for training AI systems without using gradients. This generalization is made possible by a new mathematical formulation of optimization, Optimization with Trace Oracle (OPTO), which can describe end-to-end optimization of AI systems with general feedback (such as numerical losses, natural language, and errors). Instead of propagating gradients, which are not well defined for AI systems beyond neural networks, Trace propagates Minimal Subgraphs, which can then be used to recover gradients where applicable. Trace is implemented as a PyTorch-like Python library with which users can easily create AI systems and refine them, akin to training neural networks.
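To make this PyTorch-like workflow concrete, here is a minimal sketch of the pattern: wrap a value in a traced node, build a small computation on top of it, and propagate natural-language feedback back to the trainable parameter. The names used here (opto.trace, trace.node, OptoPrime, zero_feedback, backward, step) follow the library’s description in this post, but treat the exact calls as assumptions and consult the Trace repository for the authoritative API.

```python
# Minimal sketch of OPTO-style optimization with text feedback.
# Assumes the Trace library and an LLM backend configured for OptoPrime;
# exact names and signatures may differ from the released API.
import opto.trace as trace
from opto.optimizers import OptoPrime

# Declare a trainable parameter node (here, a plain string).
greeting = trace.node("hi there", trainable=True)

# Operations on nodes are recorded into a computational graph.
message = greeting + ", welcome aboard!"

# Propagate natural-language feedback to the trainable node and let the
# LLM-based optimizer propose an update, mirroring loss.backward()/step().
optimizer = OptoPrime([greeting])
optimizer.zero_feedback()
optimizer.backward(message, "Make the greeting more formal.")
optimizer.step()

print(greeting.data)  # inspect the updated parameter value
```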

In this blog post, we are excited to announce the release of the Trace Python library. With the help of demos, we’ll show you how this powerful tool can be used to build AI agents that learn and adapt from their experiences, eliminating the need for specialized engineering.


Warm up: Building a Battleship game AI agent through learning

To start, consider building an AI agent for the classic Battleship board game. In Battleship, a player needs to devise strategies to cleverly locate and attack the opponent’s ships on a hidden board as fast as possible. To build an AI agent with Trace, one simply needs to program the workflow and declare the parameters, like programming a neural network architecture. Here we will design an agent with two components: a reason function and an act function, as illustrated in Figure 1a. We provide a basic description of what these two functions should do as docstrings, leave the function bodies blank, and mark them as trainable. At this point, the agent doesn’t know how the Battleship API works. It must not only learn how to play the game, but also learn how to use the unknown API.

The agent’s policy is defined as the composition of a reason step and an act step. The code of both steps is marked as trainable and initialized as trivial functions. A basic description of how each function is supposed to behave is provided as a docstring in the function definition.
Figure 1a: Write a Trace-trainable policy.
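As an illustration, a declaration along the lines of Figure 1a might look like the sketch below. The decorators (trace.model, trace.bundle) and the class structure are assumptions based on the description above and the library’s PyTorch-like style, not a verbatim copy of the released example.

```python
# Sketch of a Trace-trainable Battleship policy (names are illustrative).
import opto.trace as trace

@trace.model
class Policy(trace.Module):

    @trace.bundle(trainable=True)
    def reason(self, board):
        """Analyze the current board (a grid of hits, misses, and unknown
        squares) and summarize where a ship is most likely to be."""
        return "no plan yet"  # trivial initial body; Trace will rewrite it

    @trace.bundle(trainable=True)
    def act(self, board, plan):
        """Given the board and the output of reason(), return the (row, col)
        coordinate to shoot at next."""
        return (0, 0)  # trivial initial body; Trace will rewrite it

    def __call__(self, board):
        plan = self.reason(board)
        return self.act(board, plan)
```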
The agent’s policy is optimized by a simple but generic training loop, which mimics neural network training. First the agent’s policy and an iterative optimizer for it are declared. In each iteration, the agent’s policy takes a board configuration as input and outputs a target location. The environment returns feedback on whether the target successfully hits a ship or not. Alternatively, when the agent’s policy triggers any execution error, the error is used as feedback. Then the feedback is propagated to the parameters in the trainable policy for updates.
Figure 1b: Optimize using a PyTorch-like API.

We iteratively train this AI agent to play the game through a simple Python for loop, seen in Figure 1b. In each iteration, the agent (that is, the policy) sees the board configuration and tries to shoot at a target location on a training board. The environment returns in text whether it’s a hit or a miss. Then, we run Trace to propagate this environment feedback through the agent’s decision logic to update the parameters (for example, the policy is like a two-layer network with a reason layer and an act layer). These iterations mimic how a human programmer might approach the problem. They run the policy and change the code based on the observed feedback, try different heuristics to solve the problem, and may rewrite the code a few times to fix any execution errors by using stack traces.
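The corresponding training loop of Figure 1b could be sketched as follows. The environment helpers (BattleshipBoard, observation, get_feedback) are hypothetical stand-ins for the game API, and the optimizer calls mirror the PyTorch-like pattern described above; treat both as assumptions rather than the exact released code.

```python
# Sketch of the Figure 1b training loop (environment helpers are hypothetical).
import opto.trace as trace
from opto.optimizers import OptoPrime

policy = Policy()                           # the trainable policy from Figure 1a
optimizer = OptoPrime(policy.parameters())

for i in range(20):                         # a handful of training iterations
    board = BattleshipBoard()               # hypothetical: a fresh training board
    try:
        target = policy(board.observation())        # traced forward pass
        feedback = board.get_feedback(target.data)  # e.g., "Hit!" or "Miss."
    except trace.ExecutionError as e:
        # If the generated code crashes, the error itself becomes the feedback.
        target = e.exception_node
        feedback = target.data

    optimizer.zero_feedback()
    optimizer.backward(target, feedback)    # propagate text feedback through the graph
    optimizer.step()                        # LLM-based update of the trainable code
```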

In Figure 2, we show the results of this learning agent, where the agent is trained by OptoPrime, an LLM-based optimizer in Trace. The performance is measured as the scores of the agent playing on new, randomly generated games (different from the training board). We see that the agent understands the Battleship game and proposes the enumeration strategy after one iteration; then, after a few more tries, it starts to develop complex strategies for playing the game.

The experimental results show that Trace can quickly learn complex behaviors for Battleship in a few iterations. At iteration 0, the agent is initialized to output a constant coordinate. At iteration 1, the agent learns the simple strategy of enumerating the board. After a few more iterations (e.g., iteration 7), the agent learns a complex strategy that balances exploring new squares against targeting squares adjacent to previous hits. In comparison, the state-of-the-art LLM optimizer OPRO achieves less than 1/3 of Trace’s performance on this problem.
Figure 2: Trace optimizes Code-as-Parameter to create a complex Battleship AI from scratch, compared with state-of-the-art LLM-based optimizer OPRO.

Super-fast reinforcement learning agent for robot control

We can extend the same idea of end-to-end optimization to train more complicated AI systems. In this example, we want to learn policy code to control a robotic manipulator. Compared to the Battleship example, the problem here has a longer horizon, since the policy would need to drive the robot for multiple time steps before receiving any feedback. Traditionally, such a problem is framed as a reinforcement learning (RL) problem, and learning a policy with RL usually requires tens of thousands of training episodes. We show Trace can be used to effectively solve such a problem with just dozens of episodes — a 1,000-fold speed-up. We trace an entire episode and perform end-to-end updates through these steps (using the same OptoPrime optimizer). In this way, Trace effectively performs back-propagation through time (BPTT).

We conduct experiments using a simulated Sawyer robot arm in the Meta-World (opens in new tab) environment of LLF-Bench (opens in new tab), as shown in Figure 3. The agent needs to decide on a target pose for the robot, which is then used as a set point for a position controller, to perform a pick-and-place task. Each episode has 10 timesteps, which results in a graph of depth around 30. The agent receives language feedback as intermediate observations (from LLF-Bench) and, at the end, text feedback about success and episode return (i.e., the cumulative reward in RL). Like the Battleship example, we initialize the policy code to be a dummy function and let it adapt through interactions, demonstrated in Figure 4. We repeatedly train the agent starting from one initial condition, then test it on 10 new held-out initial conditions for generalization. Very quickly, after 13 episodes, we see that the agent learns complex rules to solve the problem, as shown in Figure 3 and Figure 4.
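The episode-level wiring can be sketched as follows. Here `env` is a placeholder for the LLF-Bench Meta-World pick-and-place environment (assumed to expose a gym-style reset/step interface) and `controller_policy` stands for the trainable policy code; the feedback string and the closing comment describe the update only conceptually, not the exact Trace API.

```python
# Schematic of one training episode for the robot task. `env` and `controller_policy`
# are placeholders; a gym-style step interface is assumed for illustration.
def run_episode(env, controller_policy, horizon=10):
    obs = env.reset()                        # one fixed initial condition during training
    hints, total_reward = [], 0.0
    for t in range(horizon):                 # 10 timesteps -> traced graph of depth ~30
        target_pose = controller_policy(obs)              # policy code picks a set point
        obs, reward, done, info = env.step(target_pose)
        hints.append(info.get("feedback", ""))            # intermediate language feedback
        total_reward += reward
        if done:
            break
    # Episode-level feedback in text: success/return plus the accumulated language hints.
    return f"episode return = {total_reward:.2f}; " + " ".join(hints)

# Trace records every step of the episode in one computation graph, so propagating this
# feedback back through the graph amounts to back-propagation through time, and the
# optimizer rewrites controller_policy after each episode rather than after each step.
```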

The video shows how the robot agent performs on new configurations not seen during training. At iteration 0, the robot’s policy is initialized to stay at its initial position. At iteration 1, the robot learns to reach the goal but does not grasp the object, which leads to failure in this pick-and-place task. Starting from iteration 3, the robot learns to grasp the object but fails to place and drop it at the goal correctly; nonetheless, after dropping the object incorrectly, the robot attempts to pick it up and try again. This behavior continues until iteration 12. At iteration 13, the robot learns a generalizable policy that performs pick and place successfully.

Figure 3: Trace rapidly learns a robot controller in the Meta-World simulated environment that generalizes to new initial conditions. The video shows that Trace learns a policy to successfully perform the pick-and-place task after 13 episodes. From left to right: iteration 0, iteration 1, iteration 3, iteration 9, iteration 13.

The robot’s control policy is initialized to simply output a zero vector, which would make the robot stay at the initial configuration.
Initial control code
The control policy learned after 13 iterations is a complex decision logic, with many rules for deciding when to grasp, how to grasp, and when to release. This decision logic is never given to the robot and is learned through trial and error in the environment.
Learned control code after 13 episodes 

Figure 4. Trace adapts an initial dummy control policy into a complex, generalizable control policy.

Finale: Self-adapting multi-agent LLM systems

Trace is not limited to code optimization. The Trace framework supports optimizing heterogeneous parameters, including code, prompts, and hyperparameters. Here we demonstrate Trace’s ability to optimize the prompts of multiple LLM agents solving complex household tasks in the VirtualHome (opens in new tab) simulated environment.

Many tasks require multi-agent collaboration to solve efficiently, but crafting the right prompts for multiple LLM agents requires careful engineering. Trace can seamlessly optimize agents’ behaviors based on environmental feedback. It automatically constructs the interaction graph of the agents and updates each agent’s behavior while factoring in the behavior of the other agents. The agents can then automatically evolve to acquire specialized capabilities, such as behavioral roles, freeing system designers from the painstaking process of hand-tuning multiple LLM prompts.

We use Trace and OptoPrime to improve ReAct agents that have been carefully orchestrated (opens in new tab) to complete the VirtualHome tasks. In each step, an agent can interact with the environment (like opening a cabinet) or send a message to another agent when they see each other. We declare the plan of each LLM-based agent (a part of its prompt) as a trainable parameter and use reward as feedback. The experimental results are shown in Figure 5, where agents optimized by Trace complete the tasks using fewer actions and environment interactions. We also observed fascinating emergent pro-social behaviors from agents that were never explicitly told to communicate, as illustrated in Figure 6. This pro-social behavior changes with different tasks. For example, agents did not communicate with each other for the task of “book reading,” but they collaborated when asked to “put forks and plates into a dishwasher,” which we show in Figure 7. We also observed other patterns such as role specialization, where one agent took the lead in a given task and was followed by another agent assisting.
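As a rough illustration of what “the plan as a trainable parameter” means, the sketch below splices a per-agent plan string into an otherwise fixed ReAct-style prompt. The agent names, plan texts, and prompt template are assumptions for illustration; in Trace, each plan would be wrapped as a trainable node and rewritten by the optimizer using the episode reward as feedback.

```python
# Illustrative sketch: each agent's plan (a fragment of its prompt) is the trainable
# parameter. All names and texts below are assumptions for illustration.
agent_plans = {
    "Agent_1": "Find the plates and forks and hand them to the other agent.",
    "Agent_2": "Wait near the dishwasher and load whatever you receive.",
}

def build_prompt(agent_name: str) -> str:
    # The (trainable) plan is spliced into an otherwise fixed ReAct-style prompt.
    return (
        f"You are {agent_name} in VirtualHome.\n"
        f"Current plan: {agent_plans[agent_name]}\n"
        "At each step, think, then either act in the environment or send a message "
        "to an agent you can see."
    )

# After an episode, the reward (as text feedback) is propagated to both plans at once,
# so each plan is updated while factoring in the other agent's behavior; this is how
# role specialization and pro-social messaging can emerge without being hard-coded.
```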

The multi-agent system optimized by Trace requires fewer steps to complete each task (Read Book from 22 to 10 steps; Put Dishwasher from 21 to 19 steps; Prepare Food from 21 to 18 steps).
Figure 5: We show the number of environment interaction actions taken to succeed at each task. Trace-optimized agents take fewer steps to succeed and are thus more efficient in this environment.
The videos show example behaviors of the agents in the three tasks in VirtualHome.

Figure 6: Demo videos of how Trace agents behave to finish each of the three tasks.

[send_message]  to : I am handing you the . Please grab another piece of cutlery or plate to help! 
[send_message]  to : Can you also hand me the  you are holding?
[send_message]  to : Here's the . I'll go grab the  now. 
...
[send_message]  to : Let's head to the kitchen and put the  and  into the dishwasher.

Figure 7: Trace learns pro-social behavior in the Dishwasher task. Trace-optimized agents send messages to attempt to collaborate, while simple ReAct agents only carry out the tasks.

Trace heralds a new era of interactive agents that adapt automatically using various feedback types. This innovation could be the key to unlocking the full potential of AI systems, making them more efficient and responsive than ever before. After witnessing the awesome power of Deep Neural Networks, stay tuned for the next revolution in AI design — Deep Agent Networks!

The post Tracing the path to self-adapting AI agents appeared first on Microsoft Research.


Microsoft at ICML 2024: Innovations in machine learning


Microsoft at ICML 2024

In an era increasingly steered by data, machine learning is a pivotal force, transforming vast amounts of information into actionable intelligence with unprecedented speed and accuracy. For example, recent advances in machine learning have led to breakthroughs in precision health, helping doctors make more informed decisions about patient care. Similarly, in climate science, machine learning is improving scientists’ ability to predict and mitigate the impact of extreme weather events. These innovations illustrate that machine learning not only streamlines workflows, it also equips people with the tools to tackle some of today’s most pressing challenges with efficiency and innovation.

As the field continues to evolve, the International Conference on Machine Learning (ICML 2024) serves as a premier forum that showcases the latest breakthroughs and innovations, bringing together researchers, academics, and industry professionals from across the globe. Microsoft is proud to support ICML 2024 as a returning sponsor and is pleased to share that 68 papers by Microsoft researchers and their collaborators have been accepted this year, including four chosen for oral presentations.

This post highlights these presentations, each exploring machine learning’s potential to refine decision-making processes, improve automation, and model complex behaviors. A good example is NaturalSpeech 3, which introduces a new approach to speech synthesis that could transform how machines communicate. Together, these advances not only demonstrate the versatility and depth of machine learning applications, but also underscore an ongoing commitment to solving practical and theoretical challenges. Continue reading to discover more about this research and explore some of Microsoft’s contributions to ICML 2024.

Oral sessions

CompeteAI: Understanding the Competition Dynamics in Large Language Model-based Agents

Qinlin Zhao, Jindong Wang, Yixuan Zhang, Yiqiao Jin, Kaijie Zhu, Hao Chen, Xing Xie

This study aims to explore the possibilities of using LLM agents to help accelerate social science research. To that end, the authors propose a framework for studying agent competition by implementing a competitive environment, using GPT-4 to simulate a virtual town featuring restaurant and customer agents. Restaurant agents compete to attract customers, driving them to develop new operating strategies. Findings highlight phenomena such as social learning and the effect of accumulated advantage, aligning with existing sociological and economic theories. Further investigation into agent competition could enable a better understanding of society.

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao

This work introduces NaturalSpeech 3, a text-to-speech (TTS) system using novel factorized diffusion models for zero-shot speech generation. First, the research team developed a neural codec with factorized vector quantization (FVQ) to separate speech waveforms into content, prosody, timbre, and acoustic details. Second, the factorized diffusion model generates attributes in each subspace based on corresponding prompts. This divide-and-conquer approach allows NaturalSpeech 3 to model intricate speech effectively and efficiently. Experimental results show that NaturalSpeech 3 surpasses state-of-the-art TTS systems in quality, similarity, prosody, and intelligibility.

Position: Rethinking Post-Hoc Search-Based Neural Approaches for Solving Large-Scale Traveling Salesman Problems

Yifan Xia, Xianliang Yang, Zichuan Liu, Zhihao Liu, Lei Song, Jiang Bian

Recent advances in solving complex routing problems, like the traveling salesman problem (TSP), use a novel approach where machine learning (ML) models generate heatmaps to guide Monte Carlo tree search (MCTS) algorithms. These heatmaps indicate the likelihood of each route being part of the optimal solution. However, the authors’ analysis questions the effectiveness of ML-generated heatmaps. They found that a simple method often outperforms complex ML approaches. Additionally, the heatmap-guided MCTS is less effective than the traditional LKH-3 heuristic. The authors recommend that future research focus on better heatmap methods and more versatile ML approaches for combinatorial problems. 

PRISE: LLM-Style Sequence Compression for Learning Temporal Action Abstractions in Control

Ruijie Zheng, Ching-An Cheng, Hal Daumé III, Furong Huang, Andrey Kolobov

Temporal action abstractions promise more effective AI decision-making and data-efficient training of large robotic models. This work draws a novel analogy between temporal action abstraction and text tokenization—a seemingly unrelated sequential data compression mechanism in LLMs typically implemented using byte pair encoding (BPE). Based on this, the authors propose Primitive Sequence Encoding (PRISE), an approach that combines action quantization with BPE for skill learning for continuous control. Results show that high-level skills learned by PRISE from robotic manipulation demonstrations greatly improve behavior cloning performance in downstream tasks.


Discover more about our work and contributions to ICML 2024, including our full list of publications and sessions, on our conference webpage.


The post Microsoft at ICML 2024: Innovations in machine learning appeared first on Microsoft Research.


Abstracts: July 18, 2024


Microsoft Research Podcast - Abstracts

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Senior Researcher Arindam Mitra joins host Gretchen Huizinga to discuss “AgentInstruct: Toward Generative Teaching with Agentic Flows.” In their paper, Mitra and his coauthors introduce an automated multi-agent framework for creating diverse, high-quality synthetic data at scale for language model post-training. In contrast to methods that create data from a seed set of existing prompts and responses, AgentInstruct uses raw data and specifications provided by model builders. The work—which post-trains a model, Orca-3, on AgentInstruct-generated data—is part of project Orca. Orca aims to develop techniques for creating small language models that can perform as well as large language models. Like Orca-3, the earlier Orca, Orca-2, and Orca-Math models show the effectiveness of leveraging synthetic data in training. 

Transcript

[MUSIC PLAYS]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

I’m here today with Dr. Arindam Mitra, a senior researcher at Microsoft Research and the lead researcher for Microsoft’s Orca project. Dr. Mitra is coauthor of a paper called “AgentInstruct: Toward Generative Teaching with Agentic Flows.” Arindam, it’s a pleasure to have you on Abstracts today.

ARINDAM MITRA: Thank you, Gretchen.

HUIZINGA: So let’s start with a brief overview of your paper. What problem does your research address, and why does it matter?


MITRA: So the post-training phase is very important for language models. You can really improve the model a lot by creating high-quality synthetic data. The problem is, however, though, high-quality synthetic data creation requires lots of human effort and expertise. The problem that we’re trying to tackle is, how do you reduce human effort? How can you create high-quality data with really low amount of human effort? When you have a language model and, let’s say, you want to apply it somewhere, you might have to train a generic model before. Which could be small or big. Doesn’t matter. After that, you can specialize it on the domain that you are looking for, and when you want to do that—to make it really fast, this particular process—it’s best if you go for synthetic data. If you have a way to, actually, generate very high-quality synthetic data, you can fast-track this part of specialization process. Not only single model. So this year, you’re going to see a lot more multi-agent models. And when you are trying to build these multi-agent models, you’re fearing like, OK, it might increase the cost too much, the latency too much. So it’s also very much important that you have a multi-agent system and you can, sort of, replace some of those agents with specialized small models. And when you’re trying to address these goals, you want this process to be something which you know works fast. So that’s why we are trying to make sure we have a very good way to create synthetic data for your specific need.

HUIZINGA: No research exists in a vacuum, and most of it fills some kind of a gap. So tell us what’s already been done in this field and how this work is building on it.

MITRA: So previously, actually, we have seen that in post-training, the more data you have, the better the performance goes for the model you’re training. So what we wanted to test is how much we can scale and what happens if we scale a lot and lot. But we didn’t have the tools for it. So the other approaches people previously used was you had a small set of data and how do we expand this dataset into much larger and larger amount of data. That’s where people were mostly focusing. But it’s not that easy to create that initial seed set. [LAUGHTER] You need to be very expert. The way that we’re doing is, actually, rather you define what you want to create. Like, OK, you want to create tool-use data. So you say, OK, I have a bunch of tools, and I am looking for data in the scenarios where someone can just come give me a description and then maybe that person interact with the AI to figure out how to get the job done. It’s not a one-step thing. And maybe you also have a setting where it’s more like an app developer. You have a bunch of APIs in your phone. You just want to figure out which one is best for the user request, which came through voice command. So different scenarios could be there. So what we’re saying [is], OK, we are not going through the method where you have to come up with your initial own seed data and then we expand. It is more like you define what you want to do. It’s much more abstract. And then, we are, sort of, automating the effort of data creation. So this setting actually of synthetic data creation, we are referring [to] it as generative teaching, and that’s where we are, sort of, differing. So previously, it was more like expansion, and now we are trying from specification to the data that you need.

HUIZINGA: Gotcha. Well talk a little bit more about your methodology and how you went about conducting this research.

MITRA: So first of all, what we are proposing actually is a multi-agent solution. So you start with first describing what you really need. So you describe in detail, like, I need data for this specific skill or this specific scenario. Then, what we do is like, OK, you have some unstructured data or raw data like text documents or code files that you gather from web with permissible license or use something that you own. We don’t care much about what the content is really. So it’s more like we got some random stuff, some random content. And then we’ll guide you how to convert this random something which is not meaningful for you into something which is meaningful for your data creation. For example, like, if you are creating data to teach how to use APIs, you might think about, you need lots of APIs and how do you get these APIs. So what we are saying is, like, we can take something like code and we’ll have agents which will convert these raw code files into list of APIs which is more like a library. So you create automatically this input that is very meaningful for data creation. And then once we have that, we have basically the seed instruction creation step based on your specification. Like, what do you want to create data for? So you have all these different scenarios, and we have multiple agents creating data for different scenarios. And then the last step is actually what we call refinement step. So it’s more like whatever data you created, we’ll go through them and we’ll make them better and better—improve the quality, improve the complexity, improve the trickiness, we’ll teach when not to answer, etc., etc. So make sure we cover the whole space. So by changing the stochastic seed, we are trying to cover the entire possible data space.

HUIZINGA: Right.

MITRA: So that’s the key thing. The way we, sort of, conducted this research is actually we defined 17 skills. Skills meaning reading comprehension, tool use, text modification, content creation, RAG (retrieval-augmented generation) … we have, like, list of 17 skills … conversation … and then we created one multi-agent flow for each of the skills and we generate data. So one key thing I want to highlight is, like, this work, compared to other work, it was not benchmark driven. We want to teach a skill. We don’t care which benchmarks we’re trying to evaluate it on. So we define the skill, like tool use means this to us, reading comprehension means this to us, text modification means this to us. And then we, sort of, generate the data to teach everything for that skill. And then what we did, we created actually 22 million instructions. And we had previously in Orca series, we had 3 million, around, instructions. So the 25 million is what we, sort of, have at the end. And that’s where we actually trained a Mistral model as of now. And we’re going to measure, like, how much we improve the Mistral model by this post-training.

HUIZINGA: Moving from methods to findings, I always look forward to the part of the research paper that finishes the sentence “and what we found was … ,” so give us a quick overview of your results. What did you find?

MITRA: Yes, so the results were actually very exciting for us. So Mistral 7B was our main, sort of, baseline because that’s where we’re trying to showcase, like, how much improvement we are getting. On the other side, we have, like, frontier models—ChatGPT, GPT-4. We want to also measure how far we are from those frontier models, so that’s, sort of, our evaluation setup. So on average actually, we got like 20 percent performance gain over the Mistral, and we evaluated that across 14 benchmarks that test reasoning, content creation, instruction following, format following, etc. But what was more important to us was to do a skill-specific evaluation because we are trying to teach certain skills, and we had, like, 17 skills as we mentioned earlier. So, for example, like, if you are focusing on reading comprehension as a skill, we took LSAT, SAT, and DROP, and many other benchmarks; we created a collection of reading comprehension-based benchmark. And there, we are observing, like, 20 percent improvement over Mistral, and what it means, like, we’re actually achieving GPT-4–level performance. Similarly, if I’m focusing on math skill, there are many datasets which test, like, elementary math, high school math, college-level math. And we improved actually across all these different levels of math. So we see from 40 percent to 150 percent of improvement on different benchmarks of math. So it was more like what we wanted to see. We’re not optimizing for a particular benchmark. We wanted to optimize the skill, and that’s what you’re observing. So you’re observing improvement in math across all these levels, from elementary to high school to college to middle school, etc., everything. The same goes for RAG, as well. We’re observing on RAG skill 92 percent, around, improvement over Mistral. The format following numbers are pretty interesting to us. So format following is very important for SLMs (small language models). You want to make these models practical. You want to make sure that they follow the format so you can parse the result. And we were able to take Mistral beyond Gemini Pro. So that was a very strong performance from the post-training that we did. For summarization, actually we were able to reduce the hallucination rate by 31 percent while achieving the GPT-4–level quality. So overall, all these results were, sort of, highlighting that the methodology that we have, which we’re calling AgentInstruct, is very promising.

HUIZINGA: I think it’s important to get practical and talk about real-world impact. So tell us who you think this research will benefit most and why.

MITRA: Yeah, so again the model builders will, sort of, find it most beneficial. So the significance of our work actually lies in the way we are trying to revolutionize the language model development through scalable, low-effort synthetic creation. And the scalable and low effort is, sort of, the key thing, right. We have shown that we can create very high-quality data. That’s what the numbers are telling us. We want to mention that this is very scalable and low effort, and that’s what we think might help the most for model builders.

HUIZINGA: So, Arindam, let’s borrow a phrase from the machine learning lexicon and go for a little one-shot learning here: if you had to boil down why your work is important, what’s the one thing you want our listeners to take away from this research?

MITRA: The key takeaway would be, like, the AgentInstruct method enables the generation of vast, diverse, and high-quality synthetic data with very minimal human input. So that’s one thing I would, like, to remember from this paper.

HUIZINGA: So as we close, talk briefly about the limitations that you encountered in this project and directions for future research. What are the outstanding challenges in this field, and what’s on your research agenda to overcome them?

MITRA: Yes, so we’re exploring further automation. But apart from making this data creation more automated and less human involvement needed, we’re trying to focus on two other aspects. One is automated model debugging, and the other is automated model repairing. So now that we have the ability to generate data for a particular skill, let’s say math, for model debugging, what we need is basically an error handler. Like something we can plug in which takes the question and the answer coming from a different model and verifies if the answer is correct or not. So that’s the part we’re working on right now, figuring out this error handler. And the second aspect is repairing. So once we have the error, we figure out, OK, this is where the model is struggling. How can we give feedback or how can we give more knowledge so it can basically correct those errors? So those are some things we’re working on right now.

[MUSIC PLAYS]

HUIZINGA: Well, Arindam Mitra, thanks for joining us today, and to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/abstracts, or you can find a preprint on arXiv. See you next time on Abstracts!

[MUSIC FADES]

The post Abstracts: July 18, 2024 appeared first on Microsoft Research.


Research Focus: Week of July 15, 2024


Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Research Focus: July 15, 2024

MG-TSD: Advancing time series analysis with multi-granularity guided diffusion model

Diffusion probabilistic models have the capacity to generate high-fidelity samples for generative time series forecasting. However, they also present issues of instability due to their stochastic nature. In a recent article: MG-TSD: Advancing time series analysis with multi-granularity guided diffusion model, researchers from Microsoft present MG-TSD, a novel approach aimed at tackling this challenge.

The MG-TSD model employs multiple granularity levels within data to guide the learning process of diffusion models, yielding remarkable outcomes without the necessity of additional data. In the field of long-term forecasting, the researchers have established a new state-of-the-art methodology that demonstrates a notable relative improvement across six benchmarks, ranging from 4.7% to 35.8%.

The paper introducing this research, MG-TSD: Multi-Granularity Time Series Diffusion Models with Guided Learning Process (opens in new tab), was presented at ICLR 2024 (opens in new tab).


Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference

Machine learning applications based on large language models (LLMs) have been widely deployed in consumer products. Increasing the model size and its training dataset have played an important role in this process. Since larger model size can bring higher model accuracy, it is likely that future models will also grow in size, which vastly increases the computational and memory requirements of LLMs.

Mixture-of-Experts (MoE) architecture, which can increase model size without proportionally increasing computational requirements, was designed to address this challenge. Unfortunately, MoE’s high memory demands and dynamic activation of sparse experts restrict its applicability to real-world problems. Previous solutions that offload MoE’s memory-hungry expert parameters to central processing unit (CPU) memory fall short.

In a recent paper: Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference, researchers from Microsoft address these challenges using algorithm-system co-design. Pre-gated MoE alleviates the dynamic nature of sparse expert activation, addressing the large memory footprint of MoEs while also sustaining high performance. The researchers demonstrate that pre-gated MoE improves performance, reduces graphics processing unit (GPU) memory consumption, and maintains model quality.


What Matters in a Measure? A Perspective from Large-Scale Search Evaluation

Evaluation is a crucial aspect of information retrieval (IR) and has been thoroughly studied by academic and professional researchers for decades. Much of the research literature discusses techniques to produce a single number, reflecting the system’s performance: precision or cumulative gain, for example, or dozens of alternatives. Those techniques—metrics—are themselves evaluated, commonly by reference to sensitivity and validity.

To measure search in industry settings, many other aspects must be considered. For example, how much a metric costs; how robust it is to the happenstance of sampling; whether it is debuggable; and what is incentivized when a metric is taken as a goal. In a recent paper: What Matters in a Measure? A Perspective from Large-Scale Search Evaluation, researchers from Microsoft discuss what makes a search metric successful in large-scale settings, including factors which are not often canvassed in IR research, but which are important in “real-world” use. The researchers illustrate this discussion with examples from industrial settings and elsewhere and offer suggestions for metrics as part of a working system.


LordNet: An efficient neural network for learning to solve parametric partial differential equations without simulated data

Partial differential equations (PDEs) are ubiquitous in mathematically oriented scientific fields, such as physics and engineering. The ability to solve PDEs accurately and efficiently can empower deep understanding of the physical world. However, in many complex PDE systems, traditional solvers are too time-consuming. Recently, deep learning-based methods, including neural operators, have been successfully used to provide faster PDE solvers by approximating or enhancing conventional ones. However, this requires a large amount of simulated data, which can be costly to collect. This can be avoided by learning physics from the physics-constrained loss, also known as the mean squared residual (MSR) loss, constructed from the discretized PDE.

In a recent paper: LordNet: An efficient neural network for learning to solve parametric partial differential equations without simulated data, researchers from Microsoft investigate the physical information in the MSR loss, or long-range entanglements. They identify the challenge: the neural network must model the long-range entanglements in the spatial domain of the PDE, whose patterns vary. To tackle this challenge, they propose LordNet, a tunable and efficient neural network for modeling various entanglements. Their tests show that LordNet can be 40× faster than traditional PDE solvers. In addition, LordNet outperforms other modern neural network architectures in accuracy and efficiency with the smallest parameter size.


FXAM: A unified and fast interpretable model for predictive analytics

The generalized additive model (GAM) is a standard for interpretability. However, due to the one-to-many and many-to-one phenomena that commonly appear in real-world scenarios, existing GAMs have limitations in serving predictive analytics in terms of both accuracy and training efficiency. In a recent paper: FXAM: A unified and fast interpretable model for predictive analytics, researchers from Microsoft propose FXAM (Fast and eXplainable Additive Model), a unified and fast interpretable model for predictive analytics. FXAM extends GAM’s modeling capability with a unified additive model for numerical, categorical, and temporal features. FXAM conducts a novel training procedure called three-stage iteration (TSI), which corresponds to learning over numerical, categorical, and temporal features respectively. Each stage learns a local optimum by fixing the parameters of the other stages. The researchers design joint learning over categorical features and partial learning over temporal features to achieve high accuracy and training efficiency. They show that TSI is mathematically guaranteed to converge to the global optimum. They further propose a set of optimization techniques to speed up FXAM’s training algorithm to meet the needs of interactive analysis.

Microsoft Research in the news


Sriram Rajamani at Microsoft Research on AI and deep tech in India 

Forbes India | June 28, 2024

Sriram K Rajamani, managing director of Microsoft Research India Lab, reflects on computer science and engineering research, including how AI and LLMs can help solve local needs. Rajamani also discusses the technical aspects of how modern AI models work, and best practices from the research lab that could apply to India’s deep tech ecosystem.

The post Research Focus: Week of July 15, 2024 appeared first on Microsoft Research.


Data-driven model improves accuracy in predicting EV battery degradation


white icons symbolizing renewable electric energy on a blue and green gradient background

Rising carbon emissions have significantly challenged sustainable development in recent years, prompting global efforts to implement carbon reduction policies and achieve long-term carbon neutrality. A crucial step in this transition involves the recycling and reuse of power batteries, which are assessed for their state-of-health (SoH) and then repaired or restructured for reuse in smaller-sized electric vehicles (EVs), energy storage systems, and smart streetlights. This process not only extends battery life but also maximizes their residual value. However, accurately assessing this value is complex.  

To address this, Microsoft Research collaborated with Nissan Motor Corporation to develop a new machine learning method that predicts battery degradation with an average error rate of just 0.94%, significantly bolstering Nissan’s battery recycling efforts.

Approaching carbon neutrality, one step at a time 

Nissan, the company that launched the world’s first mass-produced electric vehicle, has long been committed to reducing carbon emissions. In 2021, Nissan announced its goal to achieve carbon neutrality by 2050 throughout the vehicle’s lifecycle. Central to this effort is the management and innovation of batteries, the key power source for electric vehicles, making battery recycling an important part of this initiative.

The graph overviews Nissan’s Challenges in Battery Eco-cycle Innovation. The image is segmented into four quadrants, each representing a crucial phase in the battery life cycle. The top left quadrant, “Data-driven chemistry design” and the top right, “Cell design optimization” are integral to the development phase of Battery DX. The bottom right quadrant focuses on “Battery diagnosis/prognosis” which is essential for Battery DX during its use. Lastly, the bottom left quadrant, “Material recycle” emphasizes the importance of recycling in the eco-cycle.
Figure 1. The challenges faced by Nissan in battery eco-cycle innovation

Atsushi Ohma, Expert Leader of the EV System Laboratory at Nissan, noted that EVs and their batteries currently have an average lifecycle of about 10 years, with approximately 50% of their CO2 emissions coming from the material mining and manufacturing process. Nissan aims to extend the lifecycle of EVs and batteries to more than 15 years, reducing CO2 emissions. To achieve this, the company hopes to leverage technologies like AI and big data to drive innovation in battery and electric vehicle development.

Flowchart showing the life cycle of electric vehicle (EV) batteries and their impact on CO2 emissions. It outlines stages such as raw material mining, battery production, usage in vehicles, and recycling/repurposing processes. The chart shows that about 50% of life cycle CO2 emissions are from raw material mining and battery production, and emphasizes that Nissan aims to extend the lifespan of electric vehicles and batteries by 15 or 20 years to reduce CO2 emissions.
Figure 2. Vision for reducing CO2 in the EV lifecycle

Collaborating to reduce CO2 in the EV lifecycle

Since Microsoft announced its sustainability commitments and outlined plans to work toward a more sustainable future in 2020, the team at Microsoft Research Asia has been actively engaged in addressing sustainability challenges through interdisciplinary research, collaborating with partners from related fields. The team has already developed BatteryML, an open-source machine learning tool for advancing battery research, and is working on methods to predict battery health and remaining service life. This makes the collaboration between Microsoft Research Asia and Nissan a natural one. Together, the joint team aims to achieve carbon neutrality and enhance lithium-ion battery performance prediction by focusing on battery performance degradation. 

photo of Atsushi Ohma, Expert Leader, EV System Laboratory, Research Division, Nissan

“Through our collaboration with Microsoft Research Asia, we are innovating battery degradation prediction methods to enhance the effectiveness of battery recycling and promote resource reuse. This is a pivotal step in our journey towards achieving long-term carbon neutrality. We call it ‘thinking big and starting with small steps.’”

Atsushi Ohma, Expert Leader, EV System Laboratory, Research Division, Nissan

Enhancing battery predictions with speed and accuracy

Understanding the SoH of batteries is crucial for efficient battery recycling. Usable capacity alone does not fully represent SoH; more important factors include the integrity of the battery’s chemistry over its life, such as the levels of lithium, cobalt, and nickel. Traditionally, battery degradation prediction relies on mathematical models based on chemical, electrochemical, and mechanical principles. This method requires continuous experimentation to adjust parameters, involving lengthy processes like battery disassembly and analysis, which can take six months to a year. Additionally, further experimentation and parameterization are needed whenever the chemistry changes. To address this, Nissan aims to apply machine learning to predict battery health based on external signals, minimizing the need for extensive physical testing.

However, there are two main challenges to using machine learning to predict battery performance. First, it’s difficult to gather sufficient data due to the lengthy charging and discharging cycles. Second, because batteries operate under varying conditions, signal acquisition is complicated. Additionally, external environmental factors can influence battery capacity without directly reflecting its health status.

To filter out this “noise” and identify patterns that accurately reflect the battery’s internal condition, researchers have developed specialized features to analyze how the internal chemistry of lithium-ion batteries changes under different voltage and current conditions. By integrating these key features with real Nissan data, researchers improved the prediction accuracy of their machine learning models. 

photo of Shun Zheng, Senior Researcher, Microsoft Research Asia

“We found differences between academic public datasets and real-world corporate data. Models built on academic datasets are difficult to apply in enterprise settings due to variations in data patterns, testing conditions, and prediction goals. Developing broadly applicable models for industry requires integrating proprietary enterprise data with advanced AI technologies.”

Shun Zheng, Senior Researcher, Microsoft Research Asia

Data-driven model boosts accuracy by 80% in simulations

The machine learning methodology redefines the entire feature space to provide a comprehensive understanding of battery degradation. Advanced feature engineering analyzes diversified features derived from degradation patterns in voltage-capacity curves during charging and discharging cycles, as illustrated in Figure 3. Researchers focused on distinguishing information between high and low voltage intervals, including first-order and higher-order differences as effective indicators of battery health, enhancing predictive power and providing deep insights into battery performance and longevity.

The graph depicts discharge capacity of a battery cell at a specific voltage during the 50th cycle. The x-axis is labeled “Capacity [mAh/g]” and the y-axis “Cell Voltage [V]”. A descending line graph illustrates the relationship between cell voltage and capacity, with a highlighted point “Q^d (Vx)” representing the discharge capacity at that voltage during the 50th cycle. The accompanying text shows that this method is more accurate than using “Var(Δ_x-0 * Q^d)”.
Figure 3. Feature engineering, demonstrating the variation of voltage with respect to discharge capacity.
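As a concrete illustration of this kind of feature engineering, the sketch below interpolates the discharge capacity at a few fixed voltages, takes first-order differences across the voltage grid, and feeds the result to a simple regressor. The column names, voltage grid, and choice of model are assumptions for illustration and are not the actual pipeline used in the collaboration.

```python
# Illustrative sketch of the voltage-capacity feature engineering described above.
# Column names, the voltage grid, and the Ridge regressor are assumptions, not the
# actual Nissan/Microsoft pipeline.
import numpy as np
from sklearn.linear_model import Ridge

def qd_at_voltages(voltage, capacity, grid=(3.6, 3.9, 4.1)):
    """Interpolate the discharge capacity Q^d(V_x) at a few fixed voltages V_x."""
    voltage, capacity = np.asarray(voltage), np.asarray(capacity)
    order = np.argsort(voltage)                     # np.interp needs increasing x
    return np.interp(grid, voltage[order], capacity[order])

def cycle_features(discharge_curve):
    """Features from one cycle's discharge curve: Q^d(V_x) values plus their
    first-order differences between adjacent voltage intervals."""
    q = qd_at_voltages(discharge_curve["voltage"], discharge_curve["capacity"])
    return np.concatenate([q, np.diff(q)])

# X: features built from early cycles (e.g., up to cycle 50) for each cell;
# y: measured SoH at cycle 200. A simple regressor then predicts SoH:
# model = Ridge().fit(X_train, y_train)
# mae = np.mean(np.abs(model.predict(X_test) - y_test))
```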

Compared with popular state-of-the-art battery prediction methods, this data-driven model improves accuracy by approximately 80% with Nissan’s simulation data and by over 30% with real-world experimental data. The new method has achieved a mean absolute error (MAE) of 0.0094 in predicting SoH at the 200th cycle using data from only the first 50 cycles, as shown in Figure 4.

This demonstrates that the new data-driven model is not only more accurate but also more efficient in predicting a battery’s SoH compared with existing methods. It requires less data and fewer cycles to make precise predictions, offering significant advantages for battery health monitoring and management.

Four graphs. The top left graph is a scatter plot with blue and red dots representing “Train” and “Test” data sets, respectively, showing a strong correlation between prediction and experiment. The top right graph displays two bar graphs for Mean Absolute Error (MAE) with values for “Train” and “Test”, the MAE of “Train” is 0.0077, the MAE of “Test” is 0.0094. Below is a box plot labeled “TEST MAE” across different “Qd (V)x” values, indicating the model’s accuracy at various stages. The image demonstrates the model’s effectiveness in predicting battery performance.
Figure 4. The test set achieves an MAE of 0.0094 in predicting SoH at the 200th cycle using Qd(V) measured at the 50th cycle.

By employing the data-driven method, researchers discovered that the indicated feature at 3.9 volts can be interpreted as the nickel manganese cobalt oxide (NMC) crystalline structure (M->H2). This finding aligns with electrochemical research and highlights that the features identified through our data-driven approach have significant real-world implications for understanding battery degradation.

photo of Jungwon Moon, Engineer, EV System Laboratory Research Division, Nissan

“This research extends the lifespan of power batteries in two ways: first, by improving reuse potential and accurately determining their remaining lifespan; and second, by developing effective recycling strategies for retired batteries. The unique approach of our joint research was to predict not only cell SoH but also cathode (NMC) SoH to improve the reliability of the cell SoH prediction model. It was surprising that the high sensitivity to certain voltages (3.9V) indicated by the data-driven cathode (NMC) SoH prediction model aligns with results from the physics-based method. Collaboration with Microsoft Research Asia has demonstrated that AI can be applied to battery manufacturing, including material selection and process optimization.”

Jungwon Moon, Engineer, EV System Laboratory Research Division, Nissan

Looking ahead: Exploring AI’s sustainability applications

The collaboration between Nissan and Microsoft Research Asia highlights the potential of AI technologies, including machine learning and deep learning, in the EV sector. Beyond predicting battery health for recycling, AI can optimize the driving experience by accurately predicting battery life and enabling smarter driving. Additionally, AI holds promise for discovering new materials and driving innovation in battery and EV technology.

photo of Jiang Bian, Senior Principal Researcher, Microsoft Research Asia

“There are existing issues with lithium batteries. We need batteries with high energy density, good safety, a long lifecycle, and with a minimal environmental impact. Through our collaboration with Nissan, we have learned that AI has great potential in the EV, including optimizing battery material combinations to improve performance, discovering new materials, and optimizing battery electrode processes. In the future, we hope to collaborate with more industry partners to further explore AI’s potential in various industrial applications.”

Jiang Bian, Senior Principal Researcher, Microsoft Research Asia

Building on their initial results, Nissan and Microsoft Research Asia plan to expand their collaboration to further advance technology and accelerate progress toward sustainable development and environmental protection goals.

Seven people posed for a group photo in front of the wall banner of Microsoft Research Asia when Atsushi Ohma visited in June 2024.
Figure 5. Atsushi Ohma from Nissan, center, visited Microsoft Research Asia in June 2024

The post Data-driven model improves accuracy in predicting EV battery degradation appeared first on Microsoft Research.


RUBICON: Evaluating conversations between humans and AI systems


This paper has been accepted at the 1st ACM International Conference on AI-powered Software (opens in new tab) (AIware 2024), co-located with FSE 2024 (opens in new tab). AIware is the premier international forum on AI-powered software.

RUBICON paper at AIware 2024

Generative AI has redefined the landscape of AI assistants in software development, with innovations like GitHub Copilot providing real-time, chat-based programming support. As these tools increase in sophistication and domain specialization, assessing their impact on user interactions becomes more challenging. Developers frequently question whether modifications to their AI assistants genuinely improve the user experience, as indicated in a recent paper.

Traditional feedback mechanisms, such as simple thumbs-up or thumbs-down ratings, fall short in capturing the complexities of interactions within specialized settings, where nuanced data is often sparse. To address this issue, we introduce “RUBICON: Rubric-based Evaluation of Domain-Specific Human-AI Conversations,” presented at AIware 2024. RUBICON is an automated assessment technique that transforms a minimal dataset into an extensive array of domain-specific rubrics, helping ensure that updates not only modify but meaningfully improve user interactions.

Foundational communication principles

Effective conversation, whether human-to-human or human-to-AI, adheres to four maxims (opens in new tab) outlined by philosopher Paul Grice: quantity, quality, relation, and manner, ensuring that communication is concise, truthful, pertinent, and clear. In AI applications, they help create interactions that feel natural and engaging, fostering trust and empathy. Within domain-specific settings, RUBICON adapts these principles to ensure they are context-aware, improving the utility and clarity of interactions. For example, in Visual Studio, the AI helps the developer debug a program by providing detailed explanations and relevant code examples, shown in Figure 1. In Figure 2, its responses reflect that it’s guided by context.

In the image, we see two Human-AI debugging conversations side by side, both working on the same task but with different AI assistants. On the left side, the assistant suggests using an if-else block to catch and throw an exception. The user responds that they do not want to throw any exceptions. The assistant then proposes a try-catch block instead. The user ends the conversation by asking how to prevent the exception from occurring in the first place. The assistant makes assumptions without clarifying details about the scenario, leading to a superficial and unusable fix. On the right side, the assistant starts by asking the user to check a variable's value at a specific state. The user replies that the variable is empty. The assistant then forms a hypothesis and requests the relevant code file from the user. After receiving the code, the assistant provides a simple fix. The user ends the conversation by confirming that the solution worked. Here, the assistant actively investigates the error, collaborates with the user to gather information, and delivers a practical solution.
Figure 1. Contrasting interactions with two versions of the Visual Studio Debugging Assistant for the same task. On the left, the assistant makes assumptions without seeking clarification. On the right, the assistant proactively investigates the error, collaborates with the developer to gather essential information, and achieves a practical solution.
In the image, there are two sample initial responses to the same task by different debugging assistants, shown side by side. On the left, the assistant merely reiterates the meaning of the exception message and gives generic advice, such as asking the user to check why the serialization failed. On the right, the assistant identifies the probable source of the error, points out the specific method to the user, and requests the user to provide the code for that method.
Figure 2. Context awareness significantly improves the AI assistant’s efficacy. The response on the left is generic, superficially referring to the developer’s code and restating the obvious, providing little value. The reply on the right directs the developer toward a specific solution, the toJSON method.

In task-oriented environments, it’s important to assess how well a conversation aligns with user expectations and assists in achieving their goals. Conversations are only useful if they advance the user’s interests, and challenges can arise when users have misaligned expectations of the AI’s capabilities or when the AI directs the conversation too forcefully, prioritizing its methods over the user’s preferences. RUBICON balances the interaction dynamics between the AI and developer, promoting constructive exchanges without overwhelming or under-engaging. It calibrates the extent to which the AI should hypothesize and resolve issues versus how much it should leave to the developer.


RUBICON’s rubric-based method and evaluation

RUBICON builds on the foundational work of SPUR, the recently introduced Supervised Prompting for User Satisfaction Rubrics framework, increasing its scope and crafting a broad spectrum of potential rubrics from each batch of data. It uses a language model to create concise summaries that assess the quality of conversations, emphasizing communication principles, task orientation, and domain specificity. It identifies signals of user satisfaction and outlines the shared responsibilities of the user and the AI in achieving task objectives. These summaries are then refined into rubrics.

RUBICON’s novel selection algorithm sifts through numerous candidates to identify a select group of high-quality rubrics, enhancing their predictive accuracy in practical applications, as illustrated in Figure 3. The technique doesn’t require human intervention and can be trained directly on anonymized conversational data, helping to ensure customer data privacy while still extracting the important features for analysis.

The image contains three graphics. On the left is a bad Human-AI debugging conversation, and on the right is a good one. The center graphic lists sample rubrics generated by RUBICON from events of goodness/badness from both the conversations. Arrows connect specific events in the conversations to the corresponding rubric. For example, one arrow starts from the part of the right conversation where the assistant provides a ready-to-use code snippet to solve the bug, ending at the rubric, “The assistant provides a code snippet to illustrate the solution, aiding the user in implementing the fix.”
Figure 3. Overview of RUBICON’s framework and the various steps involved.
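One way to picture the selection step is as a search for the small set of rubrics whose scores best separate positive from negative conversations in the labeled data. The greedy sketch below illustrates that idea; the scoring callback, data layout, and greedy policy are assumptions rather than RUBICON’s actual selection algorithm.

```python
# Simplified greedy sketch of rubric selection: keep the rubrics whose scores best
# predict human satisfaction labels on a small labeled set. This illustrates the
# idea only; it is not RUBICON's actual selection algorithm.
from typing import Callable

def greedy_select(rubrics: list[str],
                  score: Callable[[str, dict], float],  # LLM-assigned score of a rubric on a conversation
                  conversations: list[dict],            # each dict has a "label": 1 (positive) or 0 (negative)
                  k: int = 10) -> list[str]:
    selected: list[str] = []
    for _ in range(min(k, len(rubrics))):
        best, best_acc = None, -1.0
        for rubric in rubrics:
            if rubric in selected:
                continue
            candidate = selected + [rubric]
            correct = 0
            for conv in conversations:
                # Classify a conversation as positive if its mean rubric score clears a threshold.
                mean_score = sum(score(r, conv) for r in candidate) / len(candidate)
                correct += int((mean_score >= 0.5) == bool(conv["label"]))
            accuracy = correct / len(conversations)
            if accuracy > best_acc:
                best, best_acc = rubric, accuracy
        if best is None:
            break
        selected.append(best)
    return selected
```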

The effectiveness of RUBICON’s method is evidenced by its rubrics, which show an 18% increase in accuracy over SPUR in classifying conversations as positive or negative, as shown in Figure 4. Additionally, RUBICON achieves near-perfect precision in predicting conversation labels in 84% of cases involving unlabeled data.

The image depicts a workflow illustrating the RUBICON technique. It begins with a set of conversations, from which signals indicating conversation quality are extracted. An LLM then analyzes these signals, reasoning about why they occurred, using domain-specific insights and understanding of the user-assistant interaction. Another LLM summarizes these reasonings into a rubric pool, applying Gricean maxims to evaluate conversational situations. Finally, RUBICON’s novel selection policy algorithm selects the top-performing rubric from this pool.
Figure 4. Two analogous conversations facilitated by the Debugger AI assistant are evaluated against representative rubrics. Software engineers who evaluated the conversations found the one on the left less effective and the one on the right more so. RUBICON’s rubric also gave a higher score to the conversation on the right, demonstrating that RUBICON’s method of evaluation is consistent with that of the software engineers.

RUBICON-generated rubrics 

RUBICON-generated rubrics serve as a framework for understanding user needs, expectations, and conversational norms. These rubrics have been successfully implemented in the Visual Studio IDE, where they have guided the analysis of over 12,000 debugging conversations, offering valuable insights into the effectiveness of modifications made to the assistant and facilitating rapid iteration and improvement. For example, rubrics such as “The AI gave a solution too quickly, rather than asking the user for more information and trying to find the root cause of the issue” or “The AI gave a mostly surface-level solution to the problem” have indicated issues where the assistant prematurely offered solutions without gathering sufficient information. These findings led to adjustments in the AI’s behavior, making it more investigative and collaborative.

Beyond conversational dynamics, the rubrics also identify systemic design flaws not directly tied to the conversational assistant. These include user interface issues that impede the integration of new code and gaps in user education regarding the assistant’s capabilities. To use RUBICON, developers need a small set of labeled conversations from their AI assistant and specifically designed prompts that reflect the criteria for task progression and completion. The methodology and examples of these rubrics are detailed in the paper.

Implications and looking ahead

Developers of AI assistants value clear insights into the performance of their interfaces. RUBICON represents a valuable step toward developing a refined evaluation system that is sensitive to domain-specific tasks, adaptable to changing usage patterns, efficient, easy to implement, and privacy-conscious. A robust evaluation system like RUBICON can help improve the quality of these tools without compromising user privacy or data security. Looking ahead, our goal is to broaden the applicability of RUBICON beyond debugging in AI assistants like GitHub Copilot. We aim to support additional tasks, such as migration and scaffolding within IDEs, and to extend its utility to other chat-based Copilot experiences across various products.

The post RUBICON: Evaluating conversations between humans and AI systems appeared first on Microsoft Research.

Read More

Collaborators: Sustainable electronics with Jake Smith and Aniruddh Vashisth

Collaborators: Sustainable electronics with Jake Smith and Aniruddh Vashisth

photos of Jake Smith and Aniruddh Vashisth for the Microsoft Research Collaborators podcast

Transforming research ideas into meaningful impact is no small feat. It often requires the knowledge and experience of individuals from across disciplines and institutions. Collaborators, a Microsoft Research Podcast series, explores the relationships—both expected and unexpected—behind the projects, products, and services being pursued and delivered by researchers at Microsoft and the diverse range of people they’re teaming up with.

Printed circuit boards (PCBs) are abundant—in the items we use daily and then in landfills when they’ve reached end of life. In this episode, Senior Researcher Jake Smith (opens in new tab) and Aniruddh Vashisth (opens in new tab), assistant professor of mechanical engineering at the University of Washington, join host Gretchen Huizinga to talk about the development of vitrimer-based PCBs, or vPCBs, that perform comparably to traditional circuit boards but have less environmental impact. Smith and Vashisth explore machine learning’s role in accelerating the discovery of more sustainable materials and what the more healable vitrimer polymer could mean not only for e-waste but more broadly for aerospace, the automotive industry, and beyond.

Transcript

[TEASER] [MUSIC PLAYS UNDER DIALOGUE]

ANIRUDDH VASHISTH: From the computation point of view, we always thought that if somebody gave us, like, a hundred different chemistries, we can do a bunch of simulations; tell you, like, 10 of these actually work. What we’ve been able to do specifically for vitrimers is that we’re able to look at the problem from the other side, and we are able to say that if you tell me a particular application, this particular chemistry would work best for you. In essence, what we were thinking of is that if aliens abducted all the chemists from the world, can we actually come up with a framework? [LAUGHTER]

JAKE SMITH: If all of this work is successful, in 10 years, maybe our materials design process looks completely different, where we’ve gone from this kind of brute-force screening to an approach where you start with the properties that you care about—they’re defined by the application that you have in mind—and we use this, like, “need space” to define the material that we would like, and we can use machine learning, artificial intelligence, in order to get us to the structure that we need to make in order to actually achieve this design space.

[TEASER ENDS]

GRETCHEN HUIZINGA: You’re listening to Collaborators, a Microsoft Research Podcast showcasing the range of expertise that goes into transforming mind-blowing ideas into world-changing technologies. I’m Dr. Gretchen Huizinga.

[MUSIC FADES]


I’m thrilled to be in the booth today, IRL, with Dr. Jake Smith, a senior researcher at Microsoft Research and part of the Microsoft Climate Research Initiative, or MCRI. And with him is Dr. Aniruddh Vashisth. He’s an assistant professor of mechanical engineering at the University of Washington and director of the Vashisth Research Lab. Jake and Aniruddh are working on a project that uses machine learning to help scientists design sustainable polymers with a particularly exciting application in the field of the ubiquitous printed circuit board, or PCB. But before we get all sustainable, let’s meet our collaborators!

Jake, I’ll start with you. You’re a self-described “chemist with relatively broad interests across applications” and you’ve done some pretty cool things in your career. Tell us about those interests and where they’ve led you and how they’ve contributed to the work you’re doing now in MCRI, or the Microsoft Climate Research Initiative.

JAKE SMITH: Yes. Thank you very much for having me. So I started, like most chemists, poking things around in the lab and learning really fundamentally about how atoms interact with one another and how this affects what we do or what we see at our microscopic level. And so after I left grad school doing this super-basic research, I wanted to do something more applied, and so I did a couple of postdocs, first, looking at how we can more effectively modify proteins after we’ve synthesized them so they might have a property that we care about and then later doing similar work on small molecules in a more traditional drug-design sense. But after I finished that, I wound up here at Microsoft. We were very interested in one molecule in particular, one family of molecules, which is DNA, and we wanted to know, how do we make DNA at just gigantic scale so that we can take that DNA and we could store digital data in it? And because DNA has this nice property that it kind of lasts forever, …

HUIZINGA: Yeah.

SMITH: … at least on our, you know, human scale, it makes a very, you know, nice archival storage medium. So we worked on this project for a while, and at some point, we determined we can, kind of, watch it blossom and find the next challenge to go work on.

HUIZINGA: Interesting …

SMITH: And the challenge that we, you know, wound up at I’ll describe as the Microsoft Climate Research Initiative, the MCRI. We were a group of applied scientists from, like, natural scientist backgrounds within Microsoft, and we said, how can we make a difference for Microsoft? And the difference that we thought was Microsoft has climate goals.

HUIZINGA: Oh, yeah!

SMITH: Microsoft wants to be carbon negative, it wants to be water positive, and it wants to be zero waste. And in order to make this happen, we need novel materials, which really are a macroscopic view of, once again, atomic behavior. And we said, hey, we understand atomic behavior. We’re interested in this.

HUIZINGA: [LAUGHS] We can help! We’re from the government …

SMITH: Yeah, maybe this is something we could help on. Yeah. And so here we are. We wound up with Aniruddh, and we’ll go into that later, I’m sure.

HUIZINGA: Yeah, yeah. So just quickly back to the DNA thing. Was that another collaboration? I had Karin Strauss on the podcast a while ago, and she talked about that.

SMITH: Oh, absolutely. Yeah, this was with Karin, and we had great collaborators, also at the University of Washington in the Molecular Information Systems Lab, or MISL, who did a lot of work with us on the practicalities of working with DNA once it’s synthesized and how would you do things like retrieve information from a big pool of DNA.

HUIZINGA: Right. Right. They could … people could go back to that podcast because she does unpack that quite a bit. Well, Aniruddh, you describe yourself as a “trained mechanician who hangs out with chemists,” hence your friendship with Jake here, but for your day job, you’re a professor and you have your own lab that conducts interdisciplinary research at the intersection, as you say, of mechanics and material science. So what made you want to move to that neighborhood, and what goes on there?

ANIRUDDH VASHISTH: Yeah. Well, again, thank you so much for having me here. I’m super excited about this. Yeah, just a little bit of background about me. So I started off with my undergrad in civil and mechanics from IIT BHU, did a PhD in mechanics at Penn State, and moved to Texas …

HUIZINGA: Go back … go back to, what’s the first one?

VASHISTH: It’s Indian Institute of Technology, in India, so that’s …

HUIZINGA: IIT …

VASHISTH: … IIT. I did my undergrad there and then straight away came to the US to do my PhD in mechanics at Penn State and then ended up going to Texas, to Texas A&M University, and postdoc-ed in a chemical engineering lab, and that’s how I became, like, super familiar and fond of chemical engineers and chemists! [LAUGHTER] And we moved to Seattle, when I got the job at University of Washington in 2021, with my wife and my daughter. And what we do in our lab is we make and break things now! [LAUGHS] We try to see, like, you know, when we are making and breaking these things, we try to see them from an experimental and a simulation point of view and try to gain some understanding of the mechanics of these different types of materials. Especially, we are very interested in polymers. I always joke with my students and my class that go about one day without touching a polymer, and I’m always surprised by the smiles or the smirks that I get! But in general, like, we have been super, super excited and interested about sustainable polymers, making sustainable composites. Particularly, we are very excited and interested in vitrimer polymers. So let me just take, like, a step back. I’ll probably wear my professor hat straight away here.

HUIZINGA: Yeah. Let’s do! Let’s go. [LAUGHTER]

VASHISTH: And I’ll tell you, just, like, taking a step back, what are the different types of polymers. So in general, you can think of polymers as thermosets or thermoplastics. So to Jake’s point, let’s just go to the molecular scale there, and you can think of polymers as bunch of these pasta noodles which can slide over each other, right. Or these bunch of pasta noodles which are packed together. So thermoset, as the name suggests, it’s a set network. The pasta noodles are kind of, like, set in their place. Thermoplastics is when these pasta noodles can slide over each other. So you’ve probably put too much sauce in there! [LAUGHTER] Yeah, so a good analogy there would be a lot of the adhesives that we use are thermosets because they set after a while. Thermoplastic … we use plastics for 3D printing a lot, so those are thermoplastics. So they’re solid. You can heat them up, you can make them flow, print something, and they solidify. Vitrimers are very exciting because, just like thermoplastics, they have this flowability associated to them but more at a molecular scale. Like, if you think of a single pasta noodle, it can unclick and re-click back again. So it’s like, you know, it’s made up of these small LEGO blocks that can unclick and re-click back again …

HUIZINGA: LEGO pasta …

VASHISTH: LEGO pasta …

HUIZINGA: I like that! [LAUGHS]

VASHISTH: Exactly. So this unclicking and re-clicking can make them re-processable, reusable, recyclable. Gives them, like, much longer life because you can heal them. And then vitrimers basically become the vampires of the polymer universe!

HUIZINGA: Meaning they don’t die?

VASHISTH: Well …

HUIZINGA: Or …

VASHISTH: They have like much longer life! [LAUGHTER]

SMITH: They sleep every now and then to regenerate! Yes … [LAUGHS]

HUIZINGA: Aniruddh, sticking with you for a minute, before we get into the collaboration, let’s do a quick level set on what we might call “The Secret Life of Circuit Boards.” For this, I’d like you to channel David Attenborough and narrate this PCB documentary. Where do we find printed circuit boards in their natural habitat? How many species are there? What do they do during the day? How long do they live? And what happens when they die?

VASHISTH: OK, so do I have to speak like David … ?

HUIZINGA: Yes, I’d appreciate it if you’d try. [LAUGHTER] … No. Just be your voice.

VASHISTH: Yeah. Yeah. So PCBs are, if you think about it, they are everywhere. PCBs are in these laptops that we have in front of us. Probably there are PCBs in these mics. Automobiles. Medical devices. So PCBs are, they’re just, like, everywhere. And depending upon, like, what is their end applications, they have a composite part of it, where you have, like, some sort of a stiff inclusion in a polymeric matrix, which is holding this part together and has bunch of electronics on top of it. And depending on the end application, it might come in different flavors: something that can sustain much higher temperatures; something which is flexible. Things of that sort. And they live as long as we use the material for, like, you know, as long as we are using these laptops or as long as we end up using our cars. And unfortunately, there is a lot of e-waste which is created at the end.

HUIZINGA: Right …

VASHISTH: There’s been a lot of effort in recycling and reusing these materials, but I’m confident we can do more.

HUIZINGA: Right.

VASHISTH: I think there’s like close to 50 million metric tons of …

HUIZINGA: Wow!

VASHISTH: … of e-waste which is generated—more than that actually—every year, so …

HUIZINGA: OK.

VASHISTH: … a lot of scope for us to work there.

HUIZINGA: Um, so right now, are they sort of uniform? The printed circuit board? I know we’re going to talk about vitrimer-based ones, but I mean, other than that, are there already multiple materials used for these PCBs? Jake, you can even address that.

SMITH: Yeah. Of course. So there are, like, kind of, graded ranks of circuit board materials …

HUIZINGA: OK.

SMITH: … that as Aniruddh said, you know, might be for specialty applications where you need higher-temperature tolerance than normal or you need lower noise out of your circuit board.

HUIZINGA: Gotcha.

SMITH: But, kind of, the bog-standard circuit board, the green one that you think about if you’ve ever seen a circuit board, this is like anti-flammability coating on a material called FR-4. So FR-4—which is an industrial name for a class of polymers that are flame-retardant, thus FR, and 4 gives you the general class—this is the circuit board material …

HUIZINGA: OK …

SMITH: … that, you know, we really targeted with this effort.

HUIZINGA: Interesting. So, Jake, let’s zoom out for a minute and talk about the big picture and why this is interesting to Microsoft Research. I keep hearing two phrases: sustainable electronics and a circular economy. So talk about how the one feeds into the other and what an ultimate success story would look like here.

SMITH: Yeah, absolutely. So I’ll start with the latter. When we set out to start the Microsoft Climate Research Initiative, we started with this vision of a circular economy that would do things that avoid what we, you know, can avoid using. But there are many cases where you can’t avoid using something that is nonrenewable. And there, what we really want to do is we want to recapture what we can’t avoid. And this project, you know, falls in the latter. There’s a lot of things that fall in the latter case. So, you know, we were looking at this at a very carbon dioxide-centric viewpoint where CO2 is ultimately the thing that we’re thinking about in the circle, although you can draw a circular economy diagram with a lot of things in the circle. But from the CO2 viewpoint, you know, what led us to this project with Aniruddh is we thought, we need to capture CO2, but once you capture CO2, you know, what do you do with it? [LAUGHTER] You can pump some of it back into the ground, but this is, you know, an economically non-productive activity. And so it’s something we have to do. It’s not something we want to do.

HUIZINGA: Right.

SMITH: And so what could we want to do with the CO2 that we’ve captured? And the thought was we do something economically viable with it. We, you know, upcycle the CO2 into something interesting, and what we really want, and what we still really want, is to be able to take that CO2, convert it down into a useful chemical feedstock—and there are great laboratories …

HUIZINGA: Oh, interesting …

SMITH: … doing work on this—and then we could, you know, look at our plastic design problem and say, hey, we have all this FR-4 in the world. How could we replace the FR-4—the, you know, explicit atoms that are in the FR-4—with atoms that have come from CO2 that we pulled out of the air? And so this is, you know, the circular economy portion. We come down to, you know, the specific problem here. Aniruddh talked a lot about e-waste.

HUIZINGA: Yeah.

SMITH: And I have great colleagues who also collaborated with us on this project—Bichlien Nguyen, Kali Frost—who have been doing work with our product teams here at Microsoft on, you know, what can we do to reduce the amount of e-waste that they put out towards Microsoft’s climate goals?

HUIZINGA: Right.

SMITH: And Microsoft, as a producer of consumer electronics and a consumer of, you know, industrial electronics, has a big e-waste problem itself that we need to, you know, actually take research steps in order to ultimately address, and so what we thought was, you know, we have this end-of-life electronic. We can do things like desolder the components. We can recapture those ICs, which have a lot of embedded carbon in them in the silicon that’s actually there. We can take and we can etch out the copper that has been put over this to form the traces, and we can precipitate out that electrochemically to recapture the copper, but at the end of the day, we’re left with this big chunk of plastic, and it’s got some glass inside of it, too, for completeness sake, and the thought was, you know, how do we do this? You can’t recapture this with FR-4. FR-4, to go back to the spaghetti thing, …

HUIZINGA: Right … [LAUGHS]

SMITH: … spaghetti is glued to itself. It doesn’t come apart. It rips apart if you try and take it apart. And so we wanted to say, you know, what could we do and, you know, what could we do with Aniruddh and his lab in order to get at this problem and to get us at a FR-4 replacement that we could actually reach this complete circularity with.

HUIZINGA: Interesting! Well, Jake, that is an absolutely perfect segue into “how I met your mother,” which is, you know, how you all started working together. Who thought of who first, and so on. I’m always interested to hear both sides of the meet-up. So, Aniruddh, why don’t you take the baton from Jake right there and talk about, from your perspective, how you saw this coming together, who approached who, what happened—and then Jake can confirm or deny the story! [LAUGHTER]

VASHISTH: Yeah, yeah. So it actually started off, I have a fantastic colleague and a very good friend in CS department, Professor Vikram Iyer, and he actually introduced me to Bichlien Nguyen from Microsoft, and we got a coffee together and we were talking about vitrimers, like the work that we do in our lab, and I had this one schematic—I forget if it was on my phone or I was carrying around one paper in my pocket—and I showed them. I was like, you know, if we can actually do a bunch of simulations, guide an ML model, we can create, for lack of a better word, like a ChatGPT-type of model where instead of telling like, “This is the chemistry; tell me what the properties are,” we can go from the other side. You can ask the model, “Hey, I want a vitrimer chemistry which is recyclable, re-processable, that I can make airplanes out of or I can make glasses out of. Tell me what that chemistry would look like.” And I think, you know, Bichlien was excited about this idea, and she connected me with Jake, and I think I’ve been enjoying this collaboration for the last couple of years, …

HUIZINGA: Right …

VASHISTH: … working on that.

HUIZINGA: Was there a paper that started the talk, or was it just this napkin drawing? [LAUGHS]

VASHISTH: I think, to give myself a little bit of credit there, I think there was a paper with a nice drawing on it.

HUIZINGA: Right?

VASHISTH: Yeah. There was a white paper. Yeah.

HUIZINGA: That’s good. Well, Jake, what’s your side of this story?

SMITH: Ah, this is awesome! We got the first half that I didn’t know, so …

HUIZINGA: Oh—filling in gaps!

SMITH: This was the Bichlien-mediated half! [LAUGHTER] I was sharing an office with Bichlien, who apparently came up from this meeting, and, you know, I saw the mythical paper! She put this on my desk. And I’ll plug another MCRI project that we were working on there where—or at the time—where we were attempting to do reverse design, or inverse design, of metal organic frameworks, which are these really interesting molecules that have the possibility to actually serve as carbon capture absorbents, …

HUIZINGA: Oh, wow.

SMITH: … but the approach there was to use machine learning to help us, you know, sample this giant space of metal organic frameworks and find ones that had the property that we cared about. I mean, you draw this diagram that’s much like Aniruddh just described, where you’ve got this model that you train and out the other side comes what you want, and so this paper came down on my desk, and I looked at it and I said, “Hey, that’s what we’re doing!” [LAUGHTER] And it, kind of, you know, went from there. We had a chat. We determined, hey, we’re both interested in, you know, this general approach to getting to novel materials.

HUIZINGA: Right.

SMITH: And then, you know, we’ve already talked about the synergy between our interests and Microsoft’s interests and the, you know, great work or the great particular applications that are possible with the type of polymer work that Aniruddh does.

HUIZINGA: Yeah. So the University of Washington and Microsoft meet again. [LAUGHTER] Well, Jake, let’s do another zoom out question because I know there’s more than just the Microsoft Climate Research Initiative. This project is a perfect example of another broader initiative within Microsoft which has the potential to quote “accelerate and enhance current research,” and that’s AI for Science. So talk about the vision behind AI for Science, and then if you have any success stories—maybe including this one—tell us how it’s working out.

SMITH: Yeah, absolutely. We are—and by we, I mean myself and my immediate colleagues—are certainly not the only ones interested in applying AI to scientific discovery at Microsoft. And it turned out, a year or two after we started this collaboration, a bigger organization named AI for Science arose, and we became part of it. And it’s, you know, generally a group of people who—along with our kind of sister organization in research called Health Futures, who work more on the biology side—are interested in how AI can help us do science in (a) a faster way, but (b) maybe a smarter, better-use-of-resources way, or the ultimate goal, or the ultimate dream, is (c) a way that we just can’t think of doing right now. A way that, you know, it just is fundamentally incompatible with the way that research has historically been done in, you know, small groups of grad students directed by a professor who are themselves, you know, the actual engine behind the work that happens. And so the AI for Science vision, you know, it’s got a couple of parts that really map very well onto this project. The first part is we want to be able to simulate bigger systems. We want to be able to run simulations for longer, and we want to be able to do simulations at higher accuracy. When we get into the details of, you know, the particulars of the vitrimer project, you’ll see that one of the fundamental blocks here is the ability to run simulations, and Aniruddh’s excellent grad student Yiwen, you know, spent a ton of time trying to identify the appropriate simulation parameters in order to capture the behavior that we care about here. And so, the first AI for Science vision says we don’t need Yiwen to do that, you know, we’re going to have a drop-in solution or we’re going to have, you know, a set of drop-in solutions that can, you know, take this work away from you and make it much easier for you to go straight to running the simulations that you care about.

HUIZINGA: Yeah. A couple questions. Not on the list here, but you prompted them. No pun intended. Are these specialized models with the kinds of information … I mean, if I go to ChatGPT and ask it to do what you guys are doing, I’m not going to get the same return am I?

SMITH: Absolutely.

HUIZINGA: Am I?

SMITH: Oh, no, no, no, no! [LAUGHTER] I was saying you were absolutely correct. [LAUGHS] You can ask ChatGPT, and it will tell you all sorts of things that are very interesting. It can tell you, probably, what a vitrimer is. It could give you Aniruddh’s spiel about the spaghetti, I’m sure, if you prompted it in the correct way. But what it can’t tell you is, you know, “Hey, I have this particular vitrimer composition, and I would like to know at what temperature it’s going to melt when I heat it up.”

HUIZINGA: Right. OK, so I have one more question. You talk about the simulations. Those take a lot of compute. Am I right? Am I right?

SMITH: You’re absolutely right.

VASHISTH: Yeah.

HUIZINGA: So is that something that Microsoft brings to the party in terms of … I mean, does the University of Washington have the same access to that compute, or what’s the deal?

VASHISTH: I think especially on the scale, we were super happy and excited that we were collaborating with Microsoft. I think one of these simulations took, like, close to a couple of weeks, and we ended up doing, I would say, like, close to more than 30,000 simulations. So that’s a lot of compute time if you think about it.

HUIZINGA: To put that in perspective, how long would it take a human to do those simulations? [LAUGHS]

SMITH: [LAUGHS] Oh, man, to try and actually, like, go do all this in the lab …

HUIZINGA: Right!

SMITH: First, you got to make these 30,000, like, starting materials. This in itself … let’s say you could buy those. Then to actually run the experiments, how long does it take to do one …

HUIZINGA: And how much money?

VASHISTH: That’s … that’s like you’re talking about like one PhD student there.

HUIZINGA: Right?

VASHISTH: That’s like, you know, it takes like a couple of years just to synthesize something properly and then characterize it, and it’s …

HUIZINGA: Yeah …

VASHISTH: Yeah, no, I think the virtual world does have some pluses to it.

HUIZINGA: So this is a really good argument for AI for Science, meaning the things that it can do, artificial intelligence can do, at a scale that’s much smaller than what it would take a human to do.

SMITH: Yeah, absolutely. And I’ll plug the other big benefit now, which is, hey, we can run simulations. This is fantastic. But the other thing that I think all of us really hope AI can do is it can help us determine which simulations to run …

HUIZINGA: Ooh …

SMITH: … so we need less compute overall, we need less experiments if we have to go do the experiments, and this is …

HUIZINGA: So it’s the winnowing process.

SMITH: Exactly.

HUIZINGA: OK. That’s actually really interesting.

SMITH: And this is, like, the second, or maybe even the largest, vector for acceleration that we could see.

HUIZINGA: Cool. Well, every show I ask, what could possibly go wrong if you got everything right? And, Aniruddh, I want to call this the “Defense Against the Dark Arts” question for you. You’re using generative AI to propose what you call novel chemistries, which can sound really cool or really scary, depending on how you look at it. But you can’t just take advice from a chatbot and apply it directly to aerospace. You have to kind of go through some processes before. So what role do people, particularly experts in other disciplines, play here, and what other things do you need to be mindful of to ensure the outputs you get from this research are valid?

VASHISTH: Yeah, yeah. That’s a fantastic question. And I’ll actually piggyback on what Jake just said here, about Yiwen Zheng, who’s like a fantastic graduate student that we have in our lab. He figured out how to run these simulations at the first point. It was like six months of … like, really long ordeal. How to make sure that in the virtual world, we are synthesizing these polymers correctly and we are testing them correctly. So that human touch is essential, I feel like, at every step of this research, not just like doing virtual characterization or virtual synthesis of these materials, training the models, but eventually, when you train the models also and the model tells you that, well, these are, like, the 10 best polymers that would work out, there you need people like Jake who are like chemists, you know. They come in [LAUGHTER] and they’re like, hey, you know what? Like, out of these 10 chemistries, this one you can actually synthesize. It’s a one-step reaction or things of that sort. So we have a chemist in our lab also, Dr. Agni Biswal, who’s a postdoc. So we actually show him all these chemistries, apart from Jake and Bichlien. We show the chemistries to all the chemists and say, like, OK, what do you think about this? How do these look like? Are they totally insane, or can we actually make them? [LAUGHTER]

SMITH: Yeah, we still need that, like, human evaluation step at the end, at this point.

HUIZINGA: Yeah …

VASHISTH: Exactly.

HUIZINGA: Ask a chemist! Well, and I would imagine it would be further than just, “This would be the best one,” or something like, “You better not do that one.” Are there ever like crazy responses or replies from the model?

SMITH: [LAUGHS] It’s fascinating. Models are very good—and particularly we’ll talk about models that generate small organic structures—at generating things that look reasonable. They follow all the rules. But there’s this next step beyond that. And you see this when you talk to people who’ve worked in med chem for, you know, 30 years of their life. Well, they’ll look at a structure and they’ll, like, get this gut feeling like, you know, a storm is coming in and their knee hurts, and they really don’t like that molecule. [LAUGHTER] And if you push them a little bit, you know, sometimes they can figure out why. They’ll be like, oh, I worked on, you know, a molecule that looked like that 20 years ago, and it, you know, turned out to have this toxicity, and so I don’t want to touch that again. But oftentimes, people can’t even tell you. They’ve just got this instinct …

HUIZINGA: Really?

SMITH: … that they’ve built up, and trying to, you know, capture that intuition is a really interesting next frontier for this sort of research.

HUIZINGA: Wow. You know, you guys are just making my brain fry because it’s like so many other questions I want to ask, but we’re actually getting there to some of them, and I’m hoping we’ll address those questions with the other things I have. So, Jake, I want to come … Well, first of all, Aniruddh, have you finished your defense against the dark arts? [LAUGHS]

VASHISTH: I think I can point out one more thing very quickly there, and as Jake said, like, we are learning a lot, particularly about these materials, like, the vitrimer materials. These are new chemistries, and we are still learning about, like, the mechanical, thermorheological properties; how to handle these materials. So I think there’s a lot that we don’t know right now. So it’s like a bunch of, like, unknown unknowns that are there. So …

HUIZINGA: Well, and that’s research, right? The unknown unknowns. Jake, I want to come back to the vision of the climate research initiative for a minute. One goal is to develop technologies that reduce the raw tonnage of e-waste, obviously. But if we’re honest, advances in technology have almost encouraged us to throw stuff away. It’s like before it even wears out. And I think we talked earlier about, you know, this will last as long as my car lasts or whatever, but I don’t like my car in five years. I want a different one, right? So I wonder if you’ve given any thought to what things, in addition to the work on reusable and recyclable components, we might do to reverse engineer the larger throwaway culture?

SMITH: This was interesting. I feel like this gets into real questions about social psychology and our own behaviors …

HUIZINGA: Yeah …

SMITH: … with individual things. Why do I have this can of carbonated water here when I could have a glass of carbonated water? But I want to, kind of, completely sidestep that because …

HUIZINGA: Yeah … Well, we know why! Because it’s convenient, and you can take it in your car and not spill.

SMITH: Agreed. Yes. All right. [LAUGHTER] I also have this cup, and it could not spill, as well.

HUIZINGA: True! Recyclable—reusable.

SMITH: Ahhh … no, no … this is like a—it’s an ingrained consumer behavior that I’ve developed that might … I’ll slip into “Jake’s Personal Perspectives” here, which is that it should not be on the individual consumer behavior changes to ultimately drive a shift towards reusable and recyclable things. And so one of the fundamental, like, hypotheses that we had with the, you know, design of the projects we put together with the MCRI was that if we put appropriate economic incentives in place, then we can naturally guide behavior at a much bigger scale than the individual consumer. And maybe we’ll see that trickle down to the consumer. Or maybe this means that the actual actors, the large-scale actors, then have the economic incentive to follow it themselves.

HUIZINGA: Right.

SMITH: And so with the e-waste question in particular, we talked a lot about FR-4 and, you know, it’s the part of the circuit board that you’re left over with at the end that there’s just nothing to do with …

HUIZINGA: Right.

SMITH: … and so you toss into landfill, you burn it, you do something like this. But, you know, with a project like this, where our goal was to take that material and now make it reusable, we can add this actual economic value to the waste there.

HUIZINGA: Yeah. I realized even as I asked that question, that I had the answer embedded in the question because, in part, how we design technologies drives how people use things.

SMITH: Yeah, absolutely.

VASHISTH: Yeah.

HUIZINGA: And usually, the drivers are convenience and economics. So if upstream of consumer … consumption? [LAUGHTER] Upstream of that, the design drives environmental health and so on, that’s actually … that’s up to you guys! So let’s get out of this booth and get back to work! [LAUGHTER] Well, Jake, to that point, talk about the economics. We talk about a circular economy. And I know that recycling is expensive. Can you talk a little bit about how that could be impacted by work that you guys do?

SMITH: Recycling absolutely is expensive relative to landfilling or a similar alternative.

HUIZINGA: Right …

SMITH: One of the things that makes us target e-waste is that there are things of value in e-waste that are, like, innately valuable. When you go recollect that copper or the gold that you’ve put into this, when you recollect the integrated circuits, you know, they had value, and so a lot of the economic drive is already there to get you to the point where you have these circuit boards. And then, you know, the question was, how do we get that next bit of economic value so that you’ve taken steps this far, you have this pile of circuit boards, so you’ve already been incentivized to get to here and it will be easy to make this—even if it’s not a completely economically productive material—versus synthesizing a circuit board from virgin plastic, but it’s offset enough. We’ve taken enough of that penalty for reuse out that it can be justifiable to go do.

HUIZINGA: Right. OK. So talk—again, off script a little bit—but talk a little bit about how vitrimers help take it to the last mile.

VASHISTH: Yeah, I think the inherent property of the polymer to kind of unclick and re-click back again, the heal-ability of the polymer, that’s something that, kind of, drives this reusability and re-processability of the material. I’ll just, like, point out, like, you know, particularly to the PCB case, where we recently published a collaborative paper where we showed that we can actually make PCB boards using vitrimers. We can unassemble everything. We can take out the electronics, and even the composite, the glass fiber and the polymer composite, we can actually separate that, as well, which is, in my mind, like, a pretty big success.

HUIZINGA: Yeah.

VASHISTH: And then we can actually put everything back together and remake a PCB board, and, you know, keep on doing that. So …

HUIZINGA: OK, so you had talked to me before about “Ring Around the Rosie” and the hands and the feet. Can you … ?

SMITH: [LAUGHS] His favorite analogy!

HUIZINGA: Do that one just for our audience because it’s good.

VASHISTH: OK. So I’ll talk a little bit about thermoset/thermoplastic again, and then I’ll just give you a much broader perspective there.

HUIZINGA: Yeah.

VASHISTH: So the FR-4 PCBs that are made, they are usually made with thermosetting polymers. So if you think about thermosetting polymers, just think of kids playing “Ring of Roses,” right? Like their hands are fixed and their feet are fixed. Once the network is formed, there’s no way you can actually destroy that network. The nice thing about vitrimers is that when you provide an external stimulus, like, just think about these kids playing “Ring of Roses” again. Their feet can move and their handshakes can change, but the number of handshakes remain the same. So the polymer is kind of, like, unclicking and re-clicking back again.

HUIZINGA: OK.

VASHISTH: And if you can cleverly use this mechanism, you can actually recycle, reprocess the polymer itself. But what we showed, particularly for the PCB paper, was that you can actually separate all the other constituents that are associated with this composite, yeah.

HUIZINGA: OK. That’s … I love that. Well, sticking with you for a second, Aniruddh, talking about mechanical reality—not just chemical reality, but mechanical reality—even the best composites wear out, from wear and tear. Talk about the goal of this work on novel polymers from an engineering perspective. How do you think about designing for reality in this way?

VASHISTH: Yeah, yeah. That’s a fantastic question. So we were really motivated by what type of mechanical or thermal loadings materials see in day-to-day life. You know, I sit in my car, I drive it, it drives over the road, there is some fatigue loadings, there’s dynamic loading, and that dynamic loading actually leads to some mechanical flaws in the material, which damages it. And the thought was always that, can we restrict that flaw, or can we go a step further? Can we actually reverse that damage in these composites? And that’s where, you know, that unclicking/re-clicking behavior of vitrimer becomes, like, really powerful. So actually, the first work that we did on these type of materials was that we took a vitrimer composite and we applied fatigue loading on it, cyclic loading on it, mechanical loading. And then we saw that when there was enough damage accumulated in the system, we healed the system. And then we did this again. And we were able to do it again and again until I was like, I’ve spent too much money on this test frame! [LAUGHS] But it was really exciting because for a particular loading case that we were looking at, traditional composites were able to sustain that for 10,000 cycles, but for vitrimers, if we did periodic healing in the material, we were able to go up to a million cycles. So I think that’s really powerful.

HUIZINGA: Orders of magnitude.

VASHISTH: Yeah, exactly.

HUIZINGA: Wow. Jake, I want to broaden the conversation right now, beyond just you and Aniruddh, and talk about the larger teams you need to assemble to ensure success of projects like this. Do you have any stories you could share about how you go about building a team? You kind of alluded to it at the beginning. There’s sort of a pickup basketball metaphor there. Hey, he’s doing that. We’re doing this. But you have some intentionality about people you bring in. So what strengths do each institution bring, and how do you build a team?

SMITH: Yeah, absolutely. We’ve tried a bunch of these collaborations, and we’ve definitely got some learnings about which ones work better than others. This has been a super productive one. I think it’s because it has that right mix of skills and the right mix of things that each side are bringing. So what we want from a Microsoft side for a successful collaboration is we want a collaborator who is really a domain expert in, you know, something that we don’t necessarily understand but who can tell us, in great detail, these are the actual design criteria; these are, you know, where I run into trouble with my traditional research; this is the area that, you know, I’d like to do faster, but I don’t necessarily know how. And this was the critical part, I think, you know, from the get-go. They need to, themselves, be an extremely, you know, capable subject matter expert. Otherwise, we’re just kind of chatting. We don’t have anyone that really knows what the problem truly is and you make no progress or you … worse, you spend a whole lot of resources to make “progress”—I’m doing air quotes …

HUIZINGA: Yeah. I love air quotes on a podcast!

SMITH: [LAUGHS]—that is actually just completely tangential to what the field needs or what the actual device needs. So this was, you know, the fundamental ingredient. And then on top of that, we need to find a problem that’s of joint interest where, in particular, …

HUIZINGA: Right …

SMITH: … computation can help. You talked about the amount of computation that we have at our disposal as researchers at Microsoft, which is a tremendous strength. And so we want to be able to leverage that. And so for a collaboration like this, where running a large number of simulations was a fundamental ingredient to doing it, this was, you know, a really good fit, that we could come in and we could enable them to have more data to train the models that we build together.

HUIZINGA: Mm-hm. Well, as researchers, are you each kind of always scanning the horizon for who else is doing things in your field that—or tangential to your field but necessary? How does that work for recruiting, I would say?

VASHISTH: Yeah, that’s a good question. I think … I mean, that’s kind of like the job, right. For the machine learning work we did, we saw a lot of inspiration from biology, where people have been designing biomolecules. The challenges are different for us. Like, we are designing much larger chains. But we saw some inspiration from there. So always, like, looking out for, like, who is doing what is super helpful, and it leads to, like, really nice collaborations, as well. We’ve had, like, really fruitful collaborations with the professor Sid Kumar at TU Delft, and we always get his wisdom on some of these things, as well. But yeah, recruiting students also becomes, like, very interesting and how, like, people who can help us achieve our idea …

HUIZINGA: Yeah. Jake, what’s your take on it from the other seat? I mean, do you look actively at universities around the world—and even in your backyard—to … like U Dub … ? [LAUGHTER]

SMITH: My perspective on, like, how collaborations come in to be is they’re really serendipitous. You know, we talked about how this one came in to be, and it was because we all happen to know Vikram, and Vikram happened to connect Bichlien with Aniruddh, and it kind of rolled from there. But you can have serendipitous, you know, meetings at a conference, where you happen to, you know, sit next to someone at a talk and you both share the same perspective on, you know, how a research problem should be tackled, and something could come out of that. Or in some cases, you go actually shopping for a collaborator.

HUIZINGA: Right. [LAUGHTER]

SMITH: You know, you need to talk to 10 people to find the one that has that same research perspective as you. I’ll second Aniruddh’s, you know, observation that you get a very different perspective if you go find someone who, they may have the same, like, perspective on how research should be tackled, but they have a different perspective on what the ultimate output of that research would be. But, you know, they can often point you in areas where your research could be helpful that you can’t necessarily see because you lack the domain knowledge or you lack that particular angle on it.

HUIZINGA: Which is another interesting thing in my mind is, you know, the role that papers, published papers, play—that’s a lot of p’s in a sentence [LAUGHTER] … alliteration—that you would be reading or hearing about either in a lightning talk or a presentation at a conference. Does that broaden your perspective, as well? And how do you … like, do you call people up? “I read your paper … ”?

SMITH: [LAUGHS] I have cold-emailed people. You know, this works sometimes! Sometimes this is just the introduction that you need. But the interesting thing in my mind is how much the computer science conferences and things like ChemRxiv and arXiv have really replaced, for me, the traditional chemistry literature or the traditional publishing literature where you can have a conversation with this person while they’re still actively doing the work because they put their initial draft up there and it still needs revision, and there’s opportunities even earlier on in the research process than we’ve had in the past.

HUIZINGA: Huh. And to your earlier point, I’m envisioning an Amazon shopping cart for research collaborators. [LAUGHTER] “Oh, he looks good. Into my cart.” Aniruddh, I always like to know where a project is on the spectrum from what I call lab to life, and I know there are different development stages when it comes to technology finding its way into production and then into broader use. So to use another analogy I like, pretend this is a relay race and research is the first leg. Who else has to run, and who brings it across the line?

VASHISTH: Yeah, yeah. So I think the initial work that we have done, I think it’s been super fruitful, and to Jake’s point, like, converging to, like, a nice output. It took a bunch of chemists, mechanical engineers, simulation folks, machine learning scientists to get where we are. And, as Jake mentioned, we’ve actually put some of our publications on arXiv, and it’s getting traction now. So we’ve had some excitement from startups and companies which make polymers asking us, “Oh, can you actually … can we get a slice of this framework that you’re developing for designing vitrimers?” Which is very promising. So we have done very fundamental work, but now, like, what’s called “the valley of death” in research, [LAUGHTER] like taking it from lab to like production scale, …

HUIZINGA: Yeah.

VASHISTH: … it’s usually a very tightly knit collaboration between industry, labs, and sometimes national labs, too. So we’re excited that, actually, a couple of national labs have been interested in the work that we have been doing, so super optimistic about it.

HUIZINGA: So would you say that the vitrimer-based printed circuit board is a proof of concept right now? Or have you made prototypes? Where is that now?

SMITH: Yeah, absolutely. We’ve mentioned our other collaborator, Vikram Iyer, a couple of times. And in collaboration with his lab, we did actually make a prototype circuit board. We showed that it works as you expect. We showed that it can be disassembled. It can be put back together, and it still works as expected …

HUIZINGA: The “break stuff/make stuff back” thing …

VASHISTH: Yeah, exactly.

SMITH: But, you know, I think to the spirit of the question, it’s still individual kind of one-off experiments being run in a lab, and Aniruddh is right. There’s a long way to go from, like, Technology Readiness Level 3, where we’re doing it ourselves on bench scale, up to, you know, the 7, 8, 9, where it’s actually commercially viable and someone has been able to reproduce this at scale.

HUIZINGA: Right. … So that’s when you bring investors in or labs that can make stuff in and scale.

VASHISTH: Yeah. Yeah, I think once you’re, like, close to 7, I think that’s where you’re pretty much ready for the big show.

HUIZINGA: So where are you now? 2? 3?

VASHISTH: I would say, like, 2 or 3 …

SMITH: 2, 3, somewhere in that range.

VASHISTH: Yeah.

HUIZINGA: OK.

SMITH: The scales, kind of, differ depending on which agencies you see put it out.

HUIZINGA: So, Jake, before we close, I want to talk briefly about other applications of recyclable vitrimer-based polymers, in light of their importance to the climate research initiative and AI for Science. So what other industries have polymer components that have nowhere to go after they die but the landfill, and will this research transfer across to those industries?

SMITH: An excellent question. So my personal view on this is that there’s a couple of classes of polymers. There’s these very high-value application uses of polymers where we’re talking about the printed circuit boards; we’re talking about aerospace composite; we’re talking about the panels on your car; we’re talking about things like wind turbines …

HUIZINGA: Oh, yeah.

SMITH: … where there’s a long life cycle. You have this device that’s going to be in use for five years, 50 years, and at the end of that, the polymer itself is still probably pretty good. You could still use it and regenerate it. And so Aniruddh’s lab has done great work showing that you can take things like the side panel of a plane and actually disassemble this thing, heal it, keep it in use longer, and use it at the end of its lifetime. There’s this other class of polymers, which I think are the ones that most people think about—your Coke bottle—and vitrimers seem like a much harder sell there. I think this is more the domain of, you know, biodegradable polymers in the long run to really tackle the issues there. But I’m very excited in this, you know, high-value polymer, this long-lifetime polymer, this, like, permanent install polymer, however you want to think about it, for work like this to have an impact.

HUIZINGA: Yeah. From your lab’s perspective, Aniruddh, where do you see other applications with great promise?

VASHISTH: Yeah. So as Jake said, places where we need high-performance polymers is where we can go. So PCBs is one, aerospace and automotive industry is one, and maybe medical industry is, …

HUIZINGA: Oh, interesting…

VASHISTH: … yeah, is another one where we can actually … if you can make prosthetics out of vitrimers … prosthetics actually lose a little bit of their stiffness, you know, as you use them, and that’s because of localized damage. It’s the fatigue cycle, right. So what if you can actually heal your prosthetics and reuse them? So, yeah, I feel like, you know, there’s so many different applications, so many different routes that we can go down.

HUIZINGA: Yeah. Well, I like to end our Collaborators shows with a little vision casting, and I feel like this whole podcast is that. I should also say, you know, back in the ’50s, there was the big push to make plastics! Your word is vitrimers! So let’s do a little vision casting for vitrimer-based polymers. Assuming your research is wildly successful and becomes a truly game-changing technology, what does the future look like—I mean, specified future, not general future—and how has your work disrupted this field and made the world a better place? I’ll let you each have the last word. Who’d like to go first?

VASHISTH: Sure, I can go first. I’ll try to make sure that I break it up into computation and experiments …

HUIZINGA: Good.

VASHISTH: … so that once I go back, like, my lab does not, like, pounce on me. [LAUGHS] Yeah, so I think from the computation point of view, we always thought that if somebody gave us, like, a hundred different chemistries, we can actually bottle it down to, like, we can do a bunch of simulations; tell you, like, 10 of these actually work. What we’ve been able to do specifically for vitrimers is that we’re able to look at the problem from the other side, and we are able to say that if you tell me a particular application, this particular chemistry would work best for you. In essence, what we were thinking of is that if aliens abducted all the chemists from the world, can we actually come up with a framework? [LAUGHS] So I think it’ll be difficult to get there because as I said earlier that, you know, you need that human touch. But I think we are happy that that we are getting there. And I think what remains to be seen now is, like, you know, now that we have this type of a framework, like what are the next challenges? Like, we are going from the lab to the large scale; like, what challenges are associated there? And I think similarly for the experimental side of things also, we know a lot—we have developed frameworks—but there’s a lot of work that still needs to be done in understanding and translating these technologies to real-life applications.

HUIZINGA: I like that you’re kind of hedging your bets there, saying, I’m not going to paint a picture of the perfect world because my lab is going to be responsible for delivering it. [LAUGHTER] Jake, assuming you haven’t been abducted by aliens, what’s your take on this?

SMITH: I view, kind of, the goal of this work and the ideal impact of this work as an acceleration of getting us to these polymers being deployed in all these other applications that we’ve talked about, and we can go broader than this.

HUIZINGA: Yeah …

SMITH: I think that there’s a lot of work, both within the MCRI, within Microsoft, and outside of Microsoft in the bigger field, focused on acceleration towards a specific goal. And if all of this work is successful, in 10 years, maybe our materials design process looks completely different, where we’ve gone from this kind of brute-force screening that Aniruddh has talked about to an approach where you start with the properties that you care about; they’re defined by the application that you have in mind. You want to make your vitrimer PCB, it needs to have, you know, a specific temperature where it becomes gummy; it needs to have a specific resistance to burning; it needs to be able to effectively serve as the dielectric for your bigger circuits. And we use this, like, “need space” to define the material that we would like, and we can use machine learning, artificial intelligence, in order to get us to the structure that we need to make in order to actually achieve this design space. And so, this was, you know, our big bet within AI for Science. This is the big bet of this project. And with this project, you know, we take one step towards showing that you can do this in one case. And the future casting would be we can do this in every materials design case that you can think about.

HUIZINGA: Hmmm. You know, I’m thinking of lanes—track analogy again—but, you know, you’ve got mechanical engineering, you’ve got chemistry, and you’ve got artificial intelligence, and each of those sciences is advancing, and they’re using each other to, sort of, help advance in various ways, so this is an exciting, exciting project and collaboration.

[MUSIC]

Jake, Aniruddh, thanks for joining us today on Collaborators. This has been really fun for me. [LAUGHTER] So thanks for coming in and sharing your stories today.

VASHISTH: Thank you so much.

SMITH: Yeah. Of course. Thank you.

[MUSIC FADES]

The post Collaborators: Sustainable electronics with Jake Smith and Aniruddh Vashisth appeared first on Microsoft Research.

Read More

Unified Database: Laying the foundation for large language model vertical applications

Unified Database: Laying the foundation for large language model vertical applications

A diagram showing splitting vector partitions and reallocating vectors in partitions to adapt to changes in data distribution.

Large language models (LLMs) have become a valuable technology in areas such as content creation, language comprehension, and intelligent dialogue, or interactions between people and computer systems. However, these models generate responses based on patterns and rules observed in fixed training data, which can potentially lead them to produce erroneous and even fictitious information. The models can also struggle with real-time knowledge updates. One technique known as retrieval augmented generation (RAG) can organically combine fresh external information with LLMs, putting relevant and precise knowledge into context to help guide the answer generation process, enhancing their performance and reliability.
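
As a rough illustration of the RAG pattern described above, the sketch below retrieves the passages nearest to a question from a vector store and places them in the prompt. Here `embed`, `vector_store.search`, and `call_llm` are hypothetical stand-ins for an embedding model, a vector database, and an LLM, not the API of any specific system.

```python
# Minimal sketch of retrieval-augmented generation (RAG).
# embed(), vector_store.search(), and call_llm() are hypothetical helpers
# standing in for an embedding model, a vector database, and an LLM.

def answer_with_rag(question, vector_store, embed, call_llm, k=5):
    query_vector = embed(question)                           # encode the question as a vector
    passages = vector_store.search(query_vector, top_k=k)    # nearest-neighbor retrieval
    context = "\n\n".join(p.text for p in passages)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return call_llm(prompt)                                   # answer grounded in retrieved text
```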
 
One of the core components of RAG, the vector database, significantly differs from traditional relational databases in its storage and query mechanisms. This presents a challenge to the unified management of increasingly diverse and multimodal knowledge bases. Researchers from the Systems and Networking Group at Microsoft Research Asia believe that a unified database capable of managing rich attributes and modalities of external knowledge will support widespread application and improved reliability of LLMs.
 
“As the capabilities of large models continue to improve, various types of data, such as text, images, and videos, can be encoded into high-dimensional vectors using machine learning technology. Detailed attributes of knowledge, such as the type of images, user preferences, etc., can be converted into different data features. It becomes difficult to achieve efficient and accurate query results among these mixed types of information. Therefore, a unified database is needed to effectively manage the data, providing a more solid knowledge foundation for LLMs,” said Qi Chen, a principal researcher at the Microsoft Research Asia lab in Vancouver, Canada.

VBase query system: Providing a unified foundation for vector index and scalar index scanning

Vector databases and scalar databases have different index scan patterns. Therefore, the lack of a unified foundation is the first problem to be solved in building a unified database.
 
Scalar indexes are based on numerical order, and the scanning of these indexes follows a strictly increasing or decreasing order. This is the primary reason why relational databases can efficiently execute queries. For example, when searching for clothes priced between 100 and 200 Canadian dollars on a shopping platform, the system starts scanning from the price of C$100, and the query stops once the price exceeds C$200. Clearly, this monotonicity-based scalar query is highly efficient.
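
As a toy illustration of that ordered scan (not tied to any particular database engine), the snippet below jumps straight to the first qualifying price and stops as soon as the upper bound is passed:

```python
# Monotonic range scan over a sorted "index" with early termination.
import bisect

prices = sorted([35, 80, 105, 120, 150, 180, 199, 210, 250, 400])

def range_scan(lo: float, hi: float) -> list[float]:
    results = []
    start = bisect.bisect_left(prices, lo)   # jump to the first price >= lo
    for p in prices[start:]:
        if p > hi:                           # monotonicity lets us stop early
            break
        results.append(p)
    return results

print(range_scan(100, 200))  # [105, 120, 150, 180, 199]
```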
 
In contrast, vector indexes are built on proximity in high-dimensional space, and index traversal cannot follow a strict order, so monotonicity is lost. A vector index only provides approximate spatial navigation toward the nearest subspace for a query. To terminate a scan early, the traversal has to rely on a TopK algorithm to impose a temporary order; although that order can be used to stop execution in advance, the approach is inefficient.

Diagrams illustrating query execution on scalar database and vector database. Left: Scalar index diagram with ordered scanning; Right: Vector index diagram with no strict order.
Figure 1. Query execution on scalar database and vector database

For example, suppose a person has a picture of a garment and wants to find similar items on a shopping platform that are priced below C$200. The traditional method is to first conduct a similarity query to get a large number of candidates, and then filter based on price. For instance, to find the top 10 most similar and appropriately priced results, one can first set the search range to 1,000 candidates, and then filter one by one according to the price condition until 10 results appear that meet the requirements. If the results are insufficient, the search range can be expanded to 2,000 or 3,000 until the requirements are met. 

In effect, this approach converts the vector retrieval results into a temporary scalar index with strict monotonicity and then performs scalar queries over it.

The problem with this method is that it cannot guarantee that the K results returned will satisfy the final filter. To make sure enough results survive filtering, the TopK search must either be made wider up front, returning a larger K, or be repeated with a larger K whenever the first pass falls short, as sketched below. Either way, query performance suffers.
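
Here is a rough sketch of that retrieve-then-filter workaround. `vector_topk` and `price_of` are hypothetical stand-ins for an ANN index's k-nearest query and an attribute lookup:

```python
# Conventional "TopK then filter": widen the similarity search until enough
# candidates pass the scalar filter. Inefficient when the filter is selective.

def filtered_search(query_vec, vector_topk, price_of, max_price, want=10, start_k=1000):
    k = start_k
    while True:
        candidates = vector_topk(query_vec, k)             # widening similarity search
        hits = [c for c in candidates if price_of(c) <= max_price]
        if len(hits) >= want or k >= 100 * start_k:        # stop when satisfied (or give up)
            return hits[:want]
        k *= 2                                             # not enough matches: widen and retry
```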

By analyzing a large number of vector indexes, the researchers found that vector index queries do not actually require strict monotonicity for early termination: vector index traversal exhibits a relaxed form of monotonicity, and the strictly ordered traversal of scalar indexes is a special case of it.

Based on this discovery, the researchers developed the VBase unified database system. VBase provides a unified foundation for efficiently scanning both vector indexes and scalar indexes, so that all index scans follow the same interface and the same early-termination conditions. This not only improves the performance of vector databases on complex queries by 10 to 1,000 times but also improves query accuracy.
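
The snippet below is a toy reading of the relaxed-monotonicity idea, not VBase's actual termination test: rather than requiring every visited distance to increase, the traversal stops once a window of recently visited candidates has failed to improve the current TopK set.

```python
# Toy early termination under "relaxed" monotonicity (illustrative only).
from collections import deque
import heapq

def traverse_with_relaxed_stop(candidates, k=10, window=64):
    """candidates yields (distance, item) pairs in index-traversal order."""
    top = []                       # heap of the current k best, stored as (-distance, seq, item)
    recent = deque(maxlen=window)  # did each recent candidate improve the result set?
    for seq, (dist, item) in enumerate(candidates):
        improved = len(top) < k or dist < -top[0][0]
        if improved:
            heapq.heappush(top, (-dist, seq, item))
            if len(top) > k:
                heapq.heappop(top)
        recent.append(improved)
        if len(recent) == window and not any(recent):
            break                  # a full window without any improvement: stop early
    return sorted(((-d, item) for d, _, item in top), key=lambda t: t[0])
```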

VBase makes it possible to build a unified database capable of executing complex relational queries that mix vector and scalar data. An open-source database platform has already built its own multimodal vector database on top of the VBase system.


SPFresh: First vector index that supports real-time in-place incremental update 

RAG based on vector database retrieval significantly improves the accuracy of LLM outputs. However, this improvement depends on keeping the data in the vector database up to date in real time. For vectors with hundreds to thousands of dimensions, updating is not easy: rebuilding the vector index can take days.

Scalar databases typically use a B-tree or B+-tree index, where an update simply locates the target position with a logarithmic tree search and inserts the new entry in place, as in the sketch below. Updating a vector database is much more complicated.
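
The following is a much-simplified stand-in for such a scalar-index update, using a sorted array rather than a real B+-tree (which would also split nodes), to show that the update is just "find the position, insert in place":

```python
# Simplified scalar-index update: locate the slot, insert directly.
import bisect

index = [10, 20, 30, 40, 50]      # keys kept in sorted order
bisect.insort(index, 35)          # logarithmic search, then in-place insert
print(index)                      # [10, 20, 30, 35, 40, 50]
```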

Consider two popular designs: fine-grained graph-based vector indexes and coarse-grained cluster-based vector indexes. Inserting or deleting a vector in a fine-grained graph index requires a large-scale graph scan to find the appropriate neighbors to update, which consumes substantial computational resources.

Skimping on these updates, meanwhile, weakens performance and accuracy. In a coarse-grained cluster index, an insertion or deletion only modifies the nearest partition, so the per-update cost is lower. But as partition updates accumulate, the data distribution can become unbalanced, hurting query latency and accuracy and degrading index quality.

Existing vector index update methods rely on periodic global rebuilding, which is slow and resource intensive. Although performance and accuracy are immediately improved after rebuilding, they gradually decline between rebuilds. In addition, the cost of global rebuilding is very high, requiring more than 10 times the resources of traditional indexing, possibly exceeding the cost of index search services. 

To solve these problems, researchers from Microsoft and their collaborators have proposed SPFresh, the first vector index that supports real-time, in-place, incremental updates for unified databases. The core of SPFresh is LIRE, a lightweight incremental rebalancing protocol that dynamically splits or merges vector partitions and reallocates vectors among partitions to adapt to changes in data distribution. LIRE keeps update costs low by reallocating vectors only in nearby partitions.

A diagram showing splitting vector partitions and reallocating vectors in partitions to adapt to changes in data distribution.
Figure 2. Partition splitting requires reallocating vector data
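
The sketch below illustrates the general split-and-reassign idea pictured in Figure 2. It is not the LIRE protocol itself; the 2-means split and the size threshold are illustrative assumptions.

```python
# Illustrative local split-and-reassign for a cluster-based vector index.
import numpy as np

def split_partition(vectors: np.ndarray, iters: int = 10):
    """Split one oversized partition into two with a tiny 2-means."""
    c = vectors[np.random.choice(len(vectors), 2, replace=False)]    # two seed centroids
    assign = np.zeros(len(vectors), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(vectors[:, None, :] - c[None, :, :], axis=2)
        assign = d.argmin(axis=1)                                    # nearest-centroid assignment
        for j in (0, 1):                                             # recompute centroids, keeping
            members = vectors[assign == j]                           # the old one if a side is empty
            if len(members):
                c[j] = members.mean(axis=0)
    return c, assign

def insert(vector, centroids, partitions, max_size=1000):
    """Insert into the nearest partition; split it locally if it grows too large."""
    pid = int(np.linalg.norm(centroids - vector, axis=1).argmin())
    partitions[pid].append(vector)
    if len(partitions[pid]) > max_size:
        vecs = np.asarray(partitions[pid])
        c, assign = split_partition(vecs)
        centroids[pid] = c[0]                                 # one half reuses the old slot
        partitions[pid] = [v for v, a in zip(vecs, assign) if a == 0]
        centroids = np.vstack([centroids, c[1][None, :]])     # the other half becomes a new partition
        partitions.append([v for v, a in zip(vecs, assign) if a == 1])
    return centroids, partitions
```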

Compared with existing periodic rebuilding methods, SPFresh greatly reduces the resources required to maintain the index while sustaining a stable, high recall rate, low latency, and high query throughput, adapting promptly to dynamic changes in the data distribution.

OneSparse: A unified system for sparse and dense multi-index vector search

Vector databases are widely used in fields such as natural language processing, information retrieval, and recommendation systems, providing efficient solutions for handling unstructured data. However, vector data can be encoded in different ways, and sparse and dense vectors each have advantages for different tasks: sparse vectors suit keyword-matching tasks, while dense vectors are better at capturing semantic information. In practice, multi-index mixed queries are therefore common, especially over mixed datasets, where combining sparse and dense features to find similar items has been shown to improve the accuracy of query results.

However, because of the special way vector indexes are traversed, intersections across multiple vector indexes cannot be pushed down directly, making it difficult to combine search results from multiple indexes.

To overcome this challenge, researchers from Microsoft and their collaborators have introduced OneSparse, a unified index system catering to both sparse and dense vectors. OneSparse enables the execution of multi-index mixed queries and dynamically generates the optimal merge plan, facilitating rapid intersection and union operations within a single index during index traversal.

OneSparse unifies sparse indexes and dense indexes into a single inverted index and rearranges all posting lists according to the document ID, ensuring efficient execution, even when performing complex queries for both semantic and keyword matching. The technology has been successfully applied in Microsoft Bing’s web search and sponsored search. 

Diagram illustrating the OneSparse index overview. For sparse data, OneSparse maintains one dimension of the sparse vectors (i.e., term) per inverted posting list, which allows fast lookup of all relevant documents for a word in a query. The values stored in an inverted posting list are pairs of ID and a single-dimensional feature (e.g., term frequency). For dense vectors, OneSparse clusters them into several posting lists by SPANN. In addition, it builds a SPTAG in-memory ANN index on cluster centroids to quickly navigate to the nearest SPANN posting lists. The values stored in a SPANN posting list are pairs of ID and the dense vector in that cluster. All inverted posting lists and SPANN posting lists are saved on disk.
Figure 3. OneSparse index overview
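
As a simplified illustration of merging sparse (keyword) and dense (semantic) results once both sides are ordered by document ID, the snippet below fuses the two score lists. The scores, weights, and fusion rule are assumptions for illustration, not OneSparse internals.

```python
# Intersect two posting lists ordered by doc ID and fuse their scores.

def merge_by_doc_id(sparse_hits, dense_hits, alpha=0.5):
    """Each input: list of (doc_id, score) sorted by doc_id. Returns fused scores."""
    fused, i, j = [], 0, 0
    while i < len(sparse_hits) and j < len(dense_hits):
        sid, sscore = sparse_hits[i]
        did, dscore = dense_hits[j]
        if sid == did:                                   # document matched by both indexes
            fused.append((sid, alpha * sscore + (1 - alpha) * dscore))
            i, j = i + 1, j + 1
        elif sid < did:
            i += 1
        else:
            j += 1
    return sorted(fused, key=lambda x: x[1], reverse=True)

# Example: keyword-style scores and semantic-similarity scores.
sparse = [(1, 2.1), (4, 1.7), (9, 3.0)]
dense  = [(1, 0.82), (7, 0.77), (9, 0.64)]
print(merge_by_doc_id(sparse, dense))   # docs 1 and 9 survive the intersection
```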

Unified databases accelerate the development of LLMs and hardware innovation 

As early as 2018, Microsoft Research Asia began in-depth research on vector data systems. “At that time, we realized that vectorization would become the cornerstone of deep learning applications,” Qi Chen said. “Therefore, we developed SPTAG and SPANN one after another, successfully solving the generalization and scalability problems of vector indexing, and applied them to Microsoft Bing search, achieving the world’s largest vector semantic search system.”
 
Researchers at Microsoft Research Asia continue to explore vector database technology. Building on relaxed monotonicity and the lightweight LIRE update protocol, they have built a unified database system, MSVBASE, which has been open-sourced on GitHub. MSVBASE can be used for semantic analysis of multimodal data, giving developers powerful tools for studying and applying the RAG mechanism and for designing more complex RAG retrieval queries. With it, RAG is no longer limited to TopK-based vector queries; it can also draw on richer high-dimensional vector data and attributes during retrieval, achieving more accurate query results.

In the current age of extensive knowledge expansion, unified databases offer better knowledge transfer between multimodal data types. They provide substantial corpus support for large models and are poised to drive innovation in underlying hardware, laying the foundation for data-enhanced AI in the future.

The post Unified Database: Laying the foundation for large language model vertical applications appeared first on Microsoft Research.
