Introducing AutoGen Studio: A low-code interface for building multi-agent workflows

Multi-agent approaches to AI applications, where multiple foundation model-based agents collaborate to solve problems, are emerging as a powerful paradigm for accomplishing increasingly complex tasks. In September 2023, we released AutoGen – a flexible and open-source Python-based framework for defining, configuring, and composing AI agents to drive multi-agent applications. Today, we are introducing AutoGen Studio (version 0.1.0) – a low-code interface for rapidly building, testing, and sharing multi-agent solutions. AutoGen Studio is built on AutoGen and inherits its features and functionalities, while providing a user-friendly and intuitive interface to create and customize agents, with little to no coding required.

In the nine months since its release, AutoGen has been widely adopted by researchers, developers, and enthusiasts who have created a variety of novel and exciting applications – from market research to interactive educational tools to data analysis pipelines in the medical domain. With more than 290 community contributors on GitHub and 890,000 downloads of the Python package (as of May 2024), AutoGen continues to be a leading framework for building and researching multi-agent AI applications.

AutoGen Studio user interface: PDF Book Gen Session
A screenshot of the AutoGen Studio interface shows results when two agents are used to address the task, “Create a 4-page kids’ .pdf book with details and pictures about weather patterns in Seattle”.

AutoGen Studio is the next step forward in enabling developers to advance the multi-agent paradigm. We want to make multi-agent solutions responsibly available to diverse audiences – from academic researchers to professional developers across industries – who want to build multi-agent applications to solve real-world problems. Imagine having access to agents that can automate your vacation planning and grocery shopping, manage your personal finances, help you accomplish your learning goals, or perform any other task you care about. How would you build such agents? What capabilities would you give them? How would you make them work together? How would you ensure they are working as intended?

These questions motivated us to build AutoGen Studio. With AutoGen Studio, developers can rapidly build, test, deploy, and share agents and agent teams (workflows) with the community.

Note: AutoGen is primarily a developer tool to enable rapid prototyping and research. It is not a production-ready tool. Please see the GitHub repository and documentation for instructions on how to get started.

What can you do with AutoGen Studio right now?

We built AutoGen Studio with the following goals in mind:  

  • Lower the barrier to entry in building multi-agent applications  
  • Facilitate rapid prototyping and testing of multi-agent solutions
  • Cultivate expertise and community by allowing users to share and re-use this technology 

With AutoGen Studio’s early release (v 0.1.0), users can rapidly author agent workflows via a user interface, interactively test and debug agents, reuse artifacts, and deploy workflows.

The video above shows how users can create skills and models, attach them to agents, create agent workflows, and test and deploy them in AutoGen Studio, all in a few clicks.

Rapidly author agent workflows

AutoGen Studio provides a “Build” section where users can choose from a library of pre-defined agents and compose them into teams (workflows) that can address tasks in minutes. Furthermore, users can customize agents and agent teams with foundation models, prompts, skills (Python functions that accomplish a specific task, e.g., fetching the weather from a weather provider), and workflows via a graphical user interface. Workflows may be sequential (where agents act in a predefined order) or autonomous chat (where the order in which agents act may be driven by a large language model or custom logic, based on the state of the task).
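To make the notion of a skill concrete, here is a minimal sketch of one. A skill is simply a Python function with a descriptive name and docstring that an agent can call; the function name, parameters, and the public weather endpoint below are illustrative assumptions, not part of AutoGen itself.

```python
# A minimal example skill: a plain Python function an agent can call.
# The wttr.in endpoint and the function's shape are illustrative only.
import requests

def fetch_weather(city: str) -> str:
    """Return a one-line weather summary for the given city."""
    response = requests.get(f"https://wttr.in/{city}?format=3", timeout=10)
    response.raise_for_status()
    return response.text
```

Once defined in the “Build” section, a skill like this can be attached to any agent, which can then invoke it while addressing a task.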

AutoGen Studio user interface: agent configuration
In AutoGen Studio, agents can be configured via the user interface. Models and skills can be associated with agents, and agents can be composed into autonomous chat and sequential workflows.

Debug and test agents

AutoGen Studio allows developers to immediately test workflows on a variety of tasks and review the resulting artifacts (such as images, code, and documents). Developers can also review the “inner monologue” of agent workflows as they address tasks and view profiling information, such as the costs associated with a run (the number of turns and tokens) and agent actions (whether tools were called and the outcomes of code execution).

AutoGen Studio user interface: profile sample workflow
AutoGen Studio user interface: sample workflow
In AutoGen Studio, users can test workflows, see results, and view visualizations that profile agent actions (such as how often tools were used or code was executed).

Artifact reuse and deployment

Users can download the skills, agents, and workflow configurations they create, as well as share and reuse these artifacts. AutoGen Studio also offers a seamless process to export workflows and deploy them as application programming interfaces (APIs) that can be consumed in other applications.

Specifically, workflows can be exported as JavaScript Object Notation (JSON) files and loaded into any Python application, launched as an API endpoint from the command line, or wrapped in a Dockerfile that can be deployed on cloud services like Azure Container Apps or Azure Web Apps.
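As a sketch of what loading an exported workflow looks like in Python, the snippet below runs a task against a local workflow.json export. It assumes the WorkflowManager helper from the autogenstudio package; treat the exact names and signature as assumptions and check the current documentation.

```python
# Hedged sketch: run an exported AutoGen Studio workflow from Python.
# Assumes a local workflow.json export and the autogenstudio package;
# the WorkflowManager name and signature may differ across versions.
from autogenstudio import WorkflowManager

workflow_manager = WorkflowManager(workflow="workflow.json")
workflow_manager.run(message="Summarize weather patterns in Seattle.")
```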

AutoGen Studio user interface: export workflow
In AutoGen Studio, users can export agent workflows as a JSON configuration file and then reuse them in any Python application, launch them as an API from the command line, or deploy them on a cloud service like Azure Container Apps or Azure Web Apps.

What is the community creating with AutoGen Studio?

Over the last few months, we have shared an early version of AutoGen Studio, which has been downloaded more than 154,000 times on PyPI (January – May 2024). Our observations of early usage patterns (based on feedback from social platforms like GitHub discussions, Discord, and YouTube) suggest that AutoGen Studio is attracting a new group of users who have basic technical capabilities (that is, they can install the tool) and are interested in rapidly testing out ideas but have limited programming skills.

We have seen these users prototype examples covering tasks like travel planning, PDF brochure generation, market research, structured data extraction, video generation, and visualization generation, among others. Importantly, these tasks are accomplished simply by defining agents, giving them access to large language models and skills, adding the agents to a workflow, and running tasks with that workflow.

Users are exploring early use cases such as report/book generation, as seen in the screenshot above. Here, two agents are defined and given access to skills for generating images. The agents are then composed into a workflow where messages and actions are exchanged to solve the task of generating a PDF report.

Open research questions and next steps

Orchestrating teams of agents that can explore plans, reflect on actions, and collaborate offers opportunities to build tools that address challenging tasks. We believe that we are just scratching the surface of what may be possible with the multi-agent paradigm, and much is unknown about how best to harness foundation models, let alone foundation model-based agents and multi-agent solutions.

This leaves open many opportunities for further research.

For example, the sophisticated interplay between agents in multi-agent paradigms, particularly in increasingly complex and dynamic domains, highlights many opportunities for multi-agent evaluation and tooling. Open questions include:

  • How can we measure the performance, reliability, and reusability of agents across tasks?
  • How can we better understand the strengths and limitations of agents?
  • How can we explore alternative scenarios and outcomes?
  • How can we compare different agent architectures and collaboration protocols?

These questions require novel methods and metrics that can capture the multi-faceted aspects of multi-agent paradigms and provide actionable insights for developers and users.

As our understanding of the multi-agent paradigm matures, another opportunity is in distilling design patterns and best practices for building effective agent teams for different types of tasks. For instance:

  • What is the optimal number and composition of agents for a given problem?
  • What is the best way to distribute responsibilities and coordinate actions among agents?
  • What are the trade-offs between centralized and decentralized control, or between homogeneous and heterogeneous agents?
  • How can we leverage human oversight and feedback to improve agent reliability and safety?

These questions require systematic studies and empirical evaluations to discover the key dimensions and principles for designing multi-agent solutions.

Finally, as agents become more long-lived and ubiquitous in our digital world, an open challenge is in automating and optimizing the agent-creation process itself. For example:

  • How can we dynamically spawn agents based on the task requirements and available resources?
  • How can we tune agent parameters and workflow configurations to achieve the best performance?
  • How can we adapt agent teams to changing environments and user preferences?

Future design improvements

Naturally, we see AutoGen Studio as a potential vehicle to study many of these research questions – from improvements in the user experience of authoring workflows to a gallery of shareable artifacts to advanced tools for making sense of agent behaviors.

We are currently working on a new drag-and-drop experience in AutoGen Studio, designed to transform how users author multi-agent workflows. Our new visual canvas allows users to easily orchestrate and connect agents, providing an intuitive interface for defining collaboration dynamics.

AutoGen Studio user interface: visual workflow design
A new visual canvas interface for AutoGen allows users to easily orchestrate and connect agents, providing an intuitive interface for defining collaboration dynamics. Entities such as skills and models can be associated with agents via drag-and-drop interactions.

Visual workflow design: The heart of our enhanced user interface is a visual canvas where you can literally see your workflow come to life. Drag and drop different agents onto the canvas to build complex conversation patterns. This graphical approach not only simplifies the initial setup but also makes the process of modifying agents and workflows more intuitive.

A new visual canvas interface for AutoGen allows users to both visualize agent interactions and update properties of each agent in the same view pane.

Configurable agents, models, and skills: Customize each agent’s role and skills through simple, direct interactions on the canvas. Whether you’re adding new capabilities or tweaking existing ones, the process is straightforward and user-friendly.

AutoGen Studio user interface: dynamic prototyping and testing
The proposed visual canvas interface for AutoGen will explore updated visualization of agent internal monologues for improved debugging.

Dynamic prototyping and testing: Experimentation is key to perfecting agent workflows. With our new interface, you can prototype various agent configurations and immediately test them in a live environment. This real-time interaction allows you to chat with the workflow, observe all agent messages, and pinpoint areas for improvement on the fly.

AutoGen Studio community gallery
The new proposed design explores a gallery of curated workflows and entities (such as skills and agents) that can be reused.

Finally, we are developing a community gallery within AutoGen Studio where users can share, discover, and learn from one another. This gallery will allow you to publish your workflows, agents, and skills, fostering a collaborative environment where everyone can benefit from shared knowledge and innovations.

Note on responsible AI: Promoting safe and ethical multi-agent solutions

AutoGen Studio is designed to provide a low-code environment for rapidly prototyping and testing multi-agent workflows. Our goal is to responsibly advance research and practice in solving problems with multiple agents and to develop tools that contribute to human well-being. As with AutoGen, we are committed to implementing features in AutoGen Studio that promote safe and reliable outcomes. For example, AutoGen Studio offers profiling tools to make sense of agent actions, and safeguards such as support for Docker environments for code execution. This helps ensure that agents operate within controlled and secure environments, reducing the risk of unintended or harmful actions. For more information on our approach to responsible AI in AutoGen, please refer to the transparency FAQs: https://github.com/microsoft/autogen/blob/main/TRANSPARENCY_FAQS.md. Finally, AutoGen Studio is not production-ready; that is, it does not implement authentication and other security measures that are required for production deployments.
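As an illustration of the Docker safeguard, the underlying AutoGen framework lets an agent's code execution be confined to a container via its code_execution_config; this is a minimal sketch, and the agent name and working directory are illustrative choices.

```python
# Minimal sketch: sandbox agent code execution inside Docker.
# Requires a running Docker daemon; the agent name and work_dir
# are illustrative choices, not fixed by the framework.
from autogen import UserProxyAgent

user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    code_execution_config={
        "work_dir": "coding",  # where generated code and outputs are written
        "use_docker": True,    # execute generated code in a Docker container
    },
)
```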

Acknowledgements 

We would like to thank members of the open-source software (OSS) community and the AI Frontiers organization at Microsoft for discussions and feedback along the way. Specifically, we would like to thank Piali Choudhury, Ahmed Awadallah, Robin Moeur, Jack Gerrits, Robert Barber, Grace Proebsting, Michel Pahud, and others for feedback and comments.

Ideas: Solving network management puzzles with Behnaz Arzani

Behind every emerging technology is a great idea propelling it forward. In the new Microsoft Research Podcast series, Ideas, members of the research community at Microsoft discuss the beliefs that animate their research, the experiences and thinkers that inform it, and the positive human impact it targets. 

In this episode, host Gretchen Huizinga talks with Principal Researcher Behnaz Arzani. Arzani has always been attracted to hard problems, and there’s no shortage of them in her field of choice—network management—where her contributions to heuristic analysis and incident diagnostics are helping the networks people use today run more smoothly. But the criteria she uses to determine whether a challenge deserves her time have evolved. These days, a problem must appeal across several dimensions: Does it answer a hard technical question? Would the solution be useful to people? And … would she enjoy solving it?

Transcript

[TEASER] 

[MUSIC PLAYS UNDER DIALOGUE]

BEHNAZ ARZANI: I guess the thing I’m seeing is that we are freed up to dream more—in a way. Maybe that’s me being too … I’m a little bit of a romantic, so this is that coming out a little bit, but it’s, like, because of all this, we have the time to think bigger, to dream bigger, to look at problems where maybe five years ago, we wouldn’t even dare to think about.

[TEASER ENDS]

GRETCHEN HUIZINGA: You’re listening to Ideas, a Microsoft Research Podcast that dives deep into the world of technology research and the profound questions behind the code. I’m Dr. Gretchen Huizinga. In this series, we’ll explore the technologies that are shaping our future and the big ideas that propel them forward.

[MUSIC FADES]

My guest today is Behnaz Arzani. Behnaz is a principal researcher at Microsoft Research, and she’s passionate about the systems and networks that provide the backbone to nearly all our technologies today. Like many in her field, you may not know her, but you know her work: when your networks function flawlessly, you can thank people like Behnaz Arzani. Behnaz, it’s been a while. I am so excited to catch up with you today. Welcome to Ideas!


BEHNAZ ARZANI: Thank you. And I’m also excited to be here.

HUIZINGA: So since the show is about ideas and leans more philosophical, I like to start with a little personal story and try to tease out anything that might have been an inflection point in your life, a sort of aha moment, or a pivotal event, or an animating “what if,” we could call it. What captured your imagination and got you inspired to do what you’re doing today?

ARZANI: I think that it was a little bit of an accident and a little bit of just chance, I guess, but for me, this happened because I don’t like being told what to do! [LAUGHTER] I really hate being told what to do. And so, I got into research by accident, mostly because it felt like a job where that wouldn’t happen. I could pick what I wanted to do. So, you know, a lot of people come talking about how they were the most curious kids and they all—I wasn’t that. I was a nerd, but I wasn’t the most curious kid. But then I found that I’m attracted to puzzles and hard puzzles and things that I don’t know how to answer, and so that gravitated me more towards what I’m doing today. Things that are basically difficult to solve … I think are difficult to solve.

HUIZINGA: So that’s your inspiring moment? “I’m a bit of a rebel, and …”

ARZANI: Yup!

HUIZINGA: … I like puzzles … ”?

ARZANI: Yup! [LAUGHTER] Which is not really a moment. Yeah, I can’t point to a moment. It’s just been a journey, and it’s just, like, been something that has gradually happened to me, and I love where I am …

HUIZINGA: Yeah …

ARZANI: … but I can’t really pinpoint to like this, like this inspiring awe-drop—no.

HUIZINGA: OK. So let me ask you this: is there nobody in this building that tells you what to do? [LAUGHS]

ARZANI: There are people who have tried, [LAUGHS] but …

HUIZINGA: Oh my gosh!

ARZANI: No, it doesn’t work. And I think if you ask them, they will tell you it hasn’t worked.

HUIZINGA: OK. The other side question is, have you encountered a puzzle that has confounded you?

ARZANI: Have I encountered a puzzle? Yes. Incident management. [LAUGHTER]

HUIZINGA: And we’ll get there in the next couple of questions. Before we do, though, I want to know about who might have influenced you earlier. I mean, it’s interesting. Usually if you don’t have a what, there might not be a who attached to it …

ARZANI: No. But I have a who. I have multiple “whos” actually.

HUIZINGA: OK! Wonderful. So tell us a little bit about the influential people in your life.

ARZANI: I think the first and foremost is my mom. I have a necklace I’m holding right now. This is something my dad gave my mom on their wedding day. On one side of it is a picture of my mom and dad; on the other side is both their names on it. And I have it on every day. To my mom’s chagrin. [LAUGHTER] She is like, why? But it’s, like, it helps me stay grounded. And my mom is a person that … she had me while she was an undergrad. She got her master’s. She got into three different PhD programs in her lifetime. Every time, she gave it up for my sake and for my brother’s sake. But she’s a woman that taught me you can do anything you set your mind to and that you should always be eager to learn. She was a chemistry teacher, and even though she was a chemistry teacher, she kept reading new books. She came to the US to visit me in 2017, went to a Philadelphia high school, and asked, can I see your chemistry books? I want to see what you’re teaching your kids. [LAUGHTER] So that’s how dedicated she is to what she does. She loves what she does. And I could see it on her face on a daily basis. And at some point in my life a couple of years ago, I was talking to my mom about something, and she said, tell yourself, “I’m stronger than my mom.”

HUIZINGA: Oh my gosh.

ARZANI: And that has been, like, the most amazing thing to have in the back of my head because I view my mom as one of the strongest people I’ve ever met, and she’s my inspiration for everything I do.

HUIZINGA: Tell yourself you’re stronger than your mom. … Did you?

ARZANI: I’m not stronger than my mom, I don’t think … [LAUGHS]

HUIZINGA: [LAUGHS] You got to change that narrative!

ARZANI: But, yes, I think it’s just this thing of, like, “What would Mom do?” is a great thing to ask yourself, I think.

HUIZINGA: I love that. Well, and so I would imagine, though, that post-, you know, getting out of the house, you’ve had instructors, you’ve had professors, you’ve had other researchers. I mean, anyone else that’s … ?

ARZANI: Many! And in different stages of your life, different people step into that role, I feel like. One of the first people for me was Jen Rexford, and she is just an amazing human being. She’s an amazing researcher, hands down. Her work is awesome, but also, she’s an amazing human being, as well. And that just makes it better.

HUIZINGA: Yeah.

ARZANI: And then another person is Mohammad Alizadeh, who’s at MIT. And actually, let’s see, I’m going to keep going …

HUIZINGA: Good.

ARZANI: a little with people—Mark Handley. When I was a PhD student, I would read their papers, and I’d be like, wow! And, I want to be like you!

HUIZINGA: So linking that back to your love of puzzles, were these people that you admired good problem solvers or … ?

ARZANI: Oh, yeah! I think Jen is one of those who … a lot of her work is also practical, like, you know, straddles a line between both solving the puzzle and being practical and being creative and working with theorists and working with PL people. So she’s also collaborative, which is, kind of, my style of work, as well. Mohammad is more of a theorist, and I love … like more the theoretical aspect of problems that I solve. And so, like, just the fact that he was able to look at those problems and thinks about those problems in those ways. And then Mark Handley’s intuition about problems—yeah, I can’t even speak to that!

HUIZINGA: That’s so fascinating because you’ve identified three really key things for a researcher. And each one is embodied in a person. I love that. And because I know who you are, I know we’re going to get to each of those things probably in the course of all these questions that I’ll ask you. [LAUGHTER] So we just spent a little time talking about what got you here and who influenced you along the way. But your life isn’t static. And at each stage of accomplishment, you get a chance to reflect and, sort of, think about what you got right, what you got wrong, and where you want to go next. So I wonder if you could take a minute to talk about the evolution of your values as a researcher, collaborator, and colleague and then a sort of “how it started/how it’s going” thing.

ARZANI: Hmm … For me, I think what I’ve learned is to be more mindful—about all of it. But I think if I talk about the evolution, when you’re a PhD student, especially if you’re a PhD student from a place that’s not MIT, that’s not Berkeley, which is where I was from, my main focus was proving myself. I mean, for women, always, we have to prove ourselves. But, like, I think if you’re not from one of those schools, it’s even more so. At least that’s how I felt. That might not be the reality, but that’s how you feel. And so you’re always running to show this about yourself. And so you don’t stop to think how you’re showing up as a person, as a researcher, as a collaborator. You’re not even, like, necessarily reflecting on, are these the problems that I enjoy solving? It’s more of, will solving this problem help me establish myself in this world that requires proving yourself and is so critical and all of that stuff? I think now I stop more. I think more, is this a problem that I would enjoy solving? I think that’s the most important thing. Would other people find it useful? Is it solving a hard technical question? And then, in collaborations, I’m being more mindful that I show up in a way that basically allows me to be a good person the way I want to be in my collaboration. So as researchers, we have to be critical because that’s how science evolves. Not all work is perfect. Not all ideas are the best ideas. That’s just fundamental truth. Because we iterate on each other’s ideas until we find the perfect solution to something. But you can do all of these things in a way that’s kind, in a way that’s mindful, in a way that respects other people and what they bring to the table. And I think what I’ve learned is to be more mindful about those things.

HUIZINGA: How would you define mindful? That’s an interesting word. It has a lot of baggage around it, you know, in terms of how people do mindfulness training. Is that what you’re talking about, or is it more, sort of, intentional?

ARZANI: I think it’s both. So I think one of the things I said—I think when I got into this booth even—was, I’m going to take a breath before I answer each question. And I think that’s part of it, is just taking a breath to make sure you’re present is part of it. But I think there is more to it than that, which is I don’t think we even think about it. I think if I … when you asked me about the evolution of how I evolved, I never thought about it.

HUIZINGA: No.

ARZANI: I was just, like, running to get things done, running to solve the question, running to, you know, find the next big thing, and then you’re not paying attention to how you’re impacting the world in the process.

HUIZINGA: Right.

ARZANI: And once you start paying attention, then you’re like, oh, I could do this better. I can do that better. If I say this to this person in that way, that allows them to do so much more, that encourages them to do so much more.

HUIZINGA: Yeah, yeah.

ARZANI: So …

HUIZINGA: You know, when you started out, you said, is this a problem I would enjoy solving? And then you said, is this a problem that somebody else needs to have solved? Which is sort of like “do I like it?”—it goes back to Behnaz at the beginning: don’t tell me what to do; I want to do what I want to do. Versus—or and is this useful to the world? And I feel like those two threads are really key to you.

ARZANI: Yes. Basically, I feel like that defines me as a researcher, pretty much. [LAUGHS] Which is, you know, I was one of the, you know, early people … I wouldn’t say first. I’m not the first, I don’t think, but I was one of the early people who was talking about using machine learning in networking. And after a while, I stopped because I wasn’t finding it fun anymore, even though there was so much hype about, you know, let’s do machine learning in networking. And it’s not because there’s not a lot of technical stuff left to do. You can do a lot of other things there. There’s room to innovate. It’s just that I got bored.

HUIZINGA: I was just going to say, it’s still cool, but Behnaz is bored! [LAUGHTER] OK, well, let’s start to talk a little bit about some of the things that you’re doing. And I like this idea of a researcher, even a person, having a North Star goal. It sounds like you’ve got them in a lot of areas of your life, and you’ve said your North Star goal, your research goal, is to make the life of a network operator as painless as possible. So I want to know who this person is. Walk us through a day in the life of a network operator and tell us what prompted you to want to help them.

ARZANI: OK, so it’s been years since I actually, like, sat right next to one of them for a long extended period of time because now we’re in different buildings, but back when I was an intern, I was actually, like, kind of, like right in the middle of a bunch of, you know, actual network operators. And what I observed … and see, this was not, like, I’ve never lived that experience, so I’m talking about somebody else’s experience, so bear that in mind …

HUIZINGA: Sure, but at least you saw it …

ARZANI: Yeah. What they do is, there’s a lot of, “OK, we design the network, configure it.” A lot of it goes into building new systems to manage it. Building new systems to basically make it better, more efficient, all of that. And then they also have to be on call so that when any of those things break, they’re the ones who have to look at their monitoring systems and figure out what happened and try to fix it. So they do all of this in their day-to-day lives.

HUIZINGA: That’s tough …

ARZANI: Yeah.

HUIZINGA: OK. So I know you have a story about what prompted you, at the very beginning, to want to help this person. And it had some personal implications. [LAUGHS]

ARZANI: Yeah! So my internship mentor, who’s an amazing person, I thought—and this is, again, my perception as an intern—the day after he was on call, he was so tired, I felt. And so grumpy … grumpier than normal! [LAUGHTER] And, like, my main motivation initially for working in this space was just, like, make his life better!

HUIZINGA: Make him not grumpy.

ARZANI: Yeah. Pretty much. [LAUGHS]

HUIZINGA: Did you have success at that point in your life? Or was this just, like, setting a North Star goal that I’m going to go for that?

ARZANI: I mean, I had done a lot of work in monitoring space, but back then—again, going back to the talk we were having about how to be mindful about problems you pick—back then it was just like, oh, this was a problem to solve, and we’ll go solve it, and then what’s the next thing? So there was not an overarching vision, if you will. It was just, like, going after the next, after the next. I think that’s a point where, like, it all came together of like, oh, all of the stuff that I’m doing can help me achieve this bigger thing.

HUIZINGA: Right. OK, Behnaz, I want to drop anchor, to use a seafaring analogy, for a second and contextualize the language that these operators use. Give us a “networking for neophytes” overview of the tools they rely on and the terminology they use in their day-to-day work so we’re not lost when we start to unpack the problems, projects, and papers that are central to your work.

ARZANI: OK. So I’m going to focus on my pieces of this just because of the context of this question. But a lot of operators … just because a lot of the problems that we work on these days to be able to manage our network, the optimal form of these problems tend to be really, really hard. So a lot of the times, we use algorithms and solutions that are approximate forms of those optimal solutions in order to just solve those problems faster. And a lot of these heuristics, some of them focus on our wide area network, which we call a WAN. Our WANs, basically what they do is they move traffic between datacenters in a way that basically fits the capacity of our network. And, yeah, I think for my work, my current work, to understand it, that’s, I think, enough networking terminology.

HUIZINGA: OK. Well, so you’ve used the term heuristic and optimal. Not with an “s” on the end of it. Or you do say “optimals,” but it’s a noun …

ARZANI: Well, so for each problem definition, usually, there’s one way to formulate an optimal solution. There might be multiple optima that you find, but the algorithm that finds the optimum usually is one. But there might be many, I guess. The ones that I’ve worked on generally have been one.

HUIZINGA: Yeah, yeah. And so in terms of how things work on a network, can you give us just a little picture of how something moves from A to B that might be a problem?

ARZANI: So, for example, we have these datacenters that generate terabytes of traffic and—terabytes per second of traffic—that wants to move from point A to point B, right. And we only have finite network capacity, and these, what we call, “demands” between these datacenters—and you didn’t see me do the air quotes, but I did the air quotes—so they go from point A to point B, and so in order to fit this demand in the pipes that we have—and these pipes are basically links in our network—we have to figure out how to send them. And there’s variations in them. So, like, it might be the case that at a certain time of the day, East US would want to send more traffic to West US, and then suddenly, it flips. And that’s why we solve this problem every five minutes! Now assume one of these links suddenly goes down. What do I do? I have to resolve this problem because maybe the path that I initially picked for traffic to go through goes exactly through that failed link. And now that it’s disappeared, all of that traffic is going to fall on the floor. So I have to re-solve that problem really quickly to be able to re-move my traffic and move it to somewhere else so that I can still route it and my customers aren’t impacted. What we’re talking about here is a controller, essentially, that the network operators built. And this controller solves this optimization problem that figures out how traffic should move. When it’s failed, then the same controller kicks in and reroutes traffic. The people who built that controller are the network operators.

HUIZINGA: And so who does the problem-solving or the troubleshooting on the fly?

ARZANI: So hopefully—and this, most of the times, is the case—is we have monitoring systems in place that the operators have built that, like, kind of, signal to this controller that, oh, OK, this link is down; you need to do something.

[MUSIC BREAK]

HUIZINGA: Much of your recent work represents an effort to reify the idea of automated network management and to try to understand the performance of deployed algorithms. So talk about the main topics of interest here in this space and how your work has evolved in an era of generative AI and large language models.

ARZANI: So if you think about it, what generative AI is going to enable, and I’m using the term “going to enable” a little bit deliberately because I don’t think it has yet. We still have to build on top of what we have to get that to work. And maybe I’ll reconsider my stance on ML now that, you know, we have these tools. Haven’t yet but might. But essentially, what they enable us to do is take automated action on our networks. But if we’re allowing AI to do this, we need to be mindful of the risks because AI in my, at least in my head of how I view it, is a probabilistic machine, which, what that means is that there is some probability, maybe a teeny tiny probability, it might get things wrong. And the thing that you don’t want is when it gets things wrong, it gets things catastrophically wrong. And so you need to put guardrails in place, ensure safety, figure out, like, for each action be able to evaluate that action and the risks it imposes long term on your network and whether you’re able to tolerate that risk. And I think there is a whole room of innovation there to basically just figure out the interaction between the AI and the network and where … and actually strategic places to put AI, even.

HUIZINGA: Right.

ARZANI: The thing that for me has evolved is I used to think we just want to take the human out of the equation of network management. The way I think about it now is there is a place for the human in the network management operation because sometimes human has context and that context matters. And so I think what the, like, for example, we have this paper in HotNets 2023 where we talk about how to put an LLM in the incident management loop, and then there, we carefully talk about, OK, these are the places a human needs to be involved, at least given where LLMs are right now, to be able to ensure that everything happens in a safe way.

HUIZINGA: So go back to this “automated network management” thing. This sounds to me like you’re in a space where it could be, but it isn’t ready yet …

ARZANI: Yeah.

HUIZINGA: … and without, sort of, asking you to read a crystal ball about it, do you feel like this is something that could be eventually?

ARZANI: I hope so. This is the best thing about research. You get to be like, yeah!

HUIZINGA: Yeah, why not?

ARZANI: Why not? And, you know, maybe somebody will prove me wrong, but until they do, that’s what I’m working towards!

HUIZINGA: Well, right now it’s an animating “what if?”

ARZANI: Yeah.

HUIZINGA: Right?

ARZANI: Yeah.

HUIZINGA: This is a problem Behnaz is interested in right now. Let’s go!

ARZANI: Yeah. Pretty much. [LAUGHTER]

HUIZINGA: OK. Behnaz, the systems and networks that we’ve come to depend on are actually incredibly complex. But for most of us, most of the time, they just work. There’s only drama when they don’t work, right? But there’s a lot going on behind the scenes. So I want you to talk a little bit about how the cycle of configuring, managing, reconfiguring, etc., helps keep the drama at bay.

ARZANI: Well … you reminded me of something! So when I was preparing my job … I’m going to tell this story really, really quickly. But when I was preparing my job talk, somebody showed me a tweet. In 2014, I think, people started calling 911 when Facebook was down! Because of a networking problem! [LAUGHS] Yeah. So that’s a thing. But, yeah, so network availability matters, and we don’t notice it until it’s actually down. But that aside, back to your question. So I think what operators do is they build systems in a way that tries to avoid that drama as much as possible. So, for example, they try to build systems that these systems configure the network. And one of my dear friends, Ryan Beckett, works on intent-driven networking that essentially tries to ensure that what the operators intend with their configurations matches what they actually push into the network. They also monitor the network to ensure that as soon as something bad happens, automation gets notified. And there’s automation also that tries to fix these problems when they happen as much as possible. There’s a couple of problems that happen in the middle of this. One of them is our networks continuously change, and what we use in our networks changes. And there’s so many different pieces and components of this, and sometimes what happens is, for example, a team decides to switch from one protocol to a different protocol, and by doing that, it impacts another team’s systems and monitoring and what expectations they had for their systems, and then suddenly it causes things to go bad …

HUIZINGA: Right.

ARZANI: And they have to develop new solutions taking into account the changes that happened. And so one of the things that we need to account for in this whole process is how evolution is happening. And like evolution-friendly, I guess, systems, maybe, is how you should be calling it.

HUIZINGA: Right.

ARZANI: But that’s one. The other part of it that goes into play is, most of the time you expect a particular traffic characteristic, and then suddenly, you have one fluke event that, kind of, throws all of your assumptions out the window, so …

HUIZINGA: Right. So it’s a never-ending job …

ARZANI: Pretty much.

HUIZINGA: It’s about now that I ask all my guests what could possibly go wrong if, in fact, you got everything right. And so for you, I’d like to earth this question in the broader context of automation and the concerns inherent in designing machines to do our work for us. So at an earlier point in your career—we talked about this already—you said you believed you could automate everything. Cool. Now you’re not so much on that. Talk about what changed your thinking and how you’re thinking now.

ARZANI: OK, so the shallow answer to that question—there’s a shallow answer, and there’s a deeper answer—the shallow answer to that question is I watched way too many movies where robots took over the world. And honestly speaking, there’s a scenario that you can imagine where automation starts to get things wrong and then keeps getting things wrong, and wrong, not by the definition of automation. Maybe they’re doing things perfectly by the objectives and metrics that you used to design them …

HUIZINGA: Sure.

ARZANI: … but they’re screwing things up in terms of what you actually want them to do.

HUIZINGA: Interesting.

ARZANI: And if everything is automated and you don’t leave yourself an intervention plan, how are you going to take control back?

HUIZINGA: Right. So this goes back to the humans-in-the-loop/humans-out-of-the-loop. And if I remember in our last podcast, we were talking about humans out of the loop.

ARZANI: Yeah.

HUIZINGA: And you’ve already talked a bit about what the optimal place for a human to be is. Is the human always going to have to be in the loop, in your opinion?

ARZANI: I think it’s a scenario where you always give yourself a way to interrupt. Like, always put a back door somewhere. When we notice things go bad, we have a way that’s foolproof that allows us to shut everything down and take control back to ourselves. Maybe that’s where we go.

HUIZINGA: How do you approach the idea of corner cases?

ARZANI: That’s essentially what my research right now is, actually! And I love it, which is essentially figuring out, in a foolproof way, all the corner cases.

HUIZINGA: Yeah?

ARZANI: Can you build a tool that will tell you what the corner cases are? Now, granted, what we focus on is performance corner cases. Nikolaj Bjørner, in RiSE—so RiSE is Research in Software Engineering—is working on, how do you do verification corner cases? But all of them, kind of, have a hand-in-hand type of, you know, Holy Grail goal, which is …

HUIZINGA: Sure.

ARZANI: … how do you find all the corner cases?

HUIZINGA: Right. And that, kind of, is the essence of this “What could possibly go wrong?” question, is looking in every corner …

ARZANI: Correct.

HUIZINGA: … for anything that could go wrong. So many people in the research community have observed that the speed of innovation in generative AI has shrunk the traditional research-to-product timeline, and some people have even said everyone’s an applied researcher now. Or everyone’s a PM. [LAUGHS] Depends on who you are! But you have an interesting take on this Behnaz, and it reminds me of a line from the movie Nanny McPhee: “When you need me but do not want me, then I will stay. When you want me but no longer need me, I have to go.” So let’s talk a little bit about your perspective on this idea-to-ideation pipeline. How and where are researchers in your orbit operating these days, and how does that impact what we might call “planned obsolescence” in research?

ARZANI: I guess the thing I’m seeing is that we are freed up to dream more—in a way. Maybe that’s me being too … I’m a little bit of a romantic, so this is that coming out a little bit, but it’s, like, because of all this, we have the time to think bigger, to dream bigger, to look at problems where maybe five years ago, we wouldn’t even dare to think about. We have amazingly, amazingly smart, competent people in our product teams. Some of them are actually researchers. So there’s, for example, the Azure systems research group that has a lot of people that are focused on problems in our production systems. And then you have equivalents of those spread out in the networking sphere, as well. And so a lot of complex problems that maybe like 10 years ago Microsoft Research would look at nowadays they can handle themselves. They don’t need us. And that’s part of what has allowed us to now go and be like, OK, I’m going to think about other things. Maybe things that, you know, aren’t relevant to you today, but maybe in five years, you’ll come in and thank me for thinking about this!

HUIZINGA: OK. Shifting gears here! In a recent conversation, I heard a colleague refer to you as an “idea machine.” To me, that’s one of the greatest compliments you could get. But it got me wondering, so I’ll ask you: how does your brain work, Behnaz, and how do you get ideas?

ARZANI: Well, this has been, to my chagrin, one of the realities of life about my brain apparently. So I never thought of this as a strength. I always thought about it as a weakness. But nowadays, I’m like, oh, OK, I’m just going to embrace this now! So I have a random brain. It’s completely random—so, like, it actually happens, like, you’re talking, and then suddenly, I say something that seems to other people like it came out of left field. I know how I got there. It’s essentially kind of like a Markov chain. [LAUGHTER] So a Markov chain is essentially a number of states, and there’s a certain probability you can go from one state to the other state. And, actually, one of the things I found out about myself is I think through talking for this exact reason. Because people seed this random Markov chain by what they say, and it suddenly goes into different places, and that’s how ideas come about. Most of my ideas have actually come through when I’ve been talking to someone.

HUIZINGA: Really?

ARZANI: Yeah.

HUIZINGA: Them talking or you talking?

ARZANI: Both.

HUIZINGA: Really?

ARZANI: So it’s, like, basically, I think the thing that has recently … like, I’ve just noticed more—again, being more mindful does that to you—it’s like I’m talking to someone. I’m like, I have an idea. And it’s usually they said something, or I was saying something that triggered that thought coming up. Which doesn’t happen when … I’m not one of those people that you can put in a room for three days—somebody actually once told me this— [LAUGHTER] like, I’m not one of those people you can put in a room for three days and I come out with these brilliant ideas. It’s like you put me in a room with five other people, then I come out with interesting ideas.

HUIZINGA: Right. … It’s the interaction.

ARZANI: Yeah.

HUIZINGA: I want to link this idea of the ideas that you get to the conversations you have and maybe go back to linking it to the work you’ve recently done. Talk about some of the projects, how they came from idea to paper to product even …

ARZANI: Mm-hm. So like one of the works that we were doing was this work on, like, max-min fair resource allocation that recently got published in NSDI and is actually in production. So the way that came out is I was working with a bunch of other researchers on risk estimation, actually, for incident management of all things, which was, how do you figure out if you want to mitigate a particular problem in a certain way, how much risk it induces as a problem. And so one of the people who was originally … one of the original researchers who built our wide-area traffic engineering controller, which we were talking about earlier, he said, “You’re solving the max-min fair problem.” We’re like, really? And then this caused a whole, like, one-year collaboration where we all sat and evolved this initial algorithm we had into a … So initially it was not a multipath problem. It had a lot of things that didn’t fully solve the problem of max-min fair resource allocation, but it evolved into that. Then we deployed it, and it improved the SWAN solver by a factor of three in terms of how fast it solved the problem and didn’t have any performance impact, or at least very little. And so, yeah, that’s how it got born.

HUIZINGA: OK. So for those of us who don’t know, what is max-min fair resource allocation, and why is it such a problem?

ARZANI: Well, so remember I said that in our wide area network, we route traffic from one place to the other in a way that meets capacity. So one of the objectives we try to meet is we try to be fair in a very specific metric. So max-min is just the metric of fairness we use. And that basically means you cannot improve what you allocated to one piece of traffic in a way that would hurt anybody who has gotten less. So there’s a little bit of a, like, … it’s a mind bend to wrap your head a little bit around the max-min fair definition. But the reason making it faster is important is if something fails, we need to quickly recompute what the paths are and how we route traffic. So the faster we can solve this problem, the better we can adapt to failures.
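To make the metric concrete, here is a toy progressive-filling sketch of max-min fair allocation on a single shared link. Real traffic-engineering solvers like the one described here operate over whole networks with many links and paths, so treat this purely as an illustration.

```python
def max_min_fair(capacity: float, demands: list[float]) -> list[float]:
    """Toy progressive filling: repeatedly split leftover capacity evenly
    among unsatisfied flows, capping each flow at its demand."""
    alloc = [0.0] * len(demands)
    active = set(range(len(demands)))
    remaining = capacity
    while active and remaining > 1e-9:
        share = remaining / len(active)
        for i in list(active):
            grant = min(share, demands[i] - alloc[i])
            alloc[i] += grant
            remaining -= grant
            if demands[i] - alloc[i] <= 1e-9:
                active.remove(i)
    return alloc

# Capacity 10 shared by demands [2, 4, 8]: no flow can be given more
# without hurting a flow that already has less, so the result is [2, 4, 4].
print(max_min_fair(10, [2, 4, 8]))
```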

HUIZINGA: So talk a little bit about some of the work that started as an idea and you didn’t even maybe know that it was going to end up in production.

ARZANI: There was this person from Azure Networking came and gave a talk in our group. And he’s a person I’ve known for years, so I was like, hey, do you want to jump on a meeting and talk? So he came into that meeting, and I was like, OK, what are some of the things you’re curious about these days? You want to answer these days? And it was like, yeah, we have this heuristic we’re using in our traffic engineering solution, and essentially what it does is to make the optimization problem we solve smaller. If a piece of traffic is smaller than a particular, like, arbitrary threshold, we just send it on a shortest path and don’t worry about it. And then we optimize everything else. And I just want to know, like, what is the optimality gap of this heuristic? How bad can this heuristic be? And then I had worked on Stackelberg games before, in my PhD. It never went anywhere, but it was an idea I played around with, and it just immediately clicked in my head that this is the same problem. So Stackelberg games are a leader-follower game where in this scenario a leader has an objective function that they’re trying to maximize, and they control one or multiple of the inputs that their followers get to operate over. The followers, on the other hand, don’t get to control anything about this input. They have their own objective that they’re trying to maximize or minimize, but they have other variables in their control, as well. And what their objective is, is going to control the leader’s payoff. And so this game is happening where the leader has more control in this game because it’s, kind of, like the followers are operating in subject to whatever the leader says, … right. But the leader is impacted by what the followers do. And so this dynamic is what they call a Stackelberg game. And the way we map the MetaOpt problem to this is the leader in our problem wants to maximize the difference between the optimal and the heuristic. It controls the inputs to both the optimal and the heuristic. And now this optimal and heuristic algorithms are the followers in that game. They don’t get to control the inputs, but they have other variables they control, and they have objectives that they want to maximize or minimize.
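As a toy illustration of the leader's job in this game, finding inputs where the heuristic falls furthest behind the optimal, consider two parallel paths and the threshold heuristic described above. Every number here (capacities, threshold, demands) is invented for illustration, and MetaOpt itself encodes both programs and the leader as one optimization problem rather than evaluating hand-picked inputs.

```python
# Toy gap demo for the threshold heuristic on two parallel A->B paths.
# All numbers are invented; MetaOpt finds such adversarial inputs
# automatically by solving the leader-follower problem, not by hand.
CAP1, CAP2 = 5.0, 5.0  # capacities of the two paths
THRESHOLD = 3.0        # demands below this bypass the optimizer

def optimal(demands):
    # Optimal routing may split traffic freely across both paths.
    return min(sum(demands), CAP1 + CAP2)

def heuristic(demands):
    # Small demands are pinned to the shortest path (path 1);
    # only the remaining demands are optimized across both paths.
    small = sum(d for d in demands if d < THRESHOLD)
    big = sum(d for d in demands if d >= THRESHOLD)
    carried_small = min(small, CAP1)
    carried_big = min(big, (CAP1 - carried_small) + CAP2)
    return carried_small + carried_big

# Adversarial input: many small flows overload the shortest path.
demands = [2.0, 2.0, 2.0, 2.0]
print(optimal(demands) - heuristic(demands))  # optimality gap: 8 - 5 = 3
```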

HUIZINGA: Right.

ARZANI: And so that’s how the Stackelberg-game dynamic comes about. And then we got other researchers in the team involved, and then we started talking, and then it just evolved into this beast right now that is a tool, MetaOpt, that we released, I think, a couple of months ago. And another piece that was really cool was people from ETH Zürich came to us and were like, oh, you guys analyzed our heuristic! We have a better one! Can you analyze this one? And that was a whole fun thing we did where we analyzed their heuristics for them. And, then, yeah …

HUIZINGA: Yeah. So all these things that you’re mentioning, are they findable as papers? Were they presented …

ARZANI: Yes.

HUIZINGA: … at conferences, and where are they in anybody’s usability scenario?

ARZANI: So the MetaOpt tool that I just mentioned, that one is in … it’s an open-source tool. You can go online and search for MetaOpt. You’ll find the tool. We’re here to support anything you need; if you run into issues, we’ll help you fix it.

HUIZINGA: Great. You can probably find all of these papers under publications …

ARZANI: Yes.

HUIZINGA: … on your bio page on the website, Microsoft Research website.

ARZANI: Correct.

HUIZINGA: Cool. If anyone wants to do that. So, Behnaz, the idea of having ideas is cool to me, but of course, part of the research problem is identifying which ones you should go after [LAUGHS] and which ones you shouldn’t. So, ironically, you’ve said you’re not that good at that part of it, but you’re working at getting better.

ARZANI: Yes.

HUIZINGA: So first of all, why do you say that you’re not very good at it? And second of all, what are you doing about it?

ARZANI: So I, as I said, get attracted to puzzles, to hard problems. So most of the problems that I go after are problems I have no idea how to solve. And that tends to be a risk.

HUIZINGA: Yeah.

ARZANI: Where I think people who are better at selecting problems are those who actually have an idea of whether they’ll be able to solve this problem or not. And I never actually asked myself that question before this year. [LAUGHTER] So now I’m trying to get a better sense of, how do I figure out if a problem is solvable or not before I try to solve it? And also, just what makes a good research problem? So what I’m doing is, I’m going back to the era that I thought had the best networking papers, and I’m just trying to dissect what makes those papers good, just to understand better for myself, to be like, OK, what do I want to replicate? Replicate, not in terms of techniques, but in terms of philosophy.

HUIZINGA: So what you’re looking at is how people solve problems through the work that they did in this arena. So what are you finding? Have you gotten any nuggets of …

ARZANI: So a couple. So one of my favorite papers is Van Jacobson’s TCP paper. The intuition is amazing to me. It’s almost like he has a vision of what’s happening, is the best I can describe it. And another example of this is also early-on papers by people like Ratul Mahajan, Srikanth Kandula, those guys, where you see that they start with a smaller example that, kind of, shows how this problem is going to happen and how they’re going to solve it. I mean, I did this in my work all the time, too, but it was never conscious. It’s more of like that goes to that mindfulness thing that I said before, too. It’s like you might be doing some of these already, but you don’t notice what you’re doing. It more of is, kind of, like putting of like, oh, this is what they did. And I do this, too. And this might be a good habit to keep but cultivate into a habit as opposed to an unconscious thing that you’re just doing.

HUIZINGA: Right. You know, this whole idea of going back to what’s been done before, I think that’s a lesson about looking at history, as well, and to say, you know, what can we learn from that? What are we trying to reinvent …

ARZANI: Yeah.

HUIZINGA: … that maybe doesn’t need to be reinvented? Has it helped you to get more targeted on the kinds of problems that you say, “I’m not going to work on that. I am going to work on that”?

ARZANI: To be very, very, very fair, I haven’t done this for a long time yet! This has been …

HUIZINGA: A new thing.

ARZANI: I started this this month, yeah.

HUIZINGA: Oh my goodness!

ARZANI: So we’ll see how far I get and how useful it ends up being! [LAUGHS] [MUSIC BREAK]

HUIZINGA: One of my favorite things to talk about on this show is what my colleague Kristina calls “outrageous” lines of research. And so I’ve been asking all my guests about their most outrageous ideas and how they turned out. So sometimes these ideas never got off the ground. Sometimes they turned out great. And other times, they’ve failed spectacularly. Do you have a story for the “Microsoft Research Outrageous Ideas” file?

ARZANI: I had this question of, if language has grammar, and grammar is what LLMs are learning, which, to my understanding of what people who are experts in this field say, this maybe isn’t that, but if it is the case that grammar is what allows these LLMs to learn how language works, then in networking, we have the equivalent of that, and the equivalent of that is essentially network protocols. And everything that happens in a network, you can define it as an event that happens in a network. You can think of those, like, the events are words in a language. And so, is it going to be the case, and this is a question which is, if you take an event abstraction and encode everything that happens in a network in that event abstraction, can you build an equivalent of an LLM for networks? Now what you would use it for—this is another reason I’ve never worked on this problem—I have no idea! [LAUGHTER] But what this would allow you to do is build the equivalent of an LLM for networking, where actually you just translate that network’s events into, like, this event abstraction, and then the two understand each other. So like a universal language of networking, maybe. It could be cool. Never tried it. Probably a dumb idea! But it’s an idea.

HUIZINGA: What would it take to try it?

ARZANI: Um … I feel like bravery is, I think, one because with any risky idea, there’s a probability that you will fail.

HUIZINGA: As a researcher here at Microsoft Research, when you have this idea, um … and you say, well, I’m not brave enough … even if you were brave enough, who would you have to convince that they should let you do it?

ARZANI: I don’t think anybody!

HUIZINGA: Really?

ARZANI: That’s the whole … that’s the whole point of me being here! I don’t like being told what to do! [LAUGHS]

HUIZINGA: Back to the beginning!

ARZANI: Yeah. The only thing is that, maybe, like, people would be like, what have you been doing in the past six months? And I wouldn’t have … that’s the risk. That’s where bravery comes in.

HUIZINGA: Sure.

ARZANI: The bravery is more of there is a possibility that I have to devote three years of my life into this, to figuring out how to make that work, and I might not be able to.

HUIZINGA: Yes …

ARZANI: And there’s other things. So it’s a tradeoff also of where you put your time.

HUIZINGA: Sure.

ARZANI: So there. Yeah.

HUIZINGA: And if, but … part of it would be explaining it in a way to convince people: if it worked, it would be amazing!

ARZANI: And that’s the other problem with this idea. I don’t know what you would use it for. If I knew what you would use it for, maybe then it would make it worth it.

HUIZINGA: All right. Sounds like you need to spend some more time …

ARZANI: Yeah.

HUIZINGA: …ruminating on it. Um, yeah. The whole cliché of the solution in search of a problem.

ARZANI: Yeah.

HUIZINGA: [LAUGHS] As we close, I want to talk a little bit about some fun things. And so, aside from your research life, I was intrigued by the fact, on your bio page, that you have a rich artistic life, as well, and that includes painting, music, writing, along with some big ideas about the value of storytelling. So I’ll take a second to plug the bio page. People, go look at it because she’s got paintings and cool things that you can link to. As we close, I wonder if you could use this time to share your thoughts on this particular creative pursuit of storytelling and how it can enhance our relationships with our colleagues and ultimately make us better researchers and better people?

ARZANI: I think it’s not an understatement to say I had a life-changing experience through storytelling. The first time I encountered it, it was the most horrific thing I had ever seen! I had gone on Meetup—this was during COVID—to just, like, find places to meet people, build connections and all that, and I saw this event called “Storytelling Workshop,” and I was like, good! I’m good at making up stories, and, you know, that’s what I thought it was. Turns out it’s, you go and tell personal stories about your life that only involve you, that make you deeply vulnerable. And, by the way, I’m Iranian. We don’t do vulnerability. It’s just not a thing. So it was the most scary thing I’ve ever done in my life. But you go on stage and basically talk about your life. And the thing it taught me by both telling my own stories and listening to other people’s stories is that it showed me that you can connect to people through stories, first of all. The best ideas come when you’re actually in it together. Like one of the things that now I say that I didn’t used to say, we, we’re all human. And being human essentially means we have good things about ourselves and bad things about ourselves. And as researchers, we have our strengths as researchers, and we have our weaknesses as researchers. And so when we collaborate with other people, we bring all of that. And collaboration is a sacred thing that we do where we’re basically trusting each other with bringing all of that to the table and being that vulnerable. And so our job as collaborators is essentially to protect that, in a way, and make it safe for everybody to come as they are. And so I think that’s what it taught me, which is, like, basically holding space for that.

HUIZINGA: Yeah. How’s that working?

ARZANI: First of all, I stumbled into it, but there are people who are already “that” in this building …

HUIZINGA: Really?

ARZANI: … that have been for years. It’s just that now I can see them for what they bring, as opposed to before, I didn’t have the vocabulary for it.

HUIZINGA: Gotcha …

ARZANI: But people who don’t, it’s like what I’ve seen is almost like they initially look at you with skepticism, and then they think it’s a gimmick, and then they are like, what is that? And then they become curious, and then they, too, kind of join you, which is very, very interesting to see. But, like, again, it’s something that already existed. It’s just me not being privileged enough to know about it or, kind of, recognize it before.

HUIZINGA: Yeah. Can that become part of a culture, or do you feel like it is part of the culture here at Microsoft Research, or … ?

ARZANI: I think this depends on how people individually choose to show up. And I think we’re all, at the end of the day, individuals. And a lot of people are that way without knowing they are that way. So maybe it is already part of the culture. I haven’t necessarily sat down and thought about it deeply, so I can’t say.

HUIZINGA: Yeah, yeah. But it would be a dream to have the ability to be that vulnerable through storytelling as part of the research process?

ARZANI: I think so. We had a storytelling coach that would say, “Tell your story, change the world.” And as researchers, we are attempting to change the world, and part of that is our stories. And so maybe, yeah! And basically, what we’re doing here is, I’m telling my story. So …

HUIZINGA: Yeah.

ARZANI: … maybe you’re changing the world!

HUIZINGA: You know, I’m all in! I’m here for it, as they say. Behnaz Arzani. It is such a pleasure—always a pleasure—to talk to you. Thanks for sharing your story with us today on Ideas.

ARZANI: Thank you.

[MUSIC]

Research Focus: Week of June 10, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.


RELEVANCE: Automatic evaluation framework for LLM responses

Relevance in AI refers to the usefulness of information or actions to a specific task or query. It helps determine the accuracy, effectiveness, efficiency, and user satisfaction of content from search engines, chatbots, and other AI systems.

RELEVANCE (Relevance and Entropy-based Evaluation with Longitudinal Inversion Metrics) is a generative AI evaluation framework designed by researchers at Microsoft to automatically evaluate creative responses from large language models (LLMs). RELEVANCE combines custom tailored relevance assessments with mathematical metrics to ensure AI-generated content aligns with human standards and maintains consistency. Monitoring these metrics over time enables the automatic detection of when the LLM’s relevance evaluation starts to slip or hallucinate.

Custom relevance evaluation alone involves scoring responses against predefined criteria. While these scores provide a direct assessment, they might not capture the full complexity and dynamics of response patterns over multiple evaluations or across different datasets (e.g., model hallucination and model slip). To address this, RELEVANCE integrates mathematical techniques with custom evaluations to ensure LLM response accuracy over time and adaptability to evolving LLM behaviors without requiring manual review.
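To make the longitudinal idea concrete, below is a minimal sketch of monitoring per-response relevance scores for drift. The framework’s actual metrics and thresholds are not spelled out here, so the class, window size, and drift threshold are illustrative assumptions rather than RELEVANCE’s implementation.

```python
# Hypothetical sketch of longitudinal relevance monitoring; all names and
# constants are illustrative assumptions, not the RELEVANCE framework's API.
from collections import deque

class RelevanceMonitor:
    def __init__(self, window: int = 100, drift_threshold: float = 0.1):
        self.scores = deque(maxlen=window)  # rolling window of 0-1 relevance scores
        self.baseline = None                # mean of the first full (trusted) window
        self.drift_threshold = drift_threshold

    def add(self, score: float) -> bool:
        """Record a per-response relevance score; return True if drift is detected."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False                    # not enough history yet
        mean = sum(self.scores) / len(self.scores)
        if self.baseline is None:
            self.baseline = mean            # freeze the first full window as baseline
            return False
        # Flag when average relevance slips well below the baseline window,
        # e.g., when the model starts to slip or the evaluation starts to drift.
        return (self.baseline - mean) > self.drift_threshold
```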


Recyclable vitrimer-based printed circuit boards for sustainable electronics

Printed circuit boards (PCBs) are ubiquitous in electronics and make up a substantial fraction of environmentally hazardous electronic waste when devices reach end-of-life. Their recycling is challenging due to their use of irreversibly cured thermoset epoxies in manufacturing. Researchers at Microsoft and the University of Washington aim to tackle this challenge, and potentially pave the way for sustainability transitions in the electronics industry. In a recent paper, published in Nature Sustainability: Recyclable vitrimer-based printed circuit boards for sustainable electronics, they present a PCB formulation using transesterification vitrimers (vPCBs) and an end-to-end fabrication process compatible with standard manufacturing ecosystems. This cradle-to-cradle life cycle assessment shows substantial environmental impact reduction of vPCBs over conventional PCBs in 11 categories. The team successfully manufactured functional prototypes of internet of things devices transmitting 2.4 GHz radio signals on vPCBs with electrical and mechanical properties meeting industry standards. Fractures and holes in vPCBs are repairable while retaining comparable performance over multiple repair cycles. The researchers also demonstrate a non-destructive recycling process based on polymer swelling with small-molecule solvents. Unlike traditional solvolysis recycling, this swelling process does not degrade the materials. A dynamic mechanical analysis finds negligible catalyst loss, minimal changes in storage modulus, and equivalent polymer backbone composition across multiple recycling cycles. This recycling process achieves 98% polymer recovery, 100% fiber recovery, and 91% solvent recovery to create new vPCBs without performance degradation, potentially paving the way to circularity in electronics.



LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers

Transformer-based models have emerged as one of the most widely used architectures for natural language processing, natural language generation, and image generation. The size of the state-of-the-art models has reached billions of parameters, requiring large amounts of memory and resulting in significant inference latency, even on cutting-edge AI accelerators such as graphics processing units (GPUs). Attempts to meet the low-latency demands of applications relying on such large models do not cater to the computationally distinct nature of different phases during inference and thus fail to utilize the underlying hardware efficiently.

In a recent paper: Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers, researchers from Microsoft propose a scalable technique of computing self-attention for the token-generation phase (decode-phase) of decoder-only transformer models. LeanAttention enables scaling the attention mechanism implementation for the challenging case of long context lengths by re-designing the execution flow for the decode-phase. The researchers show that the associative property of online softmax can be treated as a reduction operation, thus allowing them to parallelize the attention computation over these large context lengths. They extend the “stream-K” style reduction of tiled calculation to self-attention to enable the parallel computation, resulting in near-100% GPU utilization and an average 2.6x attention execution speedup over FlashAttention-2, with up to 8.33x speedup for 512k context lengths.
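The associativity that makes this reduction possible can be demonstrated in a few lines of NumPy: each key/value segment yields a partial result (running max, softmax denominator, unnormalized accumulator), and any two partials merge into one. This is a toy single-query sketch to illustrate the idea, not the paper’s GPU implementation; shapes and segment sizes are arbitrary.

```python
# Toy demonstration that online softmax is an associative reduction, the
# property LeanAttention parallelizes over; not the paper's CUDA kernels.
import numpy as np

def partial_attention(q, k, v):
    """Unnormalized attention over one key/value segment.
    Returns (m, l, acc): running max, softmax denominator, weighted sum."""
    s = k @ q                      # scores for this segment, shape (seg_len,)
    m = s.max()
    p = np.exp(s - m)
    return m, p.sum(), p @ v       # acc has shape (d_v,)

def merge(a, b):
    """Associatively combine two partial results into one."""
    (m1, l1, acc1), (m2, l2, acc2) = a, b
    m = max(m1, m2)
    c1, c2 = np.exp(m1 - m), np.exp(m2 - m)   # rescale to the new running max
    return m, l1 * c1 + l2 * c2, acc1 * c1 + acc2 * c2

rng = np.random.default_rng(0)
n, d = 1024, 16
q, k, v = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, 8))

# Segments could be reduced in parallel and in any order; here, sequentially.
parts = [partial_attention(q, k[i:i + 256], v[i:i + 256]) for i in range(0, n, 256)]
m, l, acc = parts[0]
for p in parts[1:]:
    m, l, acc = merge((m, l, acc), p)
out = acc / l                       # final softmax-normalized attention output

s = k @ q                           # reference: ordinary full softmax attention
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ v
assert np.allclose(out, ref)
```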


WaveCoder: Widespread and Versatile Enhanced Instruction Tuning with Refined Data Generation

Recent research demonstrates that an LLM finetuned on a high-quality instruction dataset can obtain impressive abilities to address code-related tasks. However, existing methods for instruction data generation often produce duplicate data and offer insufficient control over data quality.

In a recent paper: WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation, researchers from Microsoft extend the generalization of instruction tuning by classifying instruction data into four code-related tasks and propose an LLM-based generator-discriminator data processing framework to generate diverse, high-quality instruction data from open-source code. They introduce CodeSeaXDataset, a dataset comprising 19,915 instruction instances across four universal code-related tasks. In addition, they present WaveCoder, a fine-tuned code LLM with widespread and versatile enhanced instruction tuning, designed specifically to improve the instruction tuning of code LLMs. Their experiments show that WaveCoder models outperform other open-source models in generalization ability across different code-related tasks at the same level of fine-tuning scale. Moreover, WaveCoder exhibits high efficiency on code generation tasks.
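In rough outline, a generator-discriminator pipeline of this kind pairs one LLM call that proposes instruction instances from raw code with a second call that filters them. The sketch below is a hedged illustration of that shape only; the task names, prompt, and `generate`/`judge` stand-ins are assumptions, not WaveCoder’s actual prompts or criteria.

```python
# Hedged sketch of an LLM-based generator-discriminator data pipeline in the
# spirit of WaveCoder; `generate` and `judge` stand in for LLM calls, and the
# task list and prompt are illustrative assumptions.
from typing import Callable

TASKS = ["code generation", "code summarization", "code translation", "code repair"]

def build_instruction_data(raw_snippets: list[str],
                           generate: Callable[[str], dict],
                           judge: Callable[[dict], bool]) -> list[dict]:
    dataset = []
    for snippet in raw_snippets:
        for task in TASKS:
            # Generator LLM: turn raw open-source code into an instruction instance.
            candidate = generate(
                f"Task: {task}\nCode:\n{snippet}\n"
                "Write an instruction and a reference answer grounded in this code."
            )
            # Discriminator LLM: keep only diverse, high-quality instances.
            if judge(candidate):
                dataset.append({"task": task, **candidate})
    return dataset
```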


New course offers AutoGen training

DeepLearning.AI (opens in new tab), in collaboration with Microsoft and Penn State University, is offering a short training course: AI Agentic Design Patterns with AutoGen (opens in new tab), centered around the multi-agent framework for next-generation AI applications. Taught by AutoGen creators Chi Wang, principal researcher at Microsoft Research AI Frontiers, and Qingyun Wu, assistant professor at Penn State, the course explores how to use AutoGen to build and customize multi-agent systems, enabling agents to take on different roles and collaborate to accomplish complex tasks. You can learn more details in this video (opens in new tab).

AutoGen was designed to simplify the orchestration, optimization, and automation of LLM workflows, and is adopted widely as a generic programming framework for agentic AI. It offers customizable and conversable agents that leverage the strongest capabilities of the most advanced LLMs, like GPT-4, while addressing their limitations by integrating with humans and tools and having conversations between multiple agents via automated chat.
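For readers who want to try the framework, the following is a minimal two-agent example in the style of AutoGen’s quickstart; the model name, API key placeholder, and task are illustrative.

```python
# Minimal two-agent AutoGen (pyautogen) example; model, key, and task are
# placeholders to adapt to your own setup.
from autogen import AssistantAgent, UserProxyAgent

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "YOUR_API_KEY"}]}

# The assistant reasons and writes code; the user proxy executes that code
# locally and feeds results back, producing an automated multi-turn chat.
assistant = AssistantAgent("assistant", llm_config=llm_config)
user_proxy = UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",   # fully automated; set to "ALWAYS" to stay in the loop
    code_execution_config={"work_dir": "coding", "use_docker": False},
)

user_proxy.initiate_chat(
    assistant,
    message="Write and run Python code to print the first 10 Fibonacci numbers.",
)
```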

Microsoft Research in the news


Superfast Microsoft AI is first to predict air pollution for the whole world 

Nature | June 4, 2024

An AI model developed by Microsoft can accurately forecast weather and air pollution for the whole world — and it does it in less than a minute. The model, called Aurora, also forecasts global weather for ten days.


Chatbot teamwork makes the AI dream work 

Wired | June 6, 2024

LLMs often stumble over math problems because they work by providing statistically plausible text rather than rigorous logical reasoning. Researchers from Microsoft show that having AI agents collaborate can mitigate that weakness.


1-bit LLMs Could Solve AI’s Energy Demands 

IEEE Spectrum | May 30, 2024

“One-bit LLMs open new doors for designing custom hardware and systems specifically optimized for 1-bit LLMs,” — Furu Wei, Microsoft Research.

SIBYL: A machine learning-based framework for forecasting dynamic workloads

This paper was presented at the ACM SIGMOD/Principles of Database Systems Conference (opens in new tab) (SIGMOD/PODS 2024), the premier forum on large-scale data management and databases.


In today’s fast-paced digital landscape, data analysts are increasingly dependent on analytics dashboards to monitor customer engagement and app performance. However, as data volumes increase, these dashboards can slow down, leading to delays and inefficiencies. One solution is to use software designed to optimize how data is physically stored and retrieved, but the challenge remains in anticipating the specific queries analysts will run, a task complicated by the dynamic nature of modern workloads.

In our paper, “SIBYL: Forecasting Time-Evolving Query Workloads,” presented at SIGMOD/PODS 2024, we introduce a machine learning-based framework designed to accurately predict queries in dynamic environments. This innovation allows traditional optimization tools, typically meant for static settings, to seamlessly adapt to changing workloads, ensuring consistent high performance as query demands evolve.



SIBYL’s design and features

SIBYL’s framework is informed by studies of real-world workloads, which show that most are dynamic but follow predictable patterns. We identified the following recurring patterns in how parameters change over time:

  • Trending: Queries that increase, decrease, or remain steady over time.
  • Periodic: Queries that occur at regular intervals, such as hourly or daily.
  • Combination: A mix of trending and periodic patterns.
  • Random: Queries with unpredictable patterns.

These insights, illustrated in Figure 1, form the basis of SIBYL’s ability to forecast query workloads, enabling databases to maintain peak efficiency even as usage patterns shift.

Figure 1. We studied the changing patterns and predictability of database queries by analyzing two weeks’ worth of anonymized data from Microsoft’s telemetry system, which guides decision-making for Microsoft products and services.
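As a concrete illustration, the toy classifier below buckets a parameter’s history into these patterns using a linear-trend fit and an autocorrelation test. SIBYL’s actual featurization and models are far more involved; the method and thresholds here are illustrative assumptions only.

```python
# Toy pattern classifier for a query parameter's time series; thresholds and
# heuristics are illustrative, not SIBYL's actual featurization.
import numpy as np

def classify_pattern(values: np.ndarray, threshold: float = 0.5) -> str:
    t = np.arange(len(values))
    coeffs = np.polyfit(t, values, 1)                  # linear trend fit
    slope = coeffs[0]
    detrended = values - np.polyval(coeffs, t)
    # Autocorrelation at lags > 0, normalized by the lag-0 energy; a strong
    # peak suggests a regular (hourly/daily-style) repeating pattern.
    ac = np.correlate(detrended, detrended, mode="full")[len(values):]
    periodic = ac.max() / (detrended.var() * len(values) + 1e-9) > threshold
    trending = abs(slope) * len(values) > values.std() # total drift vs. noise
    if trending and periodic:
        return "combination"
    if trending:
        return "trending"
    if periodic:
        return "periodic"
    return "random"

print(classify_pattern(np.arange(100.0)))                         # -> "trending"
print(classify_pattern(np.sin(np.linspace(0, 20 * np.pi, 100))))  # -> "periodic"
```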

SIBYL uses machine learning to analyze historical data and parameters to predict queries and arrival times. SIBYL’s architecture, illustrated in Figure 2, operates in three phases:

  • Training: It uses historical query logs and arrival times to build machine learning models.
  • Forecasting: It employs pretrained models to predict future queries and their timing.
  • Incremental fine-tuning: It continuously adapts to new workload patterns through an efficient feedback loop.
Figure 2. An overview of SIBYL’s architecture.

Challenges and innovations in designing a forecasting framework

Designing an effective forecasting framework is challenging, particularly in managing the varying number of queries and the complexity of creating separate models for each type of query. SIBYL addresses these by grouping high-volume queries and clustering low-volume ones, supporting scalability and efficiency. As demonstrated in Figure 3, SIBYL consistently outperforms other forecasting models, maintaining accuracy over different time intervals and proving its effectiveness in dynamic workloads.

Figure 3 compares four forecasting models (History-Based, Random Forest, Vanilla LSTM, and Sibyl-LSTMs) across four workloads (Telemetry, SCOPE, BusTracker, and Sales) on recall, precision, and F-1 score over forecast intervals of 1 hour, 6 hours, 12 hours, and 1 day.

Sibyl-LSTMs surpasses the other forecasting models and maintains stable accuracy across time-interval settings. Vanilla LSTM and Random Forest perform poorly on the Sales workload, which has more outliers and less stable patterns. On the Telemetry workload, the history-based method performs well at the 12-hour interval because the workload’s recurrent queries keep the same parameter values within a day (between the past 12-hour window and the future 12-hour window), but it is ineffective at the one-day interval, as many query parameter values change when crossing the day boundary. The history-based method also yields unsatisfactory results for the other three workloads, which evolve more rapidly and intricately and involve time-related parameters operating on a finer time scale. An ML-based forecasting model is therefore imperative for handling evolving workloads.
Figure 3. SIBYL-LSTM’s accuracy compared with other models in forecasting queries for the next time interval.

SIBYL adapts to changes in workload patterns by continuously learning, retaining high accuracy with minimal adjustments. As shown in Figure 4, the model reaches 95% accuracy after fine-tuning in just 6.4 seconds, nearly matching its initial accuracy of 95.4%.

Figure 4. Fine-tuning results on telemetry workload changes.
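Below is a hedged, self-contained sketch of the shape of this feedback loop: forecast, score, and fine-tune incrementally when accuracy drops below a threshold. The `DummyModel` and batch format are hypothetical stand-ins for Sibyl-LSTMs and its workload traces; the 75% trigger mirrors the threshold used in the paper’s Telemetry example.

```python
# Hedged sketch of SIBYL's detect-and-fine-tune feedback loop; DummyModel and
# the batch format are hypothetical stand-ins, not the framework's internals.
ACCURACY_THRESHOLD = 0.75          # shift-detection trigger (alpha in the paper)

class DummyModel:
    """Stand-in forecaster that predicts the most recently learned value."""
    def __init__(self):
        self.memory = {}
    def forecast(self, key):
        return self.memory.get(key)
    def fine_tune(self, batch):
        for key, observed in batch:     # incremental update, not a full retrain
            self.memory[key] = observed

def feedback_loop(model, batches):
    for batch in batches:               # batch: list of (query key, observed value)
        hits = sum(model.forecast(k) == obs for k, obs in batch)
        if hits / len(batch) < ACCURACY_THRESHOLD:   # workload shift detected
            model.fine_tune(batch)

model = DummyModel()
feedback_loop(model, [
    [("q1", "a"), ("q2", "x")],         # cold start: accuracy 0 -> fine-tune
    [("q1", "a"), ("q2", "x")],         # stable: accuracy 1.0 -> no update
    [("q1", "b"), ("q2", "y")],         # shift: accuracy 0 -> fine-tune again
])
print(model.forecast("q1"))             # -> "b"
```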

To address slow dashboard performance, we tested SIBYL by using it to create materialized views—special data structures that make queries run faster. These views identify common tasks and recommend which ones to store in advance, expediting future queries.

We trained SIBYL using 2,237 queries from anonymized Microsoft sales data over 20 days, enabling us to create materialized views for the following day. Using historical data alone improved query performance by a factor of 1.06, while SIBYL’s predictions achieved a 1.83x improvement. This demonstrates that SIBYL’s ability to forecast future workloads can significantly improve database performance.

Implications and looking ahead

SIBYL’s ability to predict dynamic workloads has numerous applications beyond improving materialized views. It can help organizations efficiently scale resources, leading to reduced costs. It can also improve query performance by automatically organizing data, ensuring that the most frequently accessed data is always available. Moving forward, we plan to integrate more machine learning techniques, making SIBYL even more efficient, reducing the effort needed for setup, and improving how databases handle dynamic workloads, making them faster and more reliable.

Acknowledgments

We would like to thank our paper co-authors for their valuable contributions and efforts: Jyoti Leeka, Alekh Jindal, and Jishen Zhao.

LST-Bench: A new benchmark tool for open table formats in the data lake

This paper was presented at the ACM SIGMOD/Principles of Database Systems Conference (opens in new tab) (SIGMOD/PODS 2024), the premier forum on large-scale data management and databases.


As organizations grapple with ever-expanding datasets, the adoption of data lakes has become a vital strategy for scalable and cost-effective data management. The success of these systems largely depends on the file formats used to store the data. Traditional formats, while efficient in data compression and organization, falter with frequent updates. Advanced table formats like Delta Lake, Apache Iceberg, and Apache Hudi offer promising solutions with easier data modifications and historical tracking, yet their efficacy lies in their ability to handle continuous updates, a challenge that requires extensive and thorough evaluation.

Our paper, “LST-Bench: Benchmarking Log-Structured Tables in the Cloud (opens in new tab),” presented at SIGMOD 2024, introduces an innovative tool designed to evaluate the performance of different table formats in the cloud. LST-Bench builds on the well-established TPC-DS (opens in new tab) benchmark—which measures how efficiently systems handle large datasets and complex queries—and includes features specifically designed for table formats, simplifying the process of testing them under real-world conditions. Additionally, it automatically conducts tests and collects essential data from both the computational engine and various cloud services, enabling accurate performance evaluation.

Flexible and adaptive testing

Designed for flexibility, LST-Bench adapts to a broad range of scenarios, as illustrated in Figure 1. The framework was developed by incorporating insights from engineers, facilitating the integration of existing workloads like TPC-DS, while promoting reusability. For example, each test session establishes a new connection to the data-processing engine, organizing tasks as a series of statements. This setup permits developers to run multiple tasks either sequentially within a single session or concurrently across various sessions, reflecting real-world application patterns.

Figure 1. Workload components in LST-Bench and their relationships. A task is a sequence of SQL statements, while a session is a sequence of tasks that represents a logical unit of work or a user session. A phase is a group of concurrent sessions that must be completed before the next phase can start. Lastly, a workload is a sequence of phases.
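As a compact way to see the hierarchy, the dataclass sketch below mirrors these four components. LST-Bench itself specifies workloads in configuration files, so the field names here are illustrative rather than the tool’s exact schema.

```python
# Illustrative model of the Figure 1 hierarchy; not LST-Bench's actual schema.
from dataclasses import dataclass, field

@dataclass
class Task:
    """A sequence of SQL statements, e.g., the TPC-DS Single User task."""
    statements: list[str]

@dataclass
class Session:
    """A sequence of tasks run over one connection: a logical unit of work."""
    tasks: list[Task]

@dataclass
class Phase:
    """Concurrent sessions that must all finish before the next phase starts."""
    sessions: list[Session]

@dataclass
class Workload:
    """An ordered sequence of phases, e.g., load -> single user -> maintenance."""
    phases: list[Phase] = field(default_factory=list)
```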

The TPC-DS workload comprises the following foundational tasks:

  • Load task: Loads data into tables for experimentation.
  • Single User task: Executes complex queries to test the engine’s upper performance limit.
  • Data Maintenance task: Handles data insertions and deletions.

LST-Bench introduces the following tasks specific to table formats:

  • Optimize task: Compacts the data files within a table.
  • Time Travel task: Enables querying data as it appeared at a specified point in the past.
  • Parameterized Custom task: Allows for the integration of user-defined code to create dynamic workflows.

These features enable LST-Bench to evaluate aspects of table formats that are not covered by TPC-DS, providing deeper insights into their performance, as shown in Figure 2.

Combined into workloads, these tasks assess how a table format handles frequent data modifications over time, table optimization after multiple modifications of varying sizes, simultaneous reading and writing sessions, querying data across different points in time, and the impact of batch-size variations on read query performance.
Figure 2. LST-Bench expands on TPC-DS by introducing a flexible workload representation and incorporating extensions that help users gain insights into table formats previously overlooked by the original benchmark.

A degradation rate metric to measure stability

In addition to these workload extensions, LST-Bench introduces new metrics to evaluate table formats both comprehensively and fairly. It retains the traditional metric categories like performance, storage, and compute efficiency, and it adds a new stability metric called degradation rate. This new metric specifically addresses the impact of accumulating small files in the data lake—a common issue arising from frequent, small updates—providing an assessment of the system’s efficiency over time.

The degradation rate is calculated by dividing a workload into different phases. The degradation rate \(S_{DR}\) is defined as follows:

\[ S_{DR} = \frac{1}{n} \sum_{i=1}^{n} \frac{M_i - M_{i-1}}{M_{i-1}} \]

Here, \(M_i\) represents the performance or efficiency metric value of the \(i^{th}\) iteration of a workload phase, and \(n\) reflects the total number of iterations of that phase. Intuitively, \(S_{DR}\) is the rate at which a metric grows or shrinks, reflecting cumulative effects of changes in the underlying system’s state. This rate provides insight into how quickly a system degrades over time. A stable system demonstrates a low \(S_{DR}\), indicating minimal degradation.
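The formula translates directly into code; a minimal helper is shown below, where the input list holds the metric value for each iteration of a phase, oldest first (so \(M_0\) is the first element).

```python
# Direct translation of the degradation-rate formula; expects at least two
# metric values, ordered oldest to newest (M_0, M_1, ..., M_n).
def degradation_rate(metrics: list[float]) -> float:
    deltas = [(m_i - m_prev) / m_prev
              for m_prev, m_i in zip(metrics, metrics[1:])]
    return sum(deltas) / len(deltas)

# Example: query latency creeping up across iterations of a phase yields a
# positive rate, i.e., the system is degrading by ~10% per iteration.
print(degradation_rate([10.0, 11.0, 12.1]))   # ~0.10
```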

LST-Bench implementation

LST-Bench features a Java-based client application that runs SQL workloads on various engines, and it lets users define task, session, and phase libraries so that workload components can be reused. Users can reference these libraries in their workload definitions, add new task templates, or create entirely new task libraries to model specific scenarios.

LST-Bench also includes a processing module that consolidates experimental results and calculates metrics to provide insights into table formats and engines. It uses both internal telemetry from LST-Bench and external telemetry from cloud services, such as resource utilization, storage API calls, and network I/O volume. The metrics processor offers multiple visualization options, including notebooks and a web app, to help users analyze performance data effectively.

Figure 3. The LST-Bench tool components and execution model.

Implications and looking ahead

LST-Bench integrates seamlessly into the testing workflows of the Microsoft Fabric (opens in new tab) warehouse, allowing that team to rigorously assess engine performance, evaluate releases, and identify any issues. This leads to a more reliable and optimized user experience on the Microsoft Fabric data analytics platform. Additionally, LST-Bench holds promise as a foundational tool for various Microsoft initiatives. It’s currently instrumental in research projects focused on improving data organization for table formats, with the goal of increasing the performance of customer workloads on Microsoft Fabric. LST-Bench is also being used to evaluate the performance of table formats converted using Apache XTable (Incubating) (opens in new tab), an open-source tool designed to prevent data silos within data lakes.

LST-Bench is open source (opens in new tab), and we welcome contributors to help expand this tool, making it highly effective for organizations to thoroughly evaluate their table formats.



Acknowledgements

We would like to thank Joyce Cahoon (opens in new tab) and Yiwen Zhu (opens in new tab) for their valuable discussions on the stability metric, and Jose Medrano (opens in new tab) and Emma Rose Wirshing (opens in new tab) for their feedback on LST-Bench and their work on integrating it with the Microsoft Fabric Warehouse.

Microsoft Research Forum Episode 3: Globally inclusive and equitable AI, new use cases for AI, and more

In the latest episode of Microsoft Research Forum, researchers explored the importance of globally inclusive and equitable AI, shared updates on AutoGen and MatterGen, and presented novel use cases for AI, including industrial applications and the potential of multimodal models to improve assistive technologies.

Below is a brief recap of the event, including select quotes from the presentations. Full replays of each session and presentation will be available soon. 

Keynote: Building Globally Equitable AI


Jacki O’Neill, Lab Director, Microsoft Research Africa, Nairobi 

Jacki O’Neill discussed the importance of creating globally equitable generative AI. She addressed the technical and sociotechnical challenges that must be tackled to positively transform the future of work worldwide.

“We’re at the very early stage of generative AI and the impacts it will have on work. This is a fast-moving field, and there’s an immense opportunity to take control of the agenda and build truly globally equitable AI systems. This requires ensuring that diverse contexts and applications, with their diverse datasets, drive the development of generative AI.”

Panel discussion: Generative AI for Global Impact: Challenges and Opportunities

Jacki O’Neill, Lab Director, Microsoft Research Africa, Nairobi (host)
Sunayana Sitaram, Principal Researcher, Microsoft Research India
Daniela Massiceti, Senior Researcher, Microsoft Research Cambridge
Tanuja Ganu, Principal Research SDE Manager, Microsoft Research India

Microsoft researchers discussed the challenges and opportunities of making AI more inclusive and impactful for everyone—from data that represents a broader range of communities and cultures to novel use cases for AI that are globally relevant.

“How can we take this power of generative AI and empower every individual, every individual across the globe—the people who are coming from different nationalities, different ethnicities, cultures, as well as with varied technology access and financial affordability?”

—Tanuja Ganu, Principal Research SDE Manager, Microsoft Research India

“One of the solutions that we’ve been using is to actually design with ‘human in the loop’ in mind because we know that these technologies are not perfect. And so, we really want to figure out ways in which humans and AI systems can work together in order to create the most effective outcome.”

—Sunayana Sitaram, Principal Researcher, Microsoft Research India

“We really need multidisciplinary research that goes beyond anything that we’ve done before, involving researchers and practitioners and community members. And it’s important to remember that machine learning engineers and researchers on their own can’t solve the problem of building globally equitable generative AI. This is something that we really need to do in a large scale.”

—Jacki O’Neill, Lab Director, Microsoft Research Africa, Nairobi 

“An estimated 1.3 billion people—around 16 percent of the global population—live with some level of disability today. So, I think it’s really exciting to see these generative AI applications coming online for these communities.” 

“As we look to this next decade of generative AI solutions, I really hope to see that we’re going to see more personalized AI models and solutions come through much more strongly, solutions where you as the user have much more control, much more agency, around how your model works for you.” 

—Daniela Massiceti, Senior Researcher, Microsoft Research Cambridge

Lightning talk: Insights into the Challenges and Opportunities of Large Multi-Modal Models for Blind and Low Vision Users: A Case Study on CLIP


Daniela Massiceti, Senior Researcher, Microsoft Research Cambridge

Daniela Massiceti explored the transformative potential of multimodal models such as CLIP for assistive technologies. Focusing on the blind and low-vision community, the talk examined how far current models are from realizing this potential and the advancements needed to bridge the gap.

“Today’s AI models hold incredible potential for assisting the Blind community—from text recognition to object identification to question answering. Apps like Seeing AI are already deploying some of these AI features. But there is potential for much more.”

Lightning talk: Driving Industry Evolution: Exploring the Impact of Generative AI on Sector Transformation


Jiang Bian, Senior Principal Research Manager, Microsoft Research Asia

Jiang Bian discussed how generative AI transforms industries by bridging gaps between AI capabilities and industrial needs.

“In our dialogues with strategic partners, we have identified crucial gaps in current generative AI capabilities versus the specific needs of industry applications. These include a too-narrow focus on human-like AI but not critical industry applications, limitations in processing complex and noisy data, and concerns about reliability in complex decision-making scenarios. Our research is crucial in addressing these limitations and amplifying the underappreciated potential of generative AI in high-value sectors.” 

Lightning talk: MatterGen: A Generative Model for Materials Design


Tian Xie, Principal Research Manager, Microsoft Research

Tian Xie described MatterGen, a generative model that enables the design of new inorganic materials based on a broad range of property conditions required by the application, aiming to shift the traditional paradigm of materials design with generative AI.

“Traditionally, materials design is conducted by search-based methods. We search through a list of candidates and gradually filter them using a list of design criteria for the application. Like for batteries, we need the materials to contain lithium, to be stable, to have a high lithium-ion conductivity, and each filtering step can be conducted using simulation-based methods or AI emulators. At the end, we get five to 10 candidates that we’re sending to the lab for experimental synthesis.” 

“In MatterGen, we hope to rethink this process with generative AI. We’re aiming to directly generate materials given the design requirements for the target application, bypassing the process of searching through candidates. You can think of it as using text-to-image generative models like DALL-E to generate the images given a prompt rather than needing to search through the entire internet for images via a search engine.” 

Lightning talk: AutoGen Update: Complex Tasks and Agents


Adam Fourney, Principal Researcher, Microsoft Research AI Frontiers 

Adam Fourney discussed the effectiveness of using multiple agents, working together, to complete complex multi-step tasks. He showcased their capability to outperform previous single-agent solutions on benchmarks like GAIA, utilizing customizable arrangements of agents that collaborate, reason, and utilize tools to achieve complex outcomes.

“We’re starting to tackle increasingly more complex benchmarks and real-world scenarios with this configuration. And we’re really excited about opportunities to introduce new agents that, for example, learn and self-improve with experience; that understand images and screenshots a little better for maybe more effective web surfing or use of interfaces; and that are maybe a bit more systematic about exploring that solution space. So rather than just updating that ledger and then restarting when they get stuck, they can be a bit more pragmatic about the strategies that they’re employing.”

Microsoft at FAccT 2024: Advancing responsible AI research and practice

The integration of AI and other computational technologies is becoming increasingly common in high-stakes sectors such as finance, healthcare, and government, where their capacity to influence critical decisions is growing. While these systems offer numerous benefits, they also introduce risks, such as entrenching systemic biases and reducing accountability. The ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT 2024) tackles these issues, bringing together experts from a wide range of disciplines who are committed to the responsible development of computational systems.

Microsoft is proud to return as a sponsor of ACM FAccT 2024, underscoring our commitment to supporting research on responsible AI. We’re pleased to share that members of our team have taken on key roles in organizing the event, contributing to the program committee and serving as a program co-chair. Additionally, seven papers by Microsoft researchers and their collaborators have been accepted to the program, with “Akal badi ya bias: An exploratory study of gender bias in Hindi language technology,” receiving an award for Best Paper. 

Collectively, these research projects emphasize the need for AI technologies that reflect the Microsoft Responsible AI principles of accountability, inclusiveness, reliability and safety, fairness, transparency, and privacy and security. They underscore the importance of addressing potential risks and harms associated with deployment and usage. This post highlights these advances.



Paper highlights

A framework for exploring the consequences of AI-mediated enterprise knowledge access and identifying risks to workers

Anna Gausen, Bhaskar Mitra, Siân Lindley

Recent AI developments, especially LLMs, are significantly impacting organizational knowledge access and reshaping workplaces. These AI systems pose risks due to their interaction with organizational power dynamics. This paper introduces the Consequence-Mechanism-Risk framework to help identify worker risks, categorizing them into issues related to value, power, and wellbeing. The framework aims to help practitioners mitigate these risks and apply it to other technologies, enabling better protection for workers.

A structured regression approach for evaluating model performance across intersectional subgroups

Christine Herlihy, Kimberly Truong, Alex Chouldechova, Miro Dudík

Disaggregated evaluation is a process used in AI fairness assessment that measures AI system performance across different subgroups. These subgroups are defined by a mix of demographic or other sensitive attributes. However, the sample size for intersectional subgroups is often very small, leading to their exclusion from analysis. This work introduces a structured regression approach for more reliable system performance estimates in these subgroups. Tested on two publicly available datasets and several variants of semi-synthetic data, this method not only yielded more accurate results but also helped to identify key factors driving performance differences. 

Akal badi ya bias: An exploratory study of gender bias in Hindi language technology

Best Paper Award

Rishav Hada, Safiya Husain, Varun Gumma, Harshita Diddee, Aditya Yadavalli, Agrima Seth, Nidhi Kulkarni, Ujwal Gadiraju, Aditya Vashistha, Vivek Seshadri, Kalika Bali

Existing research on gender bias in language technologies primarily focuses on English, often overlooking non-English languages. This paper introduces the first comprehensive study on gender bias in Hindi, the third most spoken language globally. Employing diverse techniques and field studies, the authors expose the limitations in current methodologies and emphasize the need for more context-specific and community-centered research. The findings deepen the understanding of gender bias in language technologies in Hindi and lay the groundwork for expanded research into other Indic languages.

“I’m not sure, but…”: Examining the impact of large language models’ uncertainty expression on user reliance and trust

Sunnie S. Y. Kim, Q. Vera Liao, Mihaela Vorvoreanu, Stephanie Ballard, Jennifer Wortman Vaughan

LLMs can produce convincing yet incorrect responses, potentially misleading users who rely on them for accuracy. To mitigate this issue, there have been recommendations for LLMs to communicate uncertainty in their responses. In a large-scale study on how users perceive and act on LLMs’ expressions of uncertainty, participants were asked medical questions. The authors found that first-person uncertainty expressions (e.g., “I’m not sure, but…”) decreased participants’ confidence in the system and their tendency to agree with the system’s answers, while increasing the accuracy of their own answers. In contrast, more general uncertainty expressions (e.g., “It’s unclear, but…”) were less effective. The findings stress the importance of more thorough user testing before deploying LLMs.

Investigating and designing for trust in AI-powered code generation tools

Ruotong Wang, Ruijia Cheng, Denae Ford, Tom Zimmermann

As tools like GitHub Copilot gain popularity, understanding the trust software developers place in these applications becomes crucial for their adoption and responsible use. In a two-stage qualitative study, the authors interviewed 17 developers to understand the challenges they face in building trust in AI code-generation tools. Challenges identified include setting expectations, configuring tools, and validating suggestions. The authors also explore several design concepts to help developers establish appropriate trust and provide design recommendations for AI-powered code-generation tools.

Less discriminatory algorithms

Emily Black, Logan Koepke, Pauline Kim, Solon Barocas, Mingwei Hsu

In fields such as housing, employment, and credit, organizations using algorithmic systems should seek to use less discriminatory alternatives. Research in computer science has shown that for any prediction problem, multiple algorithms can deliver the same level of accuracy but differ in their impacts across demographic groups. This phenomenon, known as model multiplicity, suggests that developers might be able to find an equally performant yet potentially less discriminatory alternative.

Participation in the age of foundation models

Harini Suresh, Emily Tseng, Meg Young, Mary Gray, Emma Pierson, Karen Levy

The rise of foundation models in public services brings both potential benefits and risks, including reinforcing power imbalances and harming marginalized groups. This paper explores how participatory AI/ML methods, typically context-specific, can be adapted to these context-agnostic models to empower those most affected.

Conference organizers from Microsoft

Program Co-Chair

Alexandra Olteanu 

Program Committee

Steph Ballard 
Solon Barocas 
Su Lin Blodgett*
Kate Crawford 
Shipi Dhanorkar 
Amy Heger
Jake Hofman*
Emre Kiciman*
Vera Liao*
Daniela Massiceti 
Bhaskar Mitra 
Besmira Nushi*
Alexandra Olteanu 
Amifa Raj
Emily Sheng 
Jennifer Wortman Vaughan*
Mihaela Vorvoreanu*
Daricia Wilkinson

*Area Chairs

Career opportunities

Microsoft welcomes talented individuals across various roles at Microsoft Research, Azure Research, and other departments. We are always pushing the boundaries of computer systems to improve the scale, efficiency, and security of all our offerings. You can review our open research-related positions here.

Introducing Aurora: The first large-scale foundation model of the atmosphere

When Storm Ciarán battered northwestern Europe in November 2023, it left a trail of destruction. The low-pressure system associated with Storm Ciarán set new records for England, marking it as an exceptionally rare meteorological event. The storm’s intensity caught many off guard, exposing the limitations of current weather-prediction models and highlighting the need for more accurate forecasting in the face of climate change. As communities grappled with the aftermath, the urgent question arose: How can we better anticipate and prepare for such extreme weather events? 

A recent study by Charlton-Perez et al. (2024) underscored the challenges faced by even the most advanced AI weather-prediction models in capturing the rapid intensification and peak wind speeds of Storm Ciarán. To help address those challenges, a team of Microsoft researchers developed Aurora, a cutting-edge AI foundation model that can extract valuable insights from vast amounts of atmospheric data. Aurora presents a new approach to weather forecasting that could transform our ability to predict and mitigate the impacts of extreme events—including being able to anticipate the dramatic escalation of an event like Storm Ciarán.  

A flexible 3D foundation model of the atmosphere

Figure 1: Aurora is a 1.3 billion parameter foundation model for high-resolution forecasting of weather and atmospheric processes. Aurora is a flexible 3D Swin Transformer with 3D Perceiver-based encoders and decoders. At pretraining time, Aurora is optimized to minimize a loss on multiple heterogeneous datasets with different resolutions, variables, and pressure levels. The model is then fine-tuned in two stages: (1) short-lead time fine-tuning of the pretrained weights and (2) long-lead time (rollout) fine-tuning using Low Rank Adaptation (LoRA). The fine-tuned models are then deployed to tackle a diverse collection of operational forecasting scenarios at different resolutions.

Aurora’s effectiveness lies in its training on more than a million hours of diverse weather and climate simulations, which enables it to develop a comprehensive understanding of atmospheric dynamics. This allows the model to excel at a wide range of prediction tasks, even in data-sparse regions or extreme weather scenarios. By operating at a high spatial resolution of 0.1° (roughly 11 km at the equator), Aurora captures intricate details of atmospheric processes, providing more accurate operational forecasts than ever before—and at a fraction of the computational cost of traditional numerical weather-prediction systems. We estimate that Aurora offers a computational speed-up of roughly 5,000x over the state-of-the-art numerical forecasting system, the Integrated Forecasting System (IFS).

Beyond its impressive accuracy and efficiency, Aurora stands out for its versatility. The model can forecast a broad range of atmospheric variables, from temperature and wind speed to air-pollution levels and concentrations of greenhouse gases. Aurora’s architecture is designed to handle heterogeneous, gold-standard inputs and generate predictions at different resolutions and levels of fidelity. The model consists of a flexible 3D Swin Transformer with Perceiver-based encoders and decoders, enabling it to process and predict a range of atmospheric variables across space and pressure levels. By pretraining on a vast corpus of diverse data and fine-tuning on specific tasks, Aurora learns to capture intricate patterns and structures in the atmosphere, allowing it to excel even with limited training data when it is being fine-tuned for a specific task.
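The long-lead (rollout) fine-tuning stage mentioned above uses Low Rank Adaptation (LoRA). As a generic illustration of that idea, the sketch below freezes a pretrained weight matrix and learns only a small low-rank update; the shapes, rank, and scaling are illustrative, and Aurora’s exact setup differs.

```python
# Generic LoRA sketch: freeze pretrained W, learn a low-rank update B @ A.
# Rank, scaling, and shapes are illustrative, not Aurora's configuration.
import numpy as np

class LoRALinear:
    def __init__(self, W: np.ndarray, rank: int = 8, alpha: float = 16.0):
        d_out, d_in = W.shape
        self.W = W                                    # frozen pretrained weights
        self.A = np.random.randn(rank, d_in) * 0.01   # trainable, small init
        self.B = np.zeros((d_out, rank))              # trainable, zero init, so
        self.scale = alpha / rank                     # fine-tuning starts at W

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # Only A and B receive gradient updates during fine-tuning; W is fixed.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(np.random.randn(32, 64))
x = np.random.randn(4, 64)
assert np.allclose(layer(x), x @ layer.W.T)  # identical to the base layer at init
```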

Fast prediction of atmospheric chemistry and air pollution

Figure 2: Aurora outperforms operational CAMS across many targets. (a) Sample predictions for total column nitrogen dioxide by Aurora compared to CAMS analysis. Aurora was initialized with CAMS analysis at 1 Sep 2022 00 UTC. Predicting atmospheric gases correctly is extremely challenging due to their spatially heterogeneous nature. In particular, nitrogen dioxide, like most variables in CAMS, is skewed toward high values in areas with large anthropogenic emissions, such as densely populated areas in East Asia. In addition, it exhibits a strong diurnal cycle; e.g., sunlight reduces background levels via a process called photolysis. Aurora accurately captures both the extremes and background levels. (b) Latitude-weighted root mean square error (RMSE) of Aurora relative to CAMS, where negative values (blue) mean that Aurora is better. The RMSEs are computed over the period Jun 2022 to Nov 2022 inclusive. Aurora matches or outperforms CAMS on 74% of the targets.

A prime example of Aurora’s versatility is its ability to forecast air-pollution levels using data from the Copernicus Atmosphere Monitoring Service (CAMS), a notoriously difficult task due to the complex interplay of atmospheric chemistry, weather patterns, and human activities, as well as the highly heterogeneous nature of CAMS data. By leveraging its flexible encoder-decoder architecture and attention mechanisms, Aurora effectively processes and learns from this challenging data, capturing the unique characteristics of air pollutants and their relationships with meteorological variables. This enables Aurora to produce accurate five-day global air-pollution forecasts at 0.4° spatial resolution, outperforming state-of-the-art atmospheric chemistry simulations on 74% of all targets, demonstrating its remarkable adaptability and potential to tackle a wide range of environmental prediction problems, even in data-sparse or highly complex scenarios. 
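A note on scoring: the latitude-weighted RMSE reported in Figure 2 accounts for the fact that grid cells on a regular latitude-longitude grid shrink toward the poles, so each row of the error field is weighted by the cosine of its latitude. A minimal sketch of that metric follows; the grid layout and normalization are illustrative assumptions.

```python
# Minimal latitude-weighted RMSE for a global field on a regular lat-lon grid;
# normalization details are illustrative.
import numpy as np

def lat_weighted_rmse(pred: np.ndarray, truth: np.ndarray, lats_deg: np.ndarray) -> float:
    """pred/truth: (n_lat, n_lon) fields; lats_deg: latitude of each grid row."""
    w = np.cos(np.deg2rad(lats_deg))
    w = w / w.mean()                       # normalize so weights average to 1
    sq_err = (pred - truth) ** 2
    return float(np.sqrt((w[:, None] * sq_err).mean()))

lats = np.linspace(-90, 90, 181)
pred, truth = np.zeros((181, 360)), np.ones((181, 360))
print(lat_weighted_rmse(pred, truth, lats))   # 1.0 for a uniform unit error
```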

Data diversity and model scaling improve atmospheric forecasting

One of the key findings of this study is that pretraining on diverse datasets significantly improves Aurora’s performance compared to training on a single dataset. By incorporating data from climate simulations, reanalysis products, and operational forecasts, Aurora learns a more robust and generalizable representation of atmospheric dynamics. Thanks to its scale and diverse pretraining corpus, Aurora is able to outperform state-of-the-art numerical weather-prediction models and specialized deep-learning approaches across a wide range of tasks and resolutions.

Figure 3: Pretraining on diverse data and increasing model size improves performance. (a) Performance versus ERA5 2021 at 6h lead time for models pretrained on different dataset configurations (i.e., no fine-tuning), labeled C1-C4. The root mean square errors (RMSEs) are normalized by the performance of the ERA5-pretrained model (C1). Adding low-fidelity simulation data from CMIP6 (i.e., CMCC and IFS-HR) improves performance almost uniformly (C2). Adding even more simulation data improves performance further on most surface variables and for the atmospheric levels present in this newly added data (C3). Finally, configuration C4, which contains good coverage of the entire atmosphere and also includes analysis data from GFS, achieves the best overall performance with improvements across the board. (b) Pretraining on many diverse data sources improves the forecasting of extreme values at 6h lead time across all surface variables of IFS-HRES 2022. Additionally, the results also hold on wind speed, which is a nonlinear function of 10U and 10V. (c) Bigger models obtain lower validation loss for the same amount of GPU hours. We fit a power law that indicates a 5% reduction in the validation loss for every doubling of the model size.

A direct consequence of Aurora’s scale, both in terms of architecture design and training data corpus, as well as its pretraining and fine-tuning protocols, is its superior performance over the best specialized deep learning models. As an additional validation of the benefits of fine-tuning a large model pretrained on many datasets, we compare Aurora against GraphCast — pretrained only on ERA5 and currently considered the most skillful AI model at 0.25-degree resolution and lead times up to five days. Additionally, we include IFS HRES in this comparison, the gold standard in numerical weather prediction. We show that Aurora outperforms both when measured against analysis, weather station observations, and extreme values. 

Figure 4: Aurora outperforms operational GraphCast across the vast majority of targets. (a) Scorecard versus GraphCast at 0.25-degree resolution. Aurora matches or outperforms GraphCast on 94% of targets. Aurora obtains the biggest gains (40%) over GraphCast in the upper atmosphere, where GraphCast performance is known to be poor. Large improvements of up to 10-15% are observed at short and long lead times. The two models are closest to each other in the lower atmosphere at the 2-3 day lead time, which corresponds to the lead time GraphCast was rollout-finetuned on. At the same time, GraphCast shows slightly better performance up to five days and at most levels on specific humidity (Q). (b) Root mean square error (RMSE) and mean absolute error (MAE) for Aurora, GraphCast, and IFS-HRES as measured by global weather stations during 2022 for wind speed (left two panels) and surface temperature (right two panels). (c) Thresholded RMSE for Aurora, GraphCast, and IFS-HRES normalized by IFS-HRES performance. Aurora demonstrates improved prediction for the extreme values, or tails, of the surface variable distributions. In each plot, values to the right of the center line are cumulative RMSEs for targets found to sit above the threshold, and those to the left represent target values sitting below the threshold.

A paradigm shift in Earth system modeling 

The implications of Aurora extend far beyond atmospheric forecasting. By demonstrating the power of foundation models in the Earth sciences, this research paves the way for the development of comprehensive models that encompass the entire Earth system. The ability of foundation models to excel at downstream tasks with scarce data could democratize access to accurate weather and climate information in data-sparse regions, such as the developing world and polar regions. This could have far-reaching impacts on sectors like agriculture, transportation, energy harvesting, and disaster preparedness, enabling communities to better adapt to the challenges posed by climate change. 

As the field of AI-based environmental prediction evolves, we hope Aurora will serve as a blueprint for future research and development. The study highlights the importance of diverse pretraining data, model scaling, and flexible architectures in building powerful foundation models for the Earth sciences. With continued advancements in computational resources and data availability, we can envision a future where foundation models like Aurora become the backbone of operational weather and climate prediction systems, providing timely, accurate, and actionable insights to decision-makers and the public worldwide. 

Acknowledgements

We are grateful for the contributions of Cristian Bodnar, a core contributor to this project.

The post Introducing Aurora: The first large-scale foundation model of the atmosphere appeared first on Microsoft Research.

Read More

What’s Your Story: Weishung Liu

What’s Your Story: Weishung Liu

Microsoft Research Podcast | What's Your Story | Weishung Liu

In the Microsoft Research Podcast series What’s Your Story, Johannes Gehrke explores the who behind the technical and scientific advancements helping to reshape the world. A systems expert whose 10 years with Microsoft spans research and product, Gehrke talks to members of the company’s research community about what motivates their work and how they got where they are today.

In this episode, Gehrke is joined by Principal PM Manager Weishung Liu. Liu brings product development and management expertise honed at companies such as Disney, Fluke, and SpaceX to her role at Microsoft, where she helped develop the real-time video analytics platform Watch For and today empowers teams within Microsoft Research to maximize their reach. She talks about how being more homebound as a child cultivated the love of people and stories that underlies her professional pursuits and how she landed in tech despite efforts to “rebel” against the expectations that come with growing up in Silicon Valley.

Photos of Weishung Liu, Principal PM Manager, throughout her life.

Transcript

[SPOT]

WEISHUNG LIU: Hey, listeners. I’m Weishung Liu, principal PM manager with Microsoft Research and today’s podcast guest. Before we get started, I want to tell you about Microsoft Research Forum. It’s a series of discussions and talks examining how the rapid advances in AI are impacting science and technology research. The next episode is June 4, and colleagues of mine from around Microsoft Research are participating. I highly recommend checking it out. You can learn more and register now at aka.ms/MyResearchForum. All right, here’s today’s show …

[END OF SPOT]

[TEASER]

[MUSIC PLAYS UNDER DIALOGUE] 

WEISHUNG LIU: I’ve always felt like I want the things that I work on to create joy in people. … The fact that I can still be here and create impact and do meaningful work and, you know, work on things that create joy and positively impact society, it speaks to me like stories speak to me.

[TEASER ENDS]

JOHANNES GEHRKE: Microsoft Research works at the cutting edge. But how much do we know about the people behind the science and technology that we create? This is What’s Your Story, and I’m Johannes Gehrke. In my 10 years with Microsoft, across product and research, I’ve been continuously excited and inspired by the people I work with, and I’m curious about how they became the talented and passionate people they are today. So I sat down with some of them. Now, I’m sharing their stories with you. In this podcast series, you’ll hear from them about how they grew up, the critical choices that shaped their lives, and their advice to others looking to carve a similar path.

[MUSIC FADES]


In this episode, I’m talking with Principal PM Manager Weishung Liu. Wei has used her love of storytelling and interest in people and their motivations to deliver meaningful products and customer experiences. This includes the creation of a successful line of Disney plush toys and contributions to the satellite internet system Starlink. With Microsoft, she helped develop Watch For, a real-time video analytics platform that has gone on to enhance gaming via streaming highlights and to support content moderation in products such as Xbox. Today, she’s facilitating connections and devising strategies to empower teams within Microsoft Research to maximize their reach. Here’s my conversation with Wei, beginning with her childhood in Silicon Valley.

JOHANNES GEHRKE: Hi, Wei. Welcome to What’s Your Story. You’re our principal PM manager here in the lab, and we’ll talk in a little while about, you know, what you’re doing here right now, but maybe let’s start with, how did you actually end up in tech? Where did you grow up?

WEISHUNG LIU: Oh, wow. OK. So this is a very long, long and, like, nonlinear story about how I got into tech. So I grew up in Silicon Valley, which one would assume means just, like, oh, yes, you grew up in Silicon Valley; therefore, you must be in the STEM field, and therefore, you will be in tech for the rest of your life.

GEHRKE: Yep, that’s, sort of, all too familiar a story.

LIU: That’s a very linear story. And I totally actually wanted to rebel against that whole notion of going into tech. So I grew up in Silicon Valley and thought, like, man, I want to not do STEM.

GEHRKE: So did your parents want you to be either a doctor or engineer? Is that the … ?

LIU: Absolutely. It was either a doctor, engineer, or lawyer. So thankfully my sister went the PhD in psychology route, so she, kind of, checked that box for us. And so I was a little bit more free to pursue my very, very, very wide variety of interests. So a little bit of personal information about me. So I grew up a very sick child, and so I was hospitalized a lot. I was in the ER a lot. But that actually afforded me a lot of opportunities to be, sort of, an indoor-only child of reading and playing video games and all sorts of things that I would say, like, expanded my worldview. Like, it was just all sorts of different stories. Like, reading has stories; video games have stories.

GEHRKE: Tell us a story about reading and a story about video games. What …

LIU: Oh my goodness …

GEHRKE: … were your favorite set of books?

LIU: I was really interested in, like, historical fiction at the time. One book that I remember reading about—oh my gosh, it’s a very famous book, and I don’t remember the name anymore. However, it was about a young girl’s perspective of being, living in an internment camp, the Japanese internment camps, back during World War II, I believe, after Pearl Harbor.[1] And it was just kind of her diary and her perspective. It was almost like Diary of Anne Frank but from a Japanese American girl’s perspective instead. And I just loved, kind of, reading about different viewpoints and different eras and trying to understand, like, where do we overlap, how do things change over time, how does history repeat itself in some ways? And, and I love that. And then video games. So I was really into Japanese RPGs back in the day. So it’s funny. I started … my first console was a Mattel Intellivision II, and then it gradually went up to like Nintendo, Super Nintendo, all those, all those consoles. But I had a friend who I used to play RPGs with …

GEHRKE: So these were network RPGs or individual RPGs?

LIU: These were individual RPGs. This is, you know, when I was around 10, the internet appeared, so it probably dates me a little bit. Every time a new RPG came out like by—the company is now called Square Enix but back then it was called SquareSoft—or Nintendo like Zelda, he and I would immediately go out and buy the game or, you know, convince our parents at the time to buy the game, and then we would compete. So, like, this is not couch co-op; he was actually in Texas.

GEHRKE: Like long-distance co-op?

LIU: This is long-distance, long-distance gaming where we would compete to see who would beat the game first.

GEHRKE: Wow.

LIU: No, you’re not allowed to use walkthroughs. And he almost always beat me.

GEHRKE: But these games are like 60-hour, 80-hour games?

LIU: Yeah, like 60- or 80-hour games, but, like, you know, we got so good at them that, well, you had to figure out like how do you, kind of, bypass and get through the main quest as fast as possible. So that was always—

GEHRKE: So any of the side quests and things like that just … ?

LIU: Yeah, oh, yeah, no. So I’m actually a huge completionist, though, so I’d always go back after and do all the side quests to get, you know, we’ll just say “100 percent” achievement. I’m a little bit of an achievement machine that way. But so, like, that kind of stuff was always super fun for me. And so I spent so much of my time then—because I was, kind of, more homebound a lot—just exploring and being curious about things. And, and that got me into art and into design, and I thought, man, I’m going to be an architect someday because I love designing experiences, like spaces for people.

GEHRKE: You thought at that point in time like a real, like a building architect or an architect for like virtual worlds or so … ?

LIU: No, real, like a real physical space that people inhabit and experience. And so, like, I avoided as much STEM as I could in school. I couldn’t, just due to where I lived and grew up and the high school requirements that I had. But the minute I went to college, which happened to be at the University of Washington, which has a great architecture program, I was like, I’m never going to take another STEM class in my life.

GEHRKE: So you enrolled as an architecture major?

LIU: I enrolled as an architecture major, and I was like, I will do what we would call the “natural world” credits, which is kind of the STEM-like things. But I would intentionally find things that were not, like, hard science because I’m like, I’m never going to do this again. I’m never going to be in tech. All these people that are so obsessed with tech who, you know, went to MIT and Stanford, and I’m like, no, no, no, I’m going to be an architecture major.

GEHRKE: So you took, like, the physics for poets class or so …?

LIU: Stuff like that, right. [LAUGHS] Very, very similar. But I ended up just loving learning at school, which is very unsurprising. You know, I took, like, an Arabic poetry class. I took a French fairy tales class. And I just, kind of, explored college and all the things that it had to offer in terms of academics so much that I actually ended up deciding to get two degrees: one in industrial design, which is not too far away from architecture. Architecture is like with large spaces, like you build one building or design one building that lasts maybe 100 years. Industrial design, I, kind of, joke about it. It’s, you know, you design smaller form factors that sometimes, if they’re manufactured with plastics, last millions of years, [LAUGHS] and you build millions of them. But then I also ended up getting a degree in comparative religion, as well. Which it meant that, like, my schooling and my class schedules are always a little bit odd because I’d go from, you know, like, the industrial design shop down in our design building and like making things with my hands and working at the bandsaw, and then I’d, you know, rush to this other class where we have like very fascinating philosophical debates about various things in, sort of, the comparative religion space. And I’d write, you know, 10-page essays and … about all sorts of things. And, you know, there’s, like, the study of death is a great example and how different cultures react to death. But, you know, that was as far away from STEM [LAUGHS] as I could have possibly gone.

GEHRKE: Right. I was just thinking, can you maybe explain to our listeners a little bit who may come a little bit more from the STEM field traditionally, what do you study in comparative [religion], and what is the field like?

LIU: So for me, it was really just, like, I took a lot of classes just trying to understand people. I really … and it sounds, kind of, silly to say it that way, but religion is really formed and shaped by people. And so for me, like, the types of classes that I took were, sort of, like studying Western religion, studying Eastern religion, studying the philosophy of religion, like or even—and this still, I still think about it from time to time—how do you define religion? And just even … there’s still so many scholarly debates about how to define, like, what is a “pure” definition of religion, and nobody can really still identify that yet. Is it, you know, because then there’s this distinction of spiritualism and being religious versus something else or just completely made-up, you know, pseudoscience, whatever, right. People have this wide spectrum of things that they describe. But it’s really around learning about the different foundations of religion. And then people tend to specialize. You know, they might specialize in a particular area like Hinduism or, you know, broadly speaking, Eastern religions, or people will, you know, start focusing on Western religions. Or sometimes I think about a specific topic like the intersection of, for example, religion and death or religion and art or even, you know, religion and violence. And there’s a broad spectrum of things that people start specializing in. And it’s very, it’s, sort of, very much in the mind but very much in the heart of how you understand that.

GEHRKE: Yeah, I can see how it even connects to industrial design because there you also want to capture the heart …

LIU: Yes.

GEHRKE: … the hearts of people, right.

LIU: Yep. And that’s kind of how I, how I describe, you know, when people are like, why did you major in that? Like, what do you even do with that? Did you even think about what career you would have with that? I’m like, no, I just really wanted to learn, and I really wanted to understand people. And I felt like religion is one way to understand, sort of, like, sociologically how people think and get into that deep, like, that deep feeling of faith and where does it come from and how does it manifest and how does it motivate people to do things in life. And to your point, it’s very similar to industrial design because you’re, you know, we talk about design thinking and you have to really deeply understand the user and the people that you’re designing for in order to create something that really lasts, that matters to them. So that’s, kind of, my, at least my undergrad experience. And in a very, very brief way, I’ll just kind of walk through or at least tell you the very nonlinear path that I took to get to where I am here now at Microsoft Research. So like the day after I graduated from the University of Washington, I moved to Florida.

GEHRKE: And just as a question: so you graduated from the University of Washington—did you have like a plan, you know, this is like the career I want to have?

LIU: Oh no! So here’s the funny thing about design, and I hope that, you know, my other, the designers who might be watching or listening [LAUGHS] to this might not get upset—hopefully don’t get upset with me about this—is I love the design thinking aspect of design, like understanding why people do the things they do, what types of habits can you build with the products—physical products? I was very obsessed with physical, tangible things at the time. And then I learned through, like, internships and talking to other designers who were, you know, already in the field that that’s not what they do. That they don’t go and like, oh, let’s go talk to people and understand deeply what they do. Like, there’s other people that do that. OK, well, what do you do? Well, I work in, you know, CAD, or I work on SolidWorks, or I do Rhino, and I do surfacing. I’m like, OK, what else? Who decides what gets made? Oh, that’s like, you know, a product manager or product—oh, what’s that? Who? What? What does that even mean? Like, tell me more about that.

GEHRKE: So it’s like the dichotomy that you see even here in the company where the engineers have to, sort of, build the things, but the product managers are …

LIU: But someone else is …

GEHRKE: … in the middle

LIU: … someone else is, kind of, interpreting what the market and the users are saying, what the business is saying. And I was like, I like doing that because that’s more about understanding people and the business and the reason—the why. And so …

GEHRKE: Just before you go to your career, I mean, I must … I have to ask, what are some of the favorite things that you built during your undergrad? Because you said you really like to build physical things.

LIU: Oh my gosh!

GEHRKE: Maybe one or two things that you actually built …

LIU: Yeah …

GEHRKE: … that was, sort of, so fun.

LIU: So one of my projects was actually a Microsoft-sponsored project for one quarter, and all they showed up with—his name’s Steve Kaneko. He retired not too long ago from here. Steve showed up and said, I want you all to design a memory-sharing device.

GEHRKE: Interesting …

LIU: And that was it.

GEHRKE: So what is memory sharing? He didn’t define what that means?

LIU: He didn’t define it because as designers, that was our way of interpret—we had to interpret and understand what that meant for ourselves. And it was a very, very free-form exploration. And I thought … the place that I started from was … at the time, I was like, there’s like 6 or 7 billion people in the world. How many of them do I actually know? And then how many of them do I actually want to know or maybe I want to know better?

GEHRKE: To share a memory with …

LIU: To share my memories with, to share a part of me. Like, memories are …

GEHRKE: Pretty personal.

LIU: … who we are—or not who we are but parts of who we are—and drive who we become in some ways. And so I thought, you know, what would be cool is if you had a bracelet, and the bracelet were individual links, and each individual link was a photo, like a digital photo, very tiny digital photo, of something that you chose to share. And so, you know, I designed something at the time … like, the story I told was, like, well, you know, this woman who’s young decided to go to, you know, she’s taking the bus, and she put on her, like, “I wish to go to Paris” kind of theme, right. So she had a bunch of Parisian-looking things or something in that vein, right. And, you know, she gets on the bus and her bracelet vibrates. There’s, like, a haptic reaction from this bracelet. And that means that there’s someone else on the bus with this, you know, with a bracelet with their memories. It’s kind of an indicator that people want to share their stories with someone else. And, you know, wouldn’t it be great if, you know, this woman now sits down on the bus, because she sits next to the person who’s wearing it. Turns out to be an elderly woman who’s wearing, coincidentally, you know, her Paris bracelet, but it’s of her honeymoon of her deceased husband from many years ago. And, you know, like, think of the power of the stories that they could share with each other. That, you know, this woman, elderly woman, can share with, you know, this younger woman, who has aspirations to go, and the memories and the relationship that they can build from that. And so that was, kind of, my memory-sharing device at the time.

GEHRKE: I mean, it’s super interesting because, I mean, the way I think about this is that we have memory-sharing applications now like Facebook and Instagram and TikTok and so on, but they, the algorithm decides really …

LIU: Yes …

GEHRKE: … who to share it with and where and why to share it. Whereas here, it’s proximity, right? It somehow leads to this physical and personal connection afterwards, right? The connection is not like, OK, suddenly on my bracelet, her stories show up …

LIU: Yes …

GEHRKE: … but, you know, maybe we sit next to each other on the bus, and it vibrates, and then we start a conversation.

LIU: Exactly. It’s you own, you know, whatever content is on that you choose to have on your physical person, but you’re sharing yourself in a different way, and you’re sharing your memories and you’re sharing a moment. And it might just be a moment in time, right. It doesn’t have to be a long-lasting thing. That, you know, this elderly woman can say, hey, there’s this really great bistro that we tried on, you know, this particular street, and I hope it’s still there, because if you go, ask for this person or try this thing out and, like, what an incredible opportunity it is for this other woman, who, you know, maybe she does someday go to Paris and she does find it. And she thinks of that time, like, how grateful she was to have met, you know, this woman on the bus. And just for that brief whatever bus … however long that bus ride was, to have that connection, to learn something new about someone else, to share and receive a part of somebody else who you may never have known otherwise. And then that was, that was what I was thinking of, you know, in terms of a memory-sharing device was memory creates connections or it reinforces connections. So I guess very similarly to my people thing and being fascinated by people, like, this was my way of trying to connect people in a different way, in the space that they inhabit and not necessarily on their devices.

GEHRKE: And then what did Microsoft say to that? Was there like an end-of-quarter presentation?

LIU: Oh, yeah! There was a, there was a, you know, big old presentation. I can’t even remember which building we were at, but I think everybody was just like, wow, this is great. And that was it. [LAUGHTER]

GEHRKE: And that was it. It sounds like a really fascinating device.

LIU: Yeah, it was. And lots of people came up with all sorts of really cool things because everybody interpreted the, I’ll just say, the prompt differently, right.

GEHRKE: Right …

LIU: … And that was my interpretation of the prompt at the time.

GEHRKE: Well, super interesting.

LIU: Yeah.

GEHRKE: Coming back to, so OK, so you’ve done just a bunch of really amazing projects. You, sort of, it seems like you literally lived the notion of liberal education.

LIU: I did. I, like, even now I just love learning. I get my hands on all sorts of weird things. I picked up whittling as a random example.

GEHRKE: What is whittling? Do I even know what that is? [LAUGHS]

LIU: So whittling is basically carving shapes into wood. So … I’m also very accident prone, so there’s, like, lots of gloves I had to wear to protect my hands. But, you know, it was like, oh, I really just want to pick up whittling. And I literally did, you know. You can grab a stick and you can actually buy balsa wood that’s in a, in decent shape. But you can just start carving away at whatever … whatever you would like to form that piece of wood into, it can become that. So I made a cat, and then I made what I jokingly refer to as my fidget toy at home. It’s just a very smooth object. [LAUGHS]

GEHRKE: That you can hold and …

LIU: I just made it very round and smooth and you can just, kind of, like, rub it, and yeah, it’s …

GEHRKE: Super interesting.

LIU: … it’s … I pick up a lot of random things because it’s just fascinating to me. I learned a bunch of languages when I was in school. I learned Coptic when I was in school for no other reason than, hey, that sounds cool; you can read the Dead Sea Scrolls [LAUGHS] when you learn Coptic—OK!

GEHRKE: Wow. And so much, so important in today’s world, right, which is moving so fast, is a love for learning. And then especially directed in some areas.

LIU: Yeah.

GEHRKE: You know, that’s just really an awesome skill.

LIU: Yeah.

GEHRKE: And so you just graduated. You said you moved to Florida.

LIU: Oh, yes, yes. Yes. So, so about a month before this happened, right—it didn’t just spontaneously happen. A month before, I had a good friend from the architecture program who had said, hey, Wei, you know, I’m applying for this role in guest services at Disney. I was like, really? You can do that? And she’s like, yeah, yeah, yeah. So I was like, that sounds really cool. And I, you know, went to, like, the Disney careers site. I’m like one month or two months away from graduating. Still, like, not sure what I’m totally going to do because at that point, I’m like, I don’t think I want to be a designer because I don’t—the part that I love about it, the part that I have passion about, is not in the actual design of the object, but it’s about the understanding of why it needs to exist.

GEHRKE: The interconnection between the people and the design.

LIU: The people and the design, exactly. And so when I found, I found this, like, product development internship opportunity, and I was like, what does that even mean? That sounds cool. I get to …

GEHRKE: At Disney?

LIU: At Disney. And it was, like—and Disney’s tagline, the theme park merchandise’s tagline, was “creating tangible memories.” I was like, oh boy, this just checks all the boxes. So I applied, I interviewed, did a phone interview, and they hired me within 24 hours. They were like, we would like you to come. And I was like, I would absolutely love to move to Florida and work there. So, yeah, the day after I graduated from U-Dub, I drove all the way across the country from Seattle.

GEHRKE: You drove?

LIU: From Seattle with two cats.

GEHRKE: That must have been an interesting adventure by itself.

LIU: Oh, yes. With two cats in the car, let me tell you, it was fascinating. All the way to Florida, Orlando, Florida. And the day that I got there or, no, two days after I got there, I found out that I was going to be working in the toys area. So plush and dolls, which is, like, you can imagine just absolutely amazing. Making, like, stuffed toys that then—because my office was a mile down the road from Disney’s Animal Kingdom and therefore a couple miles away from Magic Kingdom or Hollywood Studios or EPCOT—I could actually go see, I’ll just say, the “fruits of my labor” instantly and not only that. See it bring joy to children.

GEHRKE: So what is the path? So you would design something, and how quickly would it then actually end up in the park? Or how did you, I mean, how did you start the job?

LIU: What did I do there? Yeah, yeah …

GEHRKE: Well, what’s the interface between the people and the design here?

LIU: Yeah … so, so, really, I didn’t actually do any design. There was an entire group called Disney Design Group that does all the designing there. And so what I did was I understood, what do we need to make and why? What memories are we—what tangible memories do we want to create for people? Why does it matter to them? In many ways, it’s, sort of, like, it’s still a business, right. You’re creating tangible memories to generate revenue and increase the bottom line for the company. But … so my role was to understand what trends were happening: what were the opportunities? What were guests doing in the parks? What types of things are guests looking for? What are we missing in our SKU lineup, or stock-keeping-unit lineup, and then in which merchandising areas do they need to happen? And so I, actually, as part of my internship, my manager said, hey, I let every intern every time they’re here come up with any idea they want, and you just have to see it from start to execution—in addition to all the other stuff that I worked on. I was like, sounds good. And I came up with this idea that I was like, you know, it would be cool … Uglydolls was really popular at the time. Designer toys were getting really popular from Kidrobot, which was kind of, like, there was this vinyl thing and you can—it was just decorative of all different art styles on the same canvas. And I was like, you know, what if we did that with Mickey, and then, you know, what if the story that we’re telling is, you know, just for the parks—Walt Disney World and Disneyland—that there were aliens or monsters coming to visit the park, but they wanted to blend in and fit in? Well, how would they do that? Well, they clearly see Mickey heads everywhere, and Mickey is very popular here clearly, and so they try to dress up like Mickey, but they don’t do it quite well. So they got the shape right, but everything else about them is a little bit different, and they all have their own unique personalities and …

GEHRKE: You can tell a story around them …

LIU: You can tell a story—see, it’s all about stories. And then it … I got buy-in from everybody there, like, all the way up to the VP. I had to get brand because I was messing with the brand icon. But, you know, it became an entire line called Mickey Monsters at Disney. I still have them all. There were two—then it went from plush; it became consumables, which are like edible things. It went into key chains. It went, it was super … it was … I probably went a little bit too hard, or I took the, I think, I took the assignment very seriously. [LAUGHS]

GEHRKE: Yep, yep. Well, it seemed to be a huge success, as well.

LIU: Yeah. It did really well in the time that it was there. We did a test, and I was really, really proud of it. But you know, my—what I did though is, you know, very concretely was I started with an idea. I, you know, convinced and aligned with lots of people in various disciplines that this is something that we should try and experiment on. You know, worked with the designers to really design what this could look like. You know, scoped out what types of fabrics because there’s all sorts of different textures out there. Working with, kind of, our sourcing team to understand, like, which vendors do we want to work with. And then typically, in the plush industry, manufacturing back in the day could happen—and in terms of supply chain, manufacturing, and then delivery of product—could take about six months.

GEHRKE: OK … 

LIU: And so when I was there, anything I worked on would, kind of, appear in six months, which is actually very cool. I mean, it’s not like software, where anything you work on is, you’re like boop, compile—oh look [there] it is. It depends on how fast your computer is. You know, it’s pretty instantaneous compared to six months to see the fruits of your labor. But it was a really, just such a great experience. And then seeing, you know, then going to the parks and seeing children with …

GEHRKE: Yeah, the stuff that you …

LIU: … the thing that I worked on, the thing that I had the idea on, and, like, them going like, Mom, I really want this.

GEHRKE: Right …

LIU: You know, we’re not really selling to the kids; we’re, kind of, selling to the parents.

GEHRKE: It’s a bit like this feeling that we can have here at Microsoft, right, if any of our ideas makes it into products …

LIU: Yup …

GEHRKE: … that are then used by 100 million people and hopefully bring them joy and connection.

LIU: Exactly. And that’s why, like, I just think Microsoft is great, because our portfolio is so broad, and so much of our work touches different parts of our lives. And I’ll even pick on, you know, like I have, you know, in my family, my daughter goes to school—clearly, obviously, she would go to school—but she used Flipgrid, now known as Flip, for a while. And I was like, hey, that’s cool. Like, she uses something that, you know, I don’t directly work on, but my company works on.

GEHRKE: Well, and you were involved with it through Watch For, right …

LIU: Yes, I was …

GEHRKE: … which did become the motivation for Flip.

LIU: Yep. Watch For, you know, helps to detect inappropriate content on Flip. And, you know, that’s super cool because now I’m like, oh, the work that I’m doing actually is directly impacting and helping people like my daughter and making a difference and, you know, keeping users safe from content that maybe we don’t want them to see. You know, other areas like Microsoft Word, I’m like, wow, this is a thing. Like, I’m at the company that makes the thing that I’ve used forever, and, you know, like, it’s just fascinating to see the types of things that we can touch here at Microsoft Research, for example. And how, you know, I, you know, Marie Kondo popularized the term “joy,” like, “sparking joy,” but …

GEHRKE: If you look at an item and if it doesn’t sparkle joy …

LIU: If it doesn’t spark joy, right …

GEHRKE: … then you know on which side it goes.

LIU: Exactly. But, but, you know, like, I’ve always felt like I want the things that I work on to create joy in people. And it was very obvious when you make toys that you see the joy on children’s faces with it. It’s a little bit different, but it’s so much more nuanced and rewarding when you also see, sort of, the products that, the types of things that we work on in research create joy. It’s, you know, it’s funny because I mentioned software is instantaneous in many ways, and then, you know, toys takes a little bit longer. But then, you know, in the types of research that we do, sometimes it takes a little bit longer than, a little bit longer [LAUGHS] …

GEHRKE: It takes years sometimes!

LIU: … than six months. Years to pay off. But, like, that return on that investment is so worth it. And, you know, I see that in, kind of, the work that lots of folks around MSR [Microsoft Research] do today. And knowing that even, sort of, the circles that I hang out in now do such crazy, cool, impactful things that help benefit the world. And, you know, it’s funny, like, never say never. I’m in tech and I love it, and I don’t have a STEM background. I didn’t get a STEM background. I didn’t get it, well, I don’t have a STEM degree. Like, I did not go—like, I can’t code my way out of a paper bag. But the fact that I can still be here and create impact and do meaningful work and, you know, work on things that create joy and positively impact society is, like, it speaks to me like stories speak to me.

GEHRKE: I mean, there’s so many elements that come together in what you’re saying. I mean, research is not a game of the person sitting in the lowly corner on her whiteboard, right? But it’s a team sport.

LIU: Yep.

GEHRKE: It requires many different people with many different skills, right? It requires the spark of ingenuity. It requires, you know, the deep scientific insight. It requires then the scaling and engineering. It requires the PM, right, to make actually the connection to the value, and the execution then requires the designer to actually create that joy with the user interface to seeing how it actually fits.

LIU: Exactly. And it’s fascinating that we sometimes talk about research being like a lonely journey. It can be, but it can also be such an empowering collaborative journey that you can build such incredible cool things when you bring people together—cross-disciplinary people together—to dream bigger and dream about new ideas and new ways of thinking. And, like, that’s why I also love talking to researchers here because they all have such unique perspectives and inner worlds and lives that are frankly so different from my own. And I think when they encounter me, they’re like, she’s very different from us, too.

GEHRKE: But I think these differences are our superpower, right, because …

LIU: Exactly. And that’s what brings us together.

GEHRKE: … they have to be bridged and that brings us together. Exactly. So how, I mean, if you think about Microsoft Research as over here. You’re here in Disney in Florida?

LIU: Yes, yes, yes. So …

GEHRKE: You had quite a few stops along the way.

LIU: I did have a lot of stops along the way.

GEHRKE: And very nonlinear also?

LIU: It was also very nonlinear. So Disney took me to the third, at the time, the third-largest toy company in the US, called JAKKS Pacific, where I worked on again, sort of, Disney-licensed and Mattel-licensed products, so “dress up and role play” toys is what we refer to them as. “Dress up” meaning, like, if you go to your local Target or Walmart or whatever, kind of, large store, they will have in their toy sections like dresses for Disney princesses, for example, or Disney fairies. Like, I worked on stuff like that, which is also very cool because, you know, usually around Halloween time here in the US is when I’m like, hey, I know that. And then that, kind of, took me to a video game accessory organization here in Woodinville.

GEHRKE: There’s the connection to tech starting to appear.

LIU: There’s a little bit connection of tech where I was like, I love video games! And I got to work on audio products there, as well, like headphones. And it was the first time I started working on things that, I’ll just say, had electrons running through them. So I had already worked on things that were, like, both soft lines—we refer to a soft line as bags and things that require, like, fabrics and textiles—and then I worked on hard lines, which were things that are more, things that are more physically rigid, like plastics. And so I was like, OK, well, I’ve worked on hard-lines-like stuff, and now I’m going to work on hard lines with electrons running through them. That’s kind of neat. And I learned all sorts of things about electricity. I was like, oh, this is weird and fascinating and circuits and … . And then I was like, well, this is cool, but … what else is there? And it took me to not a very well-known company in some circles, but a company called Fluke Corporation. Fluke is best known for its digital multimeters, and I worked there on their thermal imaging cameras. So it’s, for people who don’t know, it’s kind of like Predator vision. You can see what’s hot; you can see what’s not. It’s very cool. And Fluke spoke to me because their, you know, not only is their tagline “they keep your world up and running”; a lot of the things that Fluke does, especially when I heard stories from, like, electricians and technicians who use Fluke products, are like, this Fluke saved my life. I’m like, it did? What? And they’re like, you know, I was in a high-voltage situation, and I just wasn’t paying attention. I, you know, didn’t ground properly. And then there was an incident. But, you know, my multimeter survived, and more importantly, I survived. And you’re like, wow, like, that’s, that’s really cool. And so while I was at Fluke, they asked me if I wanted to work on a new IoT project. And I was like, I don’t even know what IoT is. “Internet of Things” … like, OK, well, you said “things” to me, and I like things. I like tangible things. Tell me more. And so that was, kind of, my first foray into things that had … of products with electrons on them with user interfaces and then also with software, like pure software, that were running on devices like your smartphones or your tablets or your computers. And so I started learning more about like, oh, what does software development look like? Oh, it’s a lot faster than hardware development. It’s kind of neat. And then that took me to SpaceX, of all places. It was super weird. Like, SpaceX was like, hey, do you want to come work in software here? I was like, but I’m not a rocket scientist. They’re like, you don’t need to be. I was like, huh, OK. And so I worked on Starlink before Starlink was a real thing. I worked on, kind of, the back-office systems for the ISP. I also worked on what we would refer to as our enterprise resource planning system that powers all of SpaceX. It’s called Warp Drive.

GEHRKE: That’s where you got all your software experience.

LIU: That’s where I learned all about software and working on complex systems, also monoliths and older systems, and how do you think about, you know, sometimes zero-fault tolerance systems and also, that also remain flexible for its users so they can move fast. And then from SpaceX, that took me to a startup called Likewise. It’s here in Bellevue. And then from the startup, I was like, I really like those people in Microsoft. I really want to work in research because they come up with all these cool ideas, and then they could do stuff with it. And I’m such an idea person, and maybe I’m pretty good at execution, but I love the idea side of things. And I discovered that over the course of my career, and that’s actually what brought me here to begin with.

GEHRKE: And that’s, sort of, your superpower that you bring now here. So if I think about a typical day, right, what do you do throughout, throughout your day? What is it, what is it to be a PM manager here at MSR?

LIU: So it’s funny because when I was just a PM and not a manager, I was more, kind of, figuring out, how do I make this product go? How do I make this product ship? How do I move things forward and empower organizations with the products that I—people and organizations on the planet to achieve more [with] what I’m working on? And now as a PM manager, I’m more empowering the people in my team to do that and thinking about uniquely like, who are they, what are their motivations, and then how do I help them grow, and then how do I help their products ship, and how do I help their teams cohere? And so really my day-to-day is so much less, like, being involved in the nitty-gritty details of any project at any point in time, but it’s really meeting with different people around Microsoft Research and just understanding, like, what’s going on and making sure that we’re executing on the impactful work that we want to move forward. You know, it’s boring to say it’s—it doesn’t sound very interesting. Like, mostly, it’s emails and meetings and talking, and, you know, talking to people one-on-one, occasionally writing documents and creating artifacts that matter. But more importantly, I would say it’s creating connections, helping uplift people, and making sure that they are moving and being empowered in the way that they feel that—to help them achieve more.

GEHRKE: That’s super interesting. Maybe in closing, do you have one piece of career advice for everybody, you know, anybody who’s listening? Because you have such an interesting nonlinear career, yet when you are at Disney you couldn’t probably … didn’t imagine that you would end up here at MSR, and you don’t know what, like, we had a little pre-discussion. You said you don’t know where you’re going to go next. So what’s your career advice for any listener?

LIU: I would say, you know, if you’re not sure, it’s OK to not be sure, and, you know, instead of asking yourself why, ask yourself why not. If you look at something and you’re like, hey, that job looks really cool, but I am so unqualified to do it for whatever reason you want to tell yourself, ask yourself why not. Even if it’s, you know, you’re going from toys to something in STEM, or, you know, I’m not a rocket scientist, but somehow, I can create value at SpaceX? Like, if you want to do it, ask yourself why not and try and see what happens. Because if you stop yourself at the start, before you even start trying, then you’re never going to find out what happens next.

[MUSIC]

GEHRKE: It’s just such an amazing note to end on. So thank you very much for the great conversation, Wei.

LIU: Yeah. Thanks, Johannes.

GEHRKE: To learn more about Wei or to see photos of her work and of her childhood in Silicon Valley, visit aka.ms/ResearcherStories (opens in new tab).

[MUSIC FADES]


[1] Liu notes the book was Journey to Topaz by Yoshiko Uchida and the subsequent book Journey Home.

The post What’s Your Story: Weishung Liu appeared first on Microsoft Research.

Read More

The Crossroads of Innovation and Privacy: Private Synthetic Data for Generative AI

The Crossroads of Innovation and Privacy: Private Synthetic Data for Generative AI


Introduction

In today’s data-driven world, organizations strive to leverage data to train and adapt AI models. However, this pursuit often faces an important challenge: balancing the value of data with the need to safeguard individuals’ right to privacy and comply with data privacy regulations like the General Data Protection Regulation (opens in new tab) (GDPR) and the EU AI Act (opens in new tab).

Synthetic data has emerged as a powerful solution to privacy and compliance challenges. It allows organizations to create realistic and useful datasets, tailored to specific use cases, without compromising individual privacy. This enables organizations to:

  • Train and adapt AI models: Synthetic data can be used to train and adapt models to specific domains and industries, even when real-world data is limited, or privacy concerns exist.
  • Comply with regulations: Since it doesn’t require user data, synthetic data generation helps organizations adhere to data privacy regulations.
  • Unlock new possibilities: Synthetic data opens doors to innovative AI applications that were previously limited by data availability or privacy constraints.

Microsoft’s Phi-3 (opens in new tab) small language model (SLM) is a good example of how synthetic data can contribute to responsible AI development, enabling the creation of powerful language models without compromising privacy. Phi-3 leverages a combination of “textbook quality” web data and LLM-generated synthetic content, creating a strategic approach that doesn’t need real-world personal data. 

However, synthetic data carries limitations. It can be difficult to artificially generate realistic data that anticipates a wide range of use cases and individual scenarios. Furthermore, synthetic data generated by pre-trained large language models (LLMs) can sometimes reduce accuracy and increase bias on downstream tasks (opens in new tab). So, how could we generate synthetic data that accurately captures the diversity and specificity of private data while maintaining strict privacy protections for data contributors?

Differential privacy: A bridge between innovation and privacy

Differentially private (DP) synthetic data generation is a promising solution. It allows developers to pursue innovations in machine learning while prioritizing privacy. The goal of synthetic data generation is to produce data statistically similar to real-world data sources. However, when the data is too similar, replicating uniquely identifying details of the source data, the promise of preserving privacy is compromised. This is where DP can help. DP is a mathematical framework for providing a guarantee that a particular computation is relatively invariant to the addition or removal of a single data contributor. Using DP techniques, researchers can generate synthetic datasets that retain the statistical properties of the original data while ensuring that information that could help identify data contributors remains obscured. 
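For readers who want the formal statement: a randomized mechanism M is (ε, δ)-differentially private if, for every pair of datasets D and D′ that differ only in the data of a single contributor, and for every set S of possible outputs,

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ.

This is the standard textbook definition rather than anything specific to the papers discussed below. Smaller ε and δ mean the output distribution is nearly unchanged when any one contributor’s data is added or removed, which is the precise sense in which DP synthetic data protects individuals.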

This blog post explores recent advancements in private synthetic data generation. We examine four recently published research papers that propose innovative techniques for generating synthetic data with strong privacy guarantees, while maintaining its usefulness for analytics, training AI models, and other tasks.

In the remainder of this blog post, we describe each approach in more detail, and present experimental results illustrating their value.

Technical deep dive: Differentially private synthetic data generation 

Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe

Generative LLMs offer the opportunity to produce synthetic text by sampling from LLM outputs. One avenue to generating realistic synthetic text is to fine-tune an LLM using representative data. For example, we could consider fine-tuning a pre-trained LLM on a corpus of scientific papers, enabling the model to more readily produce text that captures the knowledge and writing style used in scientific writing. Suppose, however, that we want to produce synthetic text based on a private corpus of documents. What steps can we take to protect the document authors and any sensitive information in their documents? For example, we may want to produce synthetic medical notes, or personal emails. LLMs have a well-known capacity to memorize training examples, and a model with the potential for reproducing samples from the training set might pose significant privacy risks.

In the paper Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe, researchers from Microsoft presented an approach to leveraging a private data corpus for synthetic generation, without compromising the privacy of the data subjects. This approach uses differentially private stochastic gradient descent (DP-SGD) to fine-tune an LLM on the private documents with a strong privacy guarantee. Differentially private model training provides a mathematical guarantee that the trained model parameters, and any subsequent model outputs, are relatively unaffected by the addition or removal of any single user’s training examples.
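To make the mechanics concrete, here is a minimal, illustrative sketch of a single DP-SGD step in PyTorch: compute each example’s gradient, clip it to a fixed L2 norm, sum the clipped gradients, add Gaussian noise, and apply the noisy average as an ordinary optimizer update. This is a toy rendering of the general technique, not the code used in the paper; production implementations (for example, the Opacus library) vectorize the per-example gradients and use a privacy accountant to translate the noise multiplier and sampling rate into a concrete (ε, δ) guarantee. All names and default values below are illustrative assumptions.

import torch

def dp_sgd_step(model, criterion, xs, ys, optimizer,
                max_grad_norm=1.0, noise_multiplier=1.1):
    # One illustrative DP-SGD step; assumes every parameter receives a gradient.
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    for x, y in zip(xs, ys):                      # per-example gradients
        optimizer.zero_grad()
        loss = criterion(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        # Clip this example's total gradient to L2 norm <= max_grad_norm.
        norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in params)).item()
        scale = min(1.0, max_grad_norm / (norm + 1e-12))
        for s, p in zip(summed, params):
            s += scale * p.grad

    # Gaussian noise on the summed, clipped gradients masks any one example.
    for p, s in zip(params, summed):
        noise = torch.randn_like(s) * noise_multiplier * max_grad_norm
        p.grad = (s + noise) / len(xs)
    optimizer.step()                              # ordinary optimizer update

Because each example’s influence on the update is capped by the clipping norm, and the noise is calibrated to that cap, no single document can substantially change the fine-tuned model, which is the source of the guarantee described above.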

The synthetic generation approach described in this work was validated by training on restaurant reviews with varying levels of privacy protection, then prompting the model to generate novel reviews. These reviews were then used for downstream classification tasks, such as sentiment prediction and restaurant genre classification, and the results, which are shown in Table 1, demonstrated only small accuracy penalties compared to training on the raw private data. This approach unlocks a powerful way for realistic synthetic data to be generated from private data without compromising privacy or confidentiality.

A flow chart with four successive blocks. Starting with a data owner, private data is provisioned to train a language model with differential privacy. The language model is subsequently prompted to generate novel synthetic data resembling the private data. This data can be used for down-stream applications such as machine learning, feedback analysis or statistical analysis.
Figure 1: By fine-tuning an LLM with differential privacy, the model can be used to generate synthetic examples that resemble the private corpus 
A table of results with five columns and four rows. The columns indicate data type, data generator, epsilon, rating, and category. The first row indicates “original” data type, with no entry for data generator or epsilon; the rating is 0.733 and the category is 0.775. The following three rows all indicate Synthetic for data type, with GPT2, GPT2-Medium, and GPT2-Large as the data generator. Each of these rows is further divided into two rows corresponding to epsilon = 4 and epsilon = infinity, respectively. In all cases the rating and category scores are lower than the row marked original by a few percentage points, and the rows corresponding to epsilon = 4 are lower than the corresponding rows marked epsilon = infinity by 1-2 percentage points. In general, the epsilon = 4 rows show increased scores for larger GPT2 models, while the epsilon = infinity rows are relatively flat.
Table 1: Various versions of GPT-2 were trained on restaurant reviews both with (ε = 4) and without (ε = ∞) a privacy guarantee. These models were used to produce synthetic training sets, which were used to train classification models for review rating and restaurant category, and subsequently evaluated for accuracy on a private hold-out set. The results show that models trained on the synthetic data can achieve accuracy competitive with models trained without a privacy guarantee.

Differentially Private Synthetic Data via Foundation Model APIs

While the ACL paper demonstrated a robust approach to synthetic data generation, fine-tuning a large model can be impractical: model training requires significant computing capacity, and some of the most powerful available models are proprietary and not accessible for DP training. Recognizing this challenge, researchers at Microsoft explored whether synthetic data could be generated using only inference API access to a model, even when the model is untrusted and controlled by a third party. Crucially, the synthetic data should resemble a targeted private corpus and carry a DP guarantee similar to the one achieved in the previous work based on model training. In two separate papers, the authors demonstrate a solution to this problem using a differentially private sampling method called Private Evolution (PE).

Two independent flow charts. In the first, private data is applied to a pre-trained model using DP-SGD. The fine-tuned model is used to produce differentially private synthetic data.  In the second chart, a pre-trained model is prompted via its API to produce generic data. Private data is used to inform selection of the generated data, with a strong privacy guarantee, yielding differentially private synthetic data.
Figure 2: Instead of fine-tuning pre-trained models with DP-SGD (top figure), Private Evolution (PE) only requires accessing the inference APIs of a model (bottom figure). Thus, PE is easily compatible with foundation models that are difficult to DP-fine-tune (e.g., because they are too large) or infeasible to fine-tune (e.g., they are only accessible through inference APIs).

Synthetic image generation using foundation model APIs: In Differentially Private Synthetic Data via Foundation Model APIs 1: Images, the authors introduced Private Evolution (PE), an approach that enables DP image synthesis merely through the inference APIs of a generative model. PE operates by sampling from a pre-trained diffusion model such as Stable Diffusion, which has no knowledge of the private corpus. PE then iteratively compares these samples to the private corpus, keeps those most similar to it, and uses the pre-trained model to generate more such samples. Crucially, the comparison to the private corpus is done with a DP guarantee, so that any information revealed about the private corpus is strictly bounded. Also, all the queries to the foundation model APIs satisfy the same DP guarantee, so that we can safely use APIs provided by (untrusted) third parties.

Figure 3: Overview of PE. We use two private and synthetic images for illustration. Step 1 (RANDOM_API): we use the model API to generate random images. Step 2: We iteratively go through steps 2.1-2.3 to refine the synthetic images towards the private images. Step 2.1: Each private image votes for its closest synthetic image in the embedding space. In this example, we assume that the bird image gets two votes, and the car image gets zero votes. We then add Gaussian noise to the votes to ensure DP. This gives us the DP Nearest Neighbor Histogram (DP_NN_HISTOGRAM). Step 2.2: We resample the generated images proportional to the histogram. We assume that only the bird image remains. Step 2.3 (VARIATION_API): We use the model API to generate new images similar to the bird image, which become the initial synthetic images in the next iteration.
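
The loop is simple enough to sketch end to end. In the toy sketch below, vectors stand in for images, and the stubs random_api, variation_api, and embed stand in for the generative model's inference APIs and the embedding network (all names and parameter values are illustrative assumptions, not the paper's implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    D = 8  # dimension of the toy "image" space

    def random_api(n):
        return rng.normal(0.0, 1.0, size=(n, D))   # RANDOM_API stand-in

    def variation_api(x):
        return x + rng.normal(0.0, 0.1, size=D)    # VARIATION_API stand-in

    def embed(samples):
        return samples   # identity embedding in this toy example

    def private_evolution(private, n_synthetic=32, n_iters=20, sigma=1.0):
        synthetic = random_api(n_synthetic)                    # step 1
        for _ in range(n_iters):
            votes = np.zeros(n_synthetic)
            for p in embed(private):                           # step 2.1: nearest-neighbor votes
                votes[np.argmin(np.linalg.norm(embed(synthetic) - p, axis=1))] += 1
            votes += rng.normal(0.0, sigma, size=n_synthetic)  # Gaussian noise for DP
            probs = np.clip(votes, 0.0, None)
            if probs.sum() == 0:
                probs = np.ones(n_synthetic)
            probs /= probs.sum()
            keep = rng.choice(n_synthetic, size=n_synthetic, p=probs)           # step 2.2
            synthetic = np.stack([variation_api(synthetic[i]) for i in keep])   # step 2.3
        return synthetic

    private = rng.normal(3.0, 0.5, size=(64, D))   # toy "private corpus"
    synthetic = private_evolution(private)         # drifts toward the private data

Because the private data enters only through the noisy vote histogram, the privacy cost of the whole loop can be accounted for with standard DP composition.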

Even without doing any model training, PE significantly advances state-of-the-art results on some datasets. For example, on the CIFAR10 dataset (opens in new tab), we achieve an FID score (an image quality measure; smaller is better) of ≤ 7.9 at a DP privacy cost of ϵ = 0.67, significantly improving on the previous SOTA, which required ϵ = 32 for comparable quality. In the paper, we also show that PE requires fewer computational resources (GPU hours) than DP fine-tuning to achieve such results.

A 2D line chart with six line series, comprising conditional and unconditional variations of the Private Evolution and DP-MEPF methods, as well as DP-GAN and DP-Diffusion. The x axis presents values of epsilon from 0 to 32. The y axis presents values of the image quality measure FID from 0 to 80, where lower values are better. All six series show decreasing FID for increasing epsilon. Both series corresponding to Private Evolution show significantly lower FID values, spanning roughly epsilon = 0.1 to epsilon = 2.
Figure 4: FID (image quality measure, lower is better) vs. DP privacy cost ϵ on CIFAR10 (δ = 10⁻⁵). (Un)cond means (un)conditional generation. Ours achieves the best privacy-quality trade-off compared to prior training-based approaches.
An array of ten rows of thumbnails, each row depicting ten instances of generated synthetic images. The rows include birds, cars, cats, dogs, and other animals, planes, boats and trucks.  Most of the images appear to be realistic with some exhibiting unusual artifacts.
Figure 5: Private Evolution-generated samples using CIFAR-10 as the private corpus (ε = 0.67, δ = 10⁻⁵). Each row corresponds to one object class.

Synthetic text generation using foundation model APIs: The PE approach described above works well for images, since it is easy to produce nearby perturbations of promising images. In Differentially Private Synthetic Data via Foundation Model APIs 2: Text, Microsoft researchers explored whether a similar approach could be applied to text. Their method, called Augmented Private Evolution (Aug-PE), operates similarly to the basic PE approach but leverages the power of a pre-trained LLM to produce variations and re-wordings of input text. Aug-PE also proposes some fundamental algorithmic improvements that may benefit future development of PE.

An overview of the Augmented Private Evolution algorithm for synthetic text generation. Step 1 invokes a language model to produce random text. Step 2.1 uses private data and differential privacy to vote on the best candidates from step 1. Step 2.2 samples from this differentially private histogram to produce a selected set of generations. Step 2.3 prompts a language model to produce variants of the selected generations, and steps 2.1 to 2.3 are repeated.
Figure 6: Augmented Private Evolution (Aug-PE) leverages a foundation LLM to synthesize text and compare it with a private corpus in a privacy-preserving way. As with PE for images, samples that more closely resemble the private data are retained and refined to produce new synthetic text with a strong privacy guarantee. The illustration shows how we generate DP synthetic reviews for restaurants given two private samples.
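
As a rough sketch, only the two generation primitives change relative to the image case; for example, with an OpenAI-style client (the model name, prompt wording, and function names below are illustrative assumptions, not the paper's prompts):

    from openai import OpenAI

    client = OpenAI()  # assumes an API key is configured in the environment

    def complete(prompt):
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    def random_api_text(n):
        # Step 1: generic candidates generated with no private information.
        return [complete("Write one short, realistic restaurant review.")
                for _ in range(n)]

    def variation_api_text(review):
        # Step 2.3: reword a surviving candidate while preserving its style.
        return complete("Rewrite this restaurant review with different wording "
                        "but the same sentiment and style:\n\n" + review)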

Results show that Aug-PE is a promising alternative to DP fine-tuning for DP text synthesis. With the same foundation model, Aug-PE can match or even beat DP fine-tuning on the trade-off between text quality and privacy. Moreover, because Aug-PE only requires inference APIs, it can easily work with the most advanced LLMs, such as GPT-3.5, LLaMA, and Mixtral, to further improve text quality. In terms of computational cost (GPU hours), Aug-PE achieves up to a 65.7x speedup compared with the DP fine-tuning approach.

A table of results for area and rating classification accuracy for a variety of models, comparing PE with DP fine-tuning. The table contains the remark that, with the same model, PE matches or beats DP fine-tuning on text quality vs. privacy, and that PE works well with advanced LLMs which may be challenging or impossible to fine-tune. The models compared include three sizes of GPT-2, several major open-source models, and GPT-3.5. PE on the Mixtral model shows the strongest area classification accuracy at 43.6, while PE on GPT-3.5 shows the strongest rating classification accuracy at 43.1.
Table 2: Results on ICLR 2023 paper reviews (ϵ = 1). We use each method to generate DP synthetic paper reviews, test the utility of the data by training downstream paper area or rating classifiers, and evaluate their accuracy on the real hold-out data (higher is better). Under the same base model (the GPT-2 family), PE achieves results competitive with DP fine-tuning. PE also supports advanced LLMs that may be challenging to use with DP fine-tuning due to large model sizes or black-box access.

Privacy-Preserving In-Context Learning with Differentially Private Few-Shot Generation

In-context learning is a technique for performing tasks with an LLM by providing a sample of demonstration examples in the LLM’s prompt before presenting it with a specific task. For example, we might show a few movie plots with their genres and ask the LLM to suggest the genre for a particular plot of interest. In-context learning harnesses the strong generalization capabilities of LLMs, but it requires a sample of labeled demonstration examples at inference time. How can we perform in-context learning when the only available labeled examples are private? A naïve solution might be to use the private examples but hide the demonstration prompt from the user. However, the threat posed by jailbreak attacks puts these examples at risk of exposure to a malicious user.

In Privacy-Preserving In-Context Learning with Differentially Private Few-Shot Generation, Microsoft researchers explored how demonstration examples can be synthesized from a private corpus with a privacy guarantee. The method operates by incrementally drawing samples from a token distribution defined by the private examples, with noise added to the distribution; the noise is calibrated to ensure a bound on the privacy lost with each sample. The research demonstrated that DP in-context learning can outperform zero-shot learning (querying a model without any demonstration examples) and comes close to matching performance in the case with no privacy mitigations, as shown in Table 3.

An overview of differentially private few-shot generation.  A round of token generation is depicted with four steps. Given the tokens generated so far, step 1 selects the relevant private data. Step 2 takes an M by N sample of the private data, producing M batches of N examples. Step 3 assembles M LLM prompts with task instructions and the N examples appended. Step 4 feeds the M prompts to the LLM and performs noisy aggregation over the LLM’s output probabilities to select the next generated token.
Figure 7: Illustration of DP few-shot generation. The example shows a synthetic demonstration generated token by token for the topic school with a differentially private guarantee. As new tokens are sampled, the private examples inform the sampling probability of each subsequent token, with noise injected to preserve privacy. 
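
A schematic sketch of the per-token mechanism appears below; next_token_probs is a placeholder for querying the LLM's next-token distribution under one of the M prompts, and the vocabulary size and noise scale are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    VOCAB = 50_000   # illustrative vocabulary size

    def next_token_probs(private_batch, generated):
        # Placeholder: a real implementation prompts the LLM with task
        # instructions, this batch's private examples, and the tokens
        # generated so far, returning its next-token distribution.
        logits = rng.normal(size=VOCAB)
        return np.exp(logits) / np.exp(logits).sum()

    def dp_sample_next_token(private_batches, generated, sigma=0.05):
        # Average the M per-batch distributions; each private example sits in
        # exactly one batch, which bounds its influence on the average.
        probs = np.mean([next_token_probs(b, generated) for b in private_batches],
                        axis=0)
        probs = probs + rng.normal(0.0, sigma, size=VOCAB)  # noise for the DP guarantee
        probs = np.clip(probs, 0.0, None)
        return int(rng.choice(VOCAB, p=probs / probs.sum()))
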
A table of results for private in-context learning tasks, including text classification on three datasets (AGNews, DBPedia, and TREC) and information extraction on two datasets (MIT-G and MIT-D). Accuracy is compared across two cases with epsilon = 0 (zero-shot and four-shot) and values of epsilon at 1, 2, 4, 8, and infinity. Generally, accuracy improves as epsilon increases, but epsilon = 8 often outperforms epsilon = infinity.
Table 3: For classification and information extraction tasks, DP in-context learning achieves accuracy similar to non-private ICL (ϵ = ∞).

Conclusion

Synthetic data generation presents enormous opportunities to develop AI systems without compromising end-user privacy. In this blog post, we have explored recent innovations in synthetic data generation with strong privacy guarantees. These approaches enable practitioners to produce synthetic data from private sources while mitigating the risk that private information might be revealed. While these approaches are highly promising, they do have limitations; for example, they are currently limited to producing relatively short text passages. Future work will continue to explore the opportunities presented by these approaches, with the aim of producing increasingly realistic data with strong privacy guarantees.

Acknowledgments: The authors are grateful for the contributions of the co-authors of the papers reviewed in this blog post: Xiang Yue, Xuechen Li, Girish Kumar, Julia McAnallen, Hoda Shajari, Huan Sun, David Levitan, Chulin Xie, Arturs Backurs, Sivakanth Gopi, Da Yu, Harsha Nori, Haotian Jiang, Huishuai Zhang, Yin Tat Lee, Bo Li, Janardhan Kulkarni, Xinyu Tang, Richard Shin, Andre Manoel, and Niloofar Mireshghallah.
