
Efficient and hardware-friendly neural architecture search with SpaceEvo

October 6, 2023

by Alyssa Hughes Microsoft AI

This research paper was presented at the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), a premier academic conference for computer vision.

In the field of deep learning, where breakthroughs like the models ResNet and BERT have achieved remarkable success, a key challenge remains: developing efficient deep neural network (DNN) models that both excel in performance and minimize latency across diverse devices. To address this, researchers have introduced hardware-aware neural architecture search (NAS) to automate efficient model design for various hardware configurations. This approach involves a predefined search space, search algorithm, accuracy estimation, and hardware-specific cost prediction models.

However, optimizing the search space itself has often been overlooked. Current efforts rely mainly on MobileNets-based search spaces designed to minimize latency on mobile CPUs. But manual designs may not always align with different hardware requirements, limiting their suitability for a diverse range of devices.

In the paper, “SpaceEvo: Hardware-Friendly Search Space Design for Efficient INT8 Inference,” presented at ICCV 2023, we introduce SpaceEvo, a novel method that automatically creates specialized search spaces optimized for efficient INT8 inference on specific hardware platforms. What sets SpaceEvo apart is its ability to perform this design process automatically, creating a search space tailored for hardware-specific, quantization-friendly NAS.

Notably, SpaceEvo’s lightweight design makes it ideal for practical applications, requiring only 25 GPU hours to create a hardware-specific solution and making it a cost-effective choice for hardware-aware NAS. This specialized search space, with hardware-preferred operators and configurations, enables the exploration of larger, more efficient models with low INT8 latency. Figure 1 demonstrates that our search space consistently outperforms existing alternatives in INT8 model quality. Conducting neural architecture searches within this hardware-friendly space yields models that set new INT8 accuracy benchmarks.

Figure 1. Error distribution of INT8 quantized models across various NAS search spaces. Our search space consistently outperforms state-of-the-art alternatives in INT8 model quality. Image description: four sub-figures, each illustrating the model accuracy error distribution when sampling models within an INT8 quantized latency budget (10 ms on a VNNI CPU, 15 ms on a VNNI CPU, 10 ms on a Pixel 4 CPU, and 20 ms on a Pixel 4 CPU) for various search spaces. Each sub-figure contains four to five curves, representing model accuracy error distributions from our search space and from the ProxylessNAS, MobileNetV3, ResNet, and AttentiveNAS search spaces.

On-device quantization latency analysis

We began our investigation by trying to understand INT8 quantized latency factors and their implications for search space design. We conducted our study on two widely used devices: an Intel CPU with VNNI instructions and onnxruntime support, and a Pixel 4 phone CPU with TFLite 2.7.

Our study revealed two critical findings:

Both the choice of operator type and configurations, like channel width, significantly affect INT8 latency, as illustrated in Figure 2. For instance, operators like Squeeze-and-Excitation and Hardswish, while enhancing accuracy with minimal added latency, can lead to slower INT8 inference on Intel CPUs. This slowdown primarily arises from the added costs of data transformation between INT32 and INT8, which outweigh the latency reduction achieved through INT8 computation.
Quantization efficiency varies among different devices, and preferred operator types can be contradictory.
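
To make the first finding concrete, here is a minimal sketch of a toy latency model in Python. It is not the measurement methodology from the paper: the function `int8_speedup`, the operator names, and every number below are assumptions invented for illustration. It only shows how a fixed data-conversion cost can cancel out the benefit of faster INT8 computation for a lightweight operator.

```python
def int8_speedup(fp32_ms: float, compute_speedup: float, conversion_ms: float) -> float:
    """Return FP32 latency divided by the estimated INT8 latency.

    The INT8 estimate is the accelerated compute time plus a fixed cost for
    converting tensors between integer formats around the operator.
    """
    int8_ms = fp32_ms / compute_speedup + conversion_ms
    return fp32_ms / int8_ms


# Hypothetical operators: (FP32 latency in ms, INT8 compute speedup, conversion cost in ms).
# These values are invented for illustration and are not measurements.
operators = {
    "compute_heavy_conv": (2.0, 4.0, 0.1),
    "light_elementwise_op": (0.2, 2.0, 0.3),
}

speedups = {name: int8_speedup(*spec) for name, spec in operators.items()}

for name, value in speedups.items():
    verdict = "faster" if value > 1.0 else "slower"
    print(f"{name}: {value:.2f}x ({verdict} in INT8)")
```

In this toy setting the compute-heavy operator still benefits from quantization, while the lightweight one ends up slower than its FP32 version because the conversion cost dominates.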

Figure 2. Left: Selecting different operator types results in notably distinct quantized speed improvements. Right: Conv1x1 speed enhancements across various channel numbers. Image description: a table (left) and a figure (right).

Finding diverse, efficient quantized models with SpaceEvo

Unlike traditional architecture search, which aims to find the best single model, our objective is to uncover a diverse population of billions of accurate and INT8 latency-friendly architectures within the search space.

Drawing inspiration from neural architecture search, we introduced an evolutionary search algorithm to explore this quantization-friendly model population in SpaceEvo. Our approach incorporated three key techniques:

The introduction of the Q-T score as a metric to measure the quantization-friendliness of a candidate search space, based on the INT8 accuracy-latency of top-tier subnets.
Redesigned search algorithms that focus on exploring a collection of model populations (i.e., the search space) within the vast hyperspace, as illustrated in Figure 3. This is achieved through the “elastic stage,” which divides the search space into a sequence of elastic stages, allowing traditional evolution methods like aging evolution to explore effectively.
A block-wise search space quantization scheme to reduce the training costs associated with exploring a search space that has a maximum Q-T score.
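
The search over elastic stages can be sketched with a generic aging evolution loop. The Python below is a simplified illustration under stated assumptions, not SpaceEvo's implementation: the stage options (`op_a` to `op_d`), the number of stages, and `toy_quality_score` are hypothetical stand-ins, and the scoring function is not the Q-T score formula from the paper.

```python
import collections
import random

# Hypothetical per-stage block choices; the real hyperspace is far larger.
STAGE_CHOICES = ["op_a", "op_b", "op_c", "op_d"]
NUM_STAGES = 6


def toy_quality_score(space):
    """Stand-in for a Q-T-style quality score (NOT the paper's formula)."""
    preference = {"op_a": 1.0, "op_b": 0.8, "op_c": 0.5, "op_d": 0.2}
    return sum(preference[block] for block in space) / len(space)


def mutate(space, rng):
    """Change the block choice of one randomly selected elastic stage."""
    child = list(space)
    idx = rng.randrange(len(child))
    child[idx] = rng.choice([c for c in STAGE_CHOICES if c != child[idx]])
    return tuple(child)


def aging_evolution(score_fn, cycles=200, population_size=20, sample_size=5, seed=0):
    """Aging evolution: mutate the best of a random sample, retire the oldest."""
    rng = random.Random(seed)
    population = collections.deque()
    history = []
    # Start from randomly generated candidate search spaces.
    while len(population) < population_size:
        space = tuple(rng.choice(STAGE_CHOICES) for _ in range(NUM_STAGES))
        entry = (score_fn(space), space)
        population.append(entry)
        history.append(entry)
    for _ in range(cycles):
        sample = rng.sample(list(population), sample_size)
        parent = max(sample, key=lambda e: e[0])
        child = mutate(parent[1], rng)
        entry = (score_fn(child), child)
        population.append(entry)
        population.popleft()  # the oldest candidate is removed, regardless of score
        history.append(entry)
    return max(history, key=lambda e: e[0])


best_score, best_space = aging_evolution(toy_quality_score)
print(best_score, best_space)
```

Each candidate here is a whole search space (one block choice per stage) rather than a single model, which mirrors the idea of evolving a collection of model populations. In practice the score would come from evaluating the INT8 accuracy and latency of subnets sampled from each candidate space.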

After discovering the search space, we employed a two-stage NAS process to train a quantized-for-all supernet over the search space. This ensured that all candidate models could achieve comparable quantized accuracy without individual fine-tuning or quantization. We utilized evolutionary search and nn-Meter for INT8 latency prediction to identify the best quantized models under various INT8 latency constraints. Figure 3 shows the overall design process.
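
The final selection step can be pictured as choosing the most accurate candidate whose predicted INT8 latency fits a budget. The sketch below is a simplified illustration: it enumerates a tiny hypothetical space instead of running evolutionary search, and `toy_latency_ms` and `toy_accuracy` are invented placeholders that stand in for a latency predictor such as nn-Meter and for supernet-based accuracy estimation. It does not use the nn-Meter API.

```python
import itertools
from typing import Callable, Iterable, Optional, Tuple

Arch = Tuple[int, ...]  # hypothetical encoding: one width multiplier per stage


def best_under_budget(
    candidates: Iterable[Arch],
    predict_latency_ms: Callable[[Arch], float],
    estimate_accuracy: Callable[[Arch], float],
    budget_ms: float,
) -> Optional[Arch]:
    """Return the most accurate candidate whose predicted latency fits the budget."""
    feasible = [arch for arch in candidates if predict_latency_ms(arch) <= budget_ms]
    if not feasible:
        return None
    return max(feasible, key=estimate_accuracy)


def toy_latency_ms(arch: Arch) -> float:
    # Placeholder predictor: latency grows linearly with total width.
    return 1.0 * sum(arch)


def toy_accuracy(arch: Arch) -> float:
    # Placeholder estimator: accuracy grows with total width, with diminishing returns.
    return 70.0 + sum(arch) ** 0.5


search_space = list(itertools.product((1, 2, 4), repeat=3))

for budget in (2.0, 6.0, 12.0):
    choice = best_under_budget(search_space, toy_latency_ms, toy_accuracy, budget)
    print(budget, choice)
```

Running the same selection under several budgets yields one model per latency constraint, which is the shape of the results reported per device in Table 1.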

Figure 3: The complete SpaceEvo process and application for NAS. Image description: a flowchart that outlines the complete SpaceEvo process and its application for NAS. Starting with a large hyperspace, an evolution search algorithm explores a candidate search space. A quality estimator then assesses its quality score based on INT8 latency and accuracy. This score is used as a reward for the algorithm, guiding further exploration until a suitable search space is found. A quantized-for-all supernet is then trained over this space, enabling hardware-aware NAS for deploying models within various INT8 latency constraints.

Extensive experiments on two real-world edge devices and ImageNet demonstrated that our automatically designed search spaces significantly surpass manually designed search spaces. Table 1 showcases our discovered models, SEQnet, setting new benchmarks for INT8 quantized accuracy-latency tradeoffs.

(a) Results on the Intel VNNI CPU with onnxruntime
Model	INT8 Top-1 Acc (%)	INT8 Latency	Speedup	FP32 Top-1 Acc (%)	FLOPs
MobileNetV3Small	66.3	4.4 ms	1.1x	67.4	56M
SEQnet@cpu-A0	74.7	4.4 ms	2.0x	74.8	163M
MobileNetV3Large	74.5	10.3 ms	1.5x	75.2	219M
SEQnet@cpu-A1	77.4	8.8 ms	2.4x	77.5	358M
FBNetV3-A	78.2	27.7 ms	1.3x	79.1	357M
SEQnet@cpu-A4	80.0	24.4 ms	2.4x	80.1	1267M

(b) Results on the Google Pixel 4 with TFLite
Model	INT8 Top-1 Acc (%)	INT8 Latency	Speedup	FP32 Top-1 Acc (%)	FLOPs
MobileNetV3Small	66.3	6.4 ms	1.3x	67.4	56M
SEQnet@pixel4-A0	73.6	5.9 ms	2.1x	73.7	107M
MobileNetV3Large	74.5	15.7 ms	1.5x	75.2	219M
EfficientNet-B0	76.7	36.4 ms	1.7x	77.3	390M
SEQnet@pixel4-A1	77.6	14.7 ms	2.2x	77.7	274M

Table 1. Our automated search spaces outperformed manual ones in ImageNet results on two devices. Speedup: INT8 latency compared with FP32 inference.
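
As a reading aid for Table 1, the short Python sketch below derives two quantities from a few of the reported rows: the accuracy lost to quantization (FP32 minus INT8 top-1) and the FP32 latency implied by the speedup column. It assumes that speedup means FP32 latency divided by INT8 latency, which is our reading of the caption rather than something the table states explicitly, so the implied FP32 latencies are approximations.

```python
# (model, INT8 top-1 %, INT8 latency ms, speedup, FP32 top-1 %), copied from Table 1(a).
rows = [
    ("MobileNetV3Small", 66.3, 4.4, 1.1, 67.4),
    ("SEQnet@cpu-A0", 74.7, 4.4, 2.0, 74.8),
    ("MobileNetV3Large", 74.5, 10.3, 1.5, 75.2),
    ("SEQnet@cpu-A1", 77.4, 8.8, 2.4, 77.5),
]

derived = {}
for name, int8_acc, int8_ms, speedup, fp32_acc in rows:
    # Assumption: speedup = FP32 latency / INT8 latency.
    implied_fp32_ms = int8_ms * speedup
    accuracy_drop = fp32_acc - int8_acc
    derived[name] = (implied_fp32_ms, accuracy_drop)
    print(f"{name}: implied FP32 latency {implied_fp32_ms:.1f} ms, "
          f"quantization accuracy drop {accuracy_drop:.1f} points")
```

Under that assumption, the SEQnet rows lose about 0.1 points of accuracy when quantized, compared with roughly 0.7 to 1.1 points for the MobileNetV3 baselines, while gaining a larger speedup.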

Potential for sustainable and efficient computing

SpaceEvo is the first attempt to address the hardware-friendly search space optimization challenge in NAS, paving the way for designing effective low-latency DNN models for diverse real-world edge devices. Looking ahead, the implications of SpaceEvo reach far beyond its initial achievements. Its potential extends to applications for other crucial deployment metrics, such as energy and memory consumption, enhancing the sustainability of edge computing solutions.

We are exploring adapting these methods to support diverse model architectures like transformers, further expanding its role in evolving deep learning model design and efficient deployment.

The post Efficient and hardware-friendly neural architecture search with SpaceEvo appeared first on Microsoft Research.

HoloAssist: A multimodal dataset for next-gen AI copilots for the physical world

October 5, 2023

by Alyssa Hughes Microsoft AI

This research paper was presented at the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), a premier academic conference for computer vision.

When was the last time you were faced with a task you had no clue how to tackle? Maybe it was fixing a broken bike, replacing a printer toner, or making a cup of espresso? In such circumstances, your usual options might include reaching out to a knowledgeable friend or relative for assistance. Alternatively, you might resort to scouring the internet, conducting a web search, posing questions on online forums, or seeking out relevant instructional videos. But what if there were another option? What if you could turn to an AI assistant, or copilot, for help?

AI in the real world

Our daily lives are filled with a wide range of tasks, both for work and leisure, spanning the digital and physical realms. We often find ourselves in need of guidance to learn and carry out these tasks effectively. Recent advances in AI, particularly in the areas of large language and multimodal models, have given rise to intelligent digital agents. However, when it comes to the physical world, where we perform a significant number of our tasks, AI systems have historically faced greater challenges.

A longstanding aspiration within the AI community has been to develop an interactive AI assistant capable of perceiving, reasoning, and collaborating with people in the real world. Whether it’s scenarios like autonomous driving, robot navigation and manipulation, hazard detection in industrial settings, or support and guidance for mixed-reality tasks, progress in physical activities has been slower and more incremental compared with their fully digital counterparts.

The promise and challenge of interactive AI “copilots”

There is great potential for developing interactive AI copilots to assist people with real-world tasks, but there are also obstacles. The key challenge is that current state-of-the-art AI assistants lack firsthand experience in the physical world. Consequently, they cannot perceive the state of the real world and actively intervene when necessary. This limitation stems from a lack of training on the specific data required for perception, reasoning, and modeling in such scenarios. In terms of AI development, there’s a saying that “data is king.” This challenge is no exception. To advance interactive AI agents for physical tasks, we must thoroughly understand the problem domain and establish a gold standard for copilots’ capabilities.

A new multimodal interactive dataset

As a first step in this direction, we are excited to share our paper, “HoloAssist: an Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World,” presented at ICCV 2023. HoloAssist is a large-scale egocentric, or first-person, human interaction dataset, where two people collaboratively execute physical manipulation tasks. A task performer executes a task while wearing a mixed-reality headset that captures seven synchronized data streams, as shown in Figure 1. Simultaneously, a task instructor observes the performer’s first-person video feed in real time and offers verbal instruction.

Figure 1: HoloAssist features a two-person interactive assistive task-completion setting. Image description: a task performer wears a mixed-reality headset while an instructor watches the first-person video feed and provides instructions. Eight modalities are captured: RGB, eye gaze, hand pose, head pose, depth, IMU, audio, and text transcription.

HoloAssist contains a large collection of data, comprising 166 hours of recordings involving 222 diverse participants. These participants form 350 distinct instructor-performer pairs carrying out 20 object-centric manipulation tasks. Video 1 shows how tasks are recorded, while Figure 2 provides a task breakdown. The objects range from common electronic devices to rarer items found in factories and specialized labs. The tasks are generally quite demanding, often requiring instructor assistance for successful completion. To provide comprehensive insights, we’ve captured seven different raw sensor modalities: RGB, depth, head pose, 3D hand pose, eye gaze, audio, and IMU. These modalities help in understanding human intentions, estimating world states, predicting future actions, and more. Finally, the eighth modality is an augmentation with third-person manual annotations, consisting of a text summary, intervention types, mistake annotations, and action segments, as illustrated in Figure 3.

Video 1: A sampling of task recordings showcasing color and depth, two of the eight modalities.

Figure 2: Data distribution captured in HoloAssist. On the left, the number of sessions per activity. On the right, the total session length in minutes. Image description: there are 20 tasks: GoPro, Nintendo Switch, DSLR, portable printer, computer, Nespresso machine, standalone printer, big coffee machine, IKEA furniture (stool, utility cart, tray table, nightstand), NavVis laser scanner, ATV motorcycle, wheel belt, and circuit breaker. There are between 25 and 180 sessions per activity, and sessions range from 47 to 1390 minutes.

Figure 3: HoloAssist includes action and conversational annotations, and it also provides summaries of videos indicating mistakes and interventions during tasks. Each action is tagged with a “mistake” or “correct” attribute, while spoken statements are labeled with intervention types. The image shows examples of each of these.
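
One way to picture the annotation layer described above is as a small record type per session. The Python sketch below is purely illustrative: the class names, field names, label strings, and example values are all hypothetical and do not reflect the actual file format or schema of the released dataset.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionSegment:
    start_s: float
    end_s: float
    label: str
    is_mistake: bool  # each action carries a "mistake" or "correct" attribute


@dataclass
class Utterance:
    start_s: float
    text: str
    intervention_type: str  # hypothetical label value, not the dataset's taxonomy


@dataclass
class SessionAnnotation:
    task: str
    summary: str
    actions: List[ActionSegment] = field(default_factory=list)
    utterances: List[Utterance] = field(default_factory=list)

    def mistake_rate(self) -> float:
        """Fraction of annotated actions that are tagged as mistakes."""
        if not self.actions:
            return 0.0
        return sum(a.is_mistake for a in self.actions) / len(self.actions)


# Invented example session; none of these values come from the dataset.
session = SessionAnnotation(
    task="example task",
    summary="The performer completes the task after one correction.",
    actions=[
        ActionSegment(0.0, 4.5, "pick up part", is_mistake=False),
        ActionSegment(4.5, 9.0, "insert part", is_mistake=True),
    ],
    utterances=[Utterance(9.2, "Try turning it the other way.", "correction")],
)

print(session.mistake_rate())
```

A structure along these lines makes it straightforward to pair each mistake with the instructor statement that follows it, which is the kind of signal that mistake detection and intervention prediction rely on.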

Towards proactive AI assistants

Our work builds on previous advancements in egocentric vision and embodied AI. Unlike earlier datasets, such as those listed in Table 1, HoloAssist stands out due to its multi-person, interactive task-execution setting. Human interaction during task execution provides a valuable resource for designing AI assistants that are anticipatory and proactive and that can provide precisely timed instructions grounded in the environment, in contrast with current “chat-based” AI assistants that wait for you to ask a question. This unique scenario is ideal for developing assistive AI agents and complements existing datasets, which contribute rich knowledge and representation.

Table 1: Comparison of related datasets and simulation platforms. HoloAssist features a multi-person assistive setting, which is a unique addition to existing egocentric (first-person) datasets. The table compares nine related datasets and simulation platforms, giving for each the setting, whether it is collaborative and interactive, whether it is instructional and procedural, and the number of hours of video.

Finally, we evaluated the dataset’s performance on action classification and anticipation tasks, providing empirical results that shed light on the role of different modalities in various tasks. With this dataset, we introduce new tasks and benchmarks focused on mistake detection, intervention type prediction, and 3D hand pose forecasting, all crucial elements for developing intelligent assistants.

Looking forward

This work represents an initial step in broader research that explores how intelligent agents can collaborate with humans in real-world tasks. We’re excited to share this work and our dataset with the community, and we anticipate numerous future directions, such as annotating object poses, investigating object-centric models of affordance and manipulations in AI assistance, and AI-assisted planning and state tracking, among others. We believe HoloAssist, along with its associated benchmarks and tools, will benefit future research endeavors focused on building powerful AI assistants for real-world everyday tasks. You can access the HoloAssist dataset and code on GitHub.

Contributors

Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bohus, Ashley Feniello, Bugra Tekin, Felipe Vieira Frujeri, Marc Pollefeys

The post HoloAssist: A multimodal dataset for next-gen AI copilots for the physical world appeared first on Microsoft Research.

Intern Insights: Dr. Madeleine Daepp with Jennifer Scurrell and Alejandro Cuevas

October 5, 2023

by Alyssa Hughes Microsoft AI

Photos of PhD students Jennifer Scurrell and Alejandro Cuevas along with Senior Researcher Dr. Madeleine Daepp for the Microsoft Research Podcast

Every year, interns from academic institutions around the world apply and grow their knowledge as members of the research community at Microsoft. In this Microsoft Research Podcast series, these students join their internship supervisors to share their experience working alongside some of the leading researchers in their respective fields.

In this episode, PhD students Jennifer Scurrell and Alejandro Cuevas talk to Senior Researcher Dr. Madeleine Daepp. They discuss the internship culture at Microsoft Research, from opportunities to connect with researchers they admire over coffee to the teamwork they say helped make it possible for them to succeed in the fast-paced environment of industry, and the impact they hope to have with their work.

Learn more:

Automated Interviewer or Augmented Survey? Collecting Social Data with Large Language Models
Publication, September 2023

Transcript

[TEASER]

[MUSIC PLAYS UNDER DIALOGUE]

ALEJANDRO CUEVAS: I think one of my favorite things, which I will definitely miss outside, is in meetings, there’s this culture of sharing links and references as the meeting is going on, and all of a sudden, you end the meeting and you have this body of work in your chat log that you can go and, you know, sift through. 

JENNIFER SCURRELL: So I’m a political scientist and traditionally our dissertations look sometimes very different. So that means not really collaborating with other people on papers. With my dissertation, I’m kind of sometimes very alone. So one of the goals this summer was really to explore teamwork, and I will say, it was so much fun, especially in this really, really interdisciplinary team.

[TEASER ENDS]

MADELEINE DAEPP: Welcome to Intern Insights, a Microsoft Research Podcast featuring the brilliant students who are contributing to the research and advances at Microsoft as part of the renowned internship program at Microsoft Research. I’m Dr. Madeleine Daepp, a Senior Researcher at Microsoft Research, and today I’m talking with two of my interns, Jennifer Scurrell and Alejandro Cuevas Villalba, about their work and their experience this summer at Microsoft Research. Welcome, Jennifer and Alejandro!

JENNIFER SCURRELL: Thanks.

ALEJANDRO CUEVAS: Hi. Glad to be here.

DAEPP: So, of course, I know both of you, but many of the listeners do not. So can you tell us a little bit about yourselves? Let’s start with you, Jennifer.

SCURRELL: Yeah, sure. So I’m from Switzerland. I live in Zurich. I’m a PhD candidate at the Center for Security Studies at ETH Zurich. And I’m actually a political scientist by training. So I did my bachelor’s and my master’s at the University of Zurich in political science, and now I’m actually entering my last year of PhD-ing at ETH. And what I look at in my dissertation is how AI-enhanced bots influence political opinion formation online.

DAEPP: Awesome. Well, it’s been so great to have you, especially with that background, and we’ll talk a little bit about how relevant that is to some of the work all of us are doing this summer. Alejandro, how about you?

CUEVAS: Hi. Thank you, Madeleine. On my end, my name is Alejandro or Alejandro for our Latino listeners! I come from an infamous area of the world called the Triple Border, specifically Ciudad del Este, Paraguay. I grew up there all my life and came to the States then to study. I’m currently at Carnegie Mellon University studying societal computing. I’m going into my fifth year, and I would say I’m an interdisciplinary researcher by force—or by chance! It’s something that I happened to end up after all the work that I did in undergrad and then subsequently in PhD.

DAEPP: Can you tell me a little bit about what societal computing is?

CUEVAS: Yeah. Broadly, the way that I would define it and the way that I approach it is essentially how computer science technologies shape or impact society. And that is very broad but indeed is a very broad field.

DAEPP: Right. And it’s extremely relevant right now as we’re seeing society being reshaped by some of these new technologies. So I think a sort of next question that I have for both of you is, you know, there are so many things you could have done this summer, and I’m really curious how you ended up at Microsoft Research. So maybe, Alejandro, why don’t you start this time?

CUEVAS: Yeah, broadly, the story that I’ve told a few times already to you all, but I’m happy to share over and over again, is that I interned here in 2016 as a software engineer, and I happened to attend a research talk by MSR (Microsoft Research) at the time, and I was a sophomore in an undergrad at the time, and I thought I didn’t really know much about research, honestly. I didn’t know what a PhD really entailed. And so it was my first time sort of seeing this whole nother line of work that I could be doing. And even though, you know, I liked coding in C# at the time, I think the … seeing how researchers in CS were working on mosquitos for whatever reason at Microsoft of all places was just eye opening. And so after that, I started doing research at school, and two of the mentors that I had in undergrad, they had met at MSR in an internship a little while ago, and they both spoke very highly of it. They both had a great time. And it always seemed like whenever I went to conferences or saw papers that I liked, I kept stumbling upon researchers that I admired and that, you know, were doing super impactful work, and it almost seemed like a rite of passage that all these people went through. And, you know, I said, I want that. I want to experience that, as well. So I waited a little bit for the pandemic to dwindle down and not have the remote work again because I knew I wanted to be here in person. And then once that happened, I tried my best to, to, to get here. And just yesterday, I shared with my colleague Eva Brown, who is another intern in our group, and I was telling Eva and Jennifer, who’s here in our talk, as well, that I had to cold … you know, I went ahead and cold emailed Glen Weyl, who’s the director of the Plural Technology Collaboratory, and I said, hey, I’m super interested in this job position. I really want to be there. I have these ideas, and I hope you will consider my application. I really wanted to be here this summer.

DAEPP: That’s so fascinating, and I can really empathize with that feeling of sort of learning about Microsoft Research and then starting to see it everywhere. Like you see on conferences. There’s often, you know, a coauthor from Microsoft Research. You look at people’s backstories. And you start to realize sort of how many people across computer science, but also computational social science, also other fields, really sort of have some time that they spend in this sort of magical place. So, Jennifer, what about you?

SCURRELL: Yes, so I can only second that. So because I’m just entering my last PhD year, I have to decide at some point where I want to go. Do I want to stay in academia, or do I want to go to industry? And I said to myself, OK, to really find out what would suit me best, I really … I need to do an internship in, in the industry and especially, as well, because what we just discussed, right? You look at papers and see all these impactful papers with a lot of MSR researchers as authors. And so one day, my PhD colleague in my lab actually sent me the link to the internship advertisement. I was like, OK, this is just perfect. First of all, it’s Microsoft Research, which is one of the leading research facilities in tech industry. And second of all, the description was just … it was just matching. And I tried my best, and, yeah, now I’m here and I’m so happy that it all worked out!

DAEPP: That’s so funny. So just to give you both a little bit of context about some of those internship postings, you know, on our end, we sort of had this set of projects and areas that we were so excited about in the fall of 2022, and then of course in the spring of 2023, the world starts to change here at Microsoft Research, right. Like I start using ChatGPT every day and telling everybody in my family about it. In February, we get our first glimpse of GPT-4, right, through Bing Chat. And then I was sort of walking around and every day somebody is finding some new result about GPT-4 and theory of mind, or GPT-4 and discovery of causal graphs, or, you know, even that it can code 3D video games. And so we started to think, oh, there’s really an opportunity here, and we need to think about what our research is, what’s sort of most impactful. And so especially for a computational social scientist like me, it was sort of like there’s a new microscope, right. There’s this brand-new tool. What can we do? And so, for Alejandro, I think you applied to this sort of like web3 cryptocurrency project with Glen Weyl, and then you came in and I told you I wanted you to work with large language models. Tell us a little bit about how you brought a growth mindset to that context.

CUEVAS: Yeah, it’s really … it’s really funny that you’re mentioning that. I remember going through the interviews at the time and I kept wondering, why are they asking me so much about large language models? Because from my conception, I still thought that I was coming to work on web3 and DAOs—decentralized autonomous organizations—and I had my whole pitch around web3. But then all the interview questions were around large language models, but luckily, I mean, I spend a good amount of time observing internet phenomena, and it was hard to miss the LLM sort of rise, and I was happy that I could answer all those questions. But indeed, I came here, and it was a big emphasis on figuring out a project that was quite different from what I had originally envisioned. But I didn’t really mind. I mean, I wanted to do this internship either way, and throughout my research career, I had the privilege of always being able to tackle different types of problems. I think, just in undergrad, I worked all the way from usable security on one side and all the way to systems security on the other side, and so jumping headfirst into a new problem was just an average PhD day. You know, it’s just another project that needs to be thought through and scoped out and carried out. And I don’t think it ever detracted me too much that it was something that I was not super familiar. Of course, it’s always difficult to venture into a domain that you have a lot to catch up. But it’s also a fun and rewarding experience. It’s a body of literature that I got the opportunity to explore around really talented individuals. And a lot of the literature search that I would have done on my own was facilitated by all these people that were already thinking about these things. So it was a very fast-paced opportunity to acquire a whole new skill set that I think would be relevant to the rest of my work. 
And so when you mention growth mindset, I think it was just really an opportunity to fully come in and absorb an area that I know it’s going to be impactful—I know it’s going to have benefits on other areas of my work—and, you know, let’s go at it!

DAEPP: Can I just jump in there? I think something that’s so interesting in what you just said was, you know, sometimes when you’re doing your PhD, you’re really sort of doing a literature review, right. You’re really sort of out there searching for article after article, reading textbook chapters, trying to get up to speed. Maybe you get to take a class. But here what I saw you do was you sort of went door knocking, right. You said, oh, you know, here’s the world’s expert in this area. Or, hey, I need to do something conversationally interactive—who do I go talk to? So can you talk a little bit more about sort of what it looked like to, to ask other researchers, ask other PhD students, ask other interns for help?

CUEVAS: Yeah. Absolutely. I think one of my favorite things, which I will definitely miss outside, is in meetings, there’s this culture of sharing links and references as the meeting is going on, and all of a sudden, you end the meeting and you have this body of work in your chat log that you can go and, you know, sift through. And people did that so much at the beginning and throughout the internship that we would be talking about some topic and there will be three references on the chat, you know, regarding that topic, three papers to read. And I think that was really, you know, it was eye opening in the sense that I didn’t have to, you know, go to Google Scholar and figure out where to find papers. I … there was a network of people around me already that had all this expertise. And so I think it played a phenomenal role from the get-go of facilitating those connections. I think within my first week or two, you had already introduced me to two researchers that were working on areas that I thought were interesting and also that were relevant to the work that I wanted to do. And so having sort of that, from the beginning, really made it easy to then feel comfortable and, you know, empowered to just reach out. You know, that’s something that, from the get-go, people would tell me, you know, just reach out. They’re around the corner. Just go and have a meeting with them. And so, I figured, OK, you know, let’s do it. Why not?

DAEPP: Jennifer, that’s something that you did a lot of as well. You reached out, you sort of asked people for coffee chats. Can you talk a little bit about that experience on your end?

SCURRELL: Exactly. I just wanted to say, can I jump on this? So it was amazing. The first week, we had our initial meeting, and you said, hey, this is … this summer is an opportunity—reach out to people. Network. I was like, OK, uh, sounds easier than it might be, question mark? But it turned out to be so easy! It was really, you know, you just drop an email, you drop a, a message in Teams: hey, I’m interested in your work. I want to learn more. Can we go for a coffee? And next day, I was actually really with this, you know, super-famous researcher drinking coffee in the Building 99 atrium! And I learned so much in 20 minutes I could not read, I don’t know, 10 papers about the issue I wanted to discuss with this person. It’s really amazing how especially also open people are to, yeah, just meet and chat. And that was a really great experience, and I really enjoyed it a lot.

DAEPP: Alejandro, you also had a researcher that you wanted to get coffee with. Tell us a little bit about seeing that researcher in the halls versus how you finally decided to ask them to meet?

CUEVAS: Oh, that was so nerve wracking. This was Cormac Herley, who was just a few doors down from my office, and I would see him every day almost at the kitchen when I would go get coffee and I would look at him; he would look at me. We would do, you know, like a passive nod at the beginning. And eventually—I think only eight weeks in—I finally sort of decided to say, “Hi, hello, my name is Alejandro,” and I completely butchered actually that introduction because I was so nervous. I think my Apple Watch, you know, gave me the “high heart rate” notification. And I think I even said, “Hi, I’m, I’m Cormac!” or something like that. And he was so confused! And I just said, you know, like, I really like this work, and this work, and this work. And he was like, why are you citing this work from 2009? And I was like, it’s just so good, you know? And we happened to sit down right then and there, and I think we chatted for about an hour. Again, just being able to see somebody that you, you admire their work and you, sort of, admire their ideas and then sit down with them, have coffee, interact, chat, and see sort of how they’re thinking about future research problems … I think that was a very exciting thing, and I think it’s one of the things that was in my bucket list, and then I definitely crossed off from the summer!

DAEPP: I think this really speaks to the power also of the in-person internship. This is my first year as an in-person mentor, and I’ve had wonderful experiences with remote intern groups, but this has been really special because I think we’ve been able to, you know, have these sort of in-person meetings and also a very intense bug bash day where we were all sort of in the same room trying to resolve a problem that was very time sensitive. And so one of the things that I think makes this group a little special—although I think it’s actually true of many, many intern groups—is how incredibly interdisciplinary you all are. Like I think of you, Alejandro, really as a computational social scientist, somebody with sort of deep computer science skills. Jennifer of course you have this training in political science. Our other intern, Eva Maxfield Brown, is in an informatics PhD program, and I am a PhD urban planner. So we have really different skill sets and perspectives that we’re bringing in. And so I was wondering maybe, Jennifer, for you, if you could talk a little bit about, sort of, that teamwork experience and how we started to build these collaborations across disciplines.

SCURRELL: Sure. So I’m a political scientist, and traditionally our dissertations look sometimes very different. So I do, for example, I write a monograph. So that means not really collaborating with other people on papers. I’m, with my dissertation, I’m kind of sometimes very alone. And so, one of the goals this summer was really to explore teamwork. And I will say it was so much fun, especially as you said, Madeleine, in this really, really interdisciplinary team. And maybe an anecdote … because interdisciplinary work sometimes could be difficult, right, because we’re coming from different disciplines, we do not talk the same languages. And so in the first few weeks, I had so many meetings with our colleague, fellow intern, Eva, just to discuss projects and to understand, while talking about these projects and ideas and so on, what we’re actually really meaning when we are talking about these issues. So she’s really coming from this computer-sciencey background. I’m coming from the social sciences. And we had many, many sessions before we actually started to understand each other, what we’re actually talking about!

DAEPP: Are there particular words that stand out to you as you were using them differently, like particular barriers to understanding?

SCURRELL: Maybe it was more way of thinking … yeah.

DAEPP: I think there’s definitely a strong, sort of, you know, I, I think some folks come from this very sort of prototyping-development orientation and some folks come from this very strong research orientation. And so that cross-pollination between all of you has been quite magical because I think we’re seeing the builders do some really beautifully designed research and the researchers really think about values in building.

SCURRELL: I think it’s, it’s exactly that; that maybe it’s more the approach like the more data science-y, inductive approach versus the theoretical approach like deductive approach, and, yeah, I really have to say it was so interesting because everybody always listened to each other. I really have to say that. That was one of the things I really appreciated so much. You had so many brainstorming sessions and everybody was listening to each other. And at some point, we also started to understand each other and that was just, yeah, amazing!

DAEPP: So I, I do want to just sort of take you up just a notch and ask you a little bit about the pace that you’ve been working at. Alejandro, you set your goal of building an entire large language model-based prototype, deploying it on Azure, getting it out as a user study with over 450 Microsoft employees, and then now you’re writing up, you know, analyzing and writing up the results. So that’s a lot in three months! And, Jennifer, you are, you know, doing interviews with really sort of elite political stakeholders about artificial intelligence and its impacts on political opinion formation. And so you are running from interview to interview. You know, again, you are doing an entire study about a major timely topic in the span of three months while also making sure that Alejandro has good survey skills and that Eva has a clear theoretical approach, and so I wanted to ask, I think maybe Alejandro first, how does this pace compare? How do you like this pace of work?

CUEVAS: I think in general, the way that I approach work at least is through sprints, and I think that this summer was definitely a moment to sprint. I thought, you know, based on everything that I’ve said so far, it was a summer that I came super-excited whatever the work was going to be and whatever skill sets I needed to bring to the table. I think it was a summer that I wanted to sprint. And it started off, you know, already sprinting. I had the first week of work, I actually had a deadline that I submitted a paper for just, just before we started. And … or …

DAEPP: So that was an external paper? That was not here …

CUEVAS: Right. It was a paper that I … it was for my PhD. It just happened to be the deadline right in the middle of like onboarding, you know, the first week, and I remember, you know, coming already off that and sort of hit the ground running. I was already in that, in that state. And I wanted to do that as a challenge for myself. I wanted to take this opportunity to grow and, you know, learn all these skills that, that were in front of me. And I think the pace was very fast. I think I played a big role dictating that pace. But at the same time, the people around me were able to also match that pace. And I think in particular, a lot of the work that we did together in July felt, you know, truly collaborative and truly exciting because we were able to not only keep each other accountable, but it’s the same sort of feeling that you have when you’re running, you know, next to somebody and that you don’t want to give up because the other person is also, you know, running really hard and you’re keeping each other sort of like motivated. And I think that’s something that I got from the team as a whole. You know, there were people around me that were really excited to do and accomplish, you know, great projects during the summer. You know, looking back, it was an ambitious undertaking, and I think it was … I’m glad that it worked out and that it went well and that we had the results that we had. But I want to also lift up the fact that, you know, through the process, you played a big role in sort of like encouraging that development. You know, a lot of things that you would tell me in July were, you know, you’re working hard; you’ll be judged according to your hard work, not your results. And I think having that peace of mind was also just like, you know, it was a, a good thing to have in the back of my mind. Not only that, the reassurance of it’s not results driven, but the fact itself that you had visibility on the hard work that I was putting. 
And that made it more encouraging because I knew that you acknowledged and you recognized that, and that played a big role in me wanting to keep putting that effort in.

DAEPP: I was a little worried about how hard you were working! I was like, please have boundaries, you know—go out, play in Seattle! [LAUGHS] All of you, especially, you know, the two of you and Eva Maxfield Brown, but also all of the interns that we get, are so talented and so ambitious and so driven that the summer can be just really exciting. It’s just so magical to be around that energy. Thank you for bringing that. Jennifer, what about you? What was your experience of pace?

SCURRELL: Yeah, OK. I really have to say … so I never worked in industry before, so this pace was really something different than I was [LAUGHS] used to because in my PhD, as I said, I do a monograph. I’m, I’m responsible for my own project and my own time, so I can really take time to, to think about things almost as long as I want, right. I mean, OK, my supervisor has some say there, but … and I was aware that coming here, doing this internship, there will be a different pace, but I never expected it to be so fast. But I must say I really liked it, and I was never scared or stressed or anything because it’s exactly what also you said, Alejandro. I knew people have my back, and it was always a team effort, and everybody was always so passionate and motivating and it really … I felt so at ease even though there were so many stressful situations, but it was always so much fun because we went through it together.

DAEPP: That’s so nice to hear. So, you know, you sort of compared it to your PhD, but Alejandro, you also have a startup. Can you tell us a little bit about that and also how that informs some of the insights you brought to us about pace and intensity and stress and emotional management from that world?

CUEVAS: So Redoux is a fragrance company, and it’s something a little bit different from what I do as a PhD—at first glance, I would say. I’ve been able to in this past four and five years to find a lot of overlap within the, the work that I do at Redoux as well as in my PhD work. We talk a lot about interdisciplinarity in the research that we do within our academic world. But at times, I like to extend interdisciplinarity as well to other aspects just that are beyond academia, as well. Those could include your lived experiences. Those inform the type of work that you do, as well. Those definitely inform mine, and the work that I do at Redoux informs the way that I think about problems, as well. It informs the way that I think about or approach creativity. It helps me learn about storytelling, as well. It helps me learn about what another side of the world cares about or a whole different segment of people or set of problems. What I derive from, from these experiences is that there’s a hard problem, there’s a project, there’s scope, and there is an effort to do your best in whatever realm that looks like. And when I say, you know, there’s that part of how to think about things, but even here at Microsoft, there was a talk about using machine learning to discover new scents and new fragrances, and that is work that is directly related to the work that I care about in the Redoux world. So I think at the end of the day, I keep finding these overlaps between these two worlds, and I take the opportunity to merge them and inform the way that I approach problems as much as I can. And, yeah, it’s been, it’s been rewarding. It’s been tough to handle, but it’s also been, I think, a very formative part of how, I think, I communicate and how I think about broadly creativity and, and my work.

DAEPP: Yeah. I do want to lift up … I love this sort of insight about how I think all, all of you as interns really bring sort of these multifaceted identities into, you know, partly into the work, partly into the team, but also into your social lives. So, Jennifer, for you, tell us a little bit about how you’ve been spending your weekends.

SCURRELL: So, yeah, funny story here. Like, I mean, we are like 10 weeks in the internship now, almost at the end, and only I think three weeks ago or so I was so proudly saying, “Hey, Madeleine, I spent last weekend … I discovered Seattle!” And you were like, wait, it’s already six weeks into your internship. What did you do all the other weekends? I was like, I went hiking! So most of the times … it was really, really amazing to grab, for example, the other interns and go drive two hours, for example, to Mount Rainier and have a wonderful hike there. Or I also went, for example, to Mount Saint Helens. They are these huge lava caves. And, yeah, I went caving and, yeah, hit my head, but everything went fine. [LAUGHS] I’m still here. And no, it’s really amazing what you can do here, especially in nature. But also on campus of course what I really appreciated always were the ice cream socials, for example. And I will really dearly miss them!

DAEPP: There were a lot of ice cream socials for the internship community this summer, and they were wonderful.

CUEVAS: Don’t forget about Puzzle Day, too, Jennifer.

SCURRELL: Oh! Oh yeah, Puzzle Day! I mean, we became fifth, right, out of 65, I think, groups? Yeah, that was really cool.

DAEPP: Wait, tell me about Puzzle Day. I wasn’t there.

CUEVAS: So Puzzle Day was this event organized by Microsoft where you put a team of very, very talented and very smart individuals together and threw really hard puzzles at them. And we managed to assemble a very good team from fellow MSR interns, and we bunkered down and went at really tough puzzles for about five hours. And as you mentioned, we came out fifth out of plenty of teams. I think we could have officially if … you know, we should have come out first or second.

SCURRELL: Yeah … [LAUGHTER]

CUEVAS: But that’s a story for another day! But that was one of the fun things, I think, at least Jennifer and I and a few other people got to do as a nice bonding experience. I think apart from that, something that I had the chance to do a lot during the summer was have friends that came and visited me and that we got to explore Seattle with people outside of town. And that was just fun in general, having somebody from, you know, another side of the country come here and discover together a new neighborhood or a new restaurant or a new store, a new museum. I think one of my favorite ones there was when Amelia, my partner, came out here, we discovered a hidden beach in the northwest side of Bellevue. It was a hidden beach access that is not marked. And I think Seattle has this little sort of mysteries or like little hidden spots around town that we got the chance to discover. And I think that was really one of my favorite things during the summer.

SCURRELL: And can I add one more thing? Not to forget about July 4th, our fellow intern Eva invited us to spend Fourth of July with her and her friends. So we also met some locals, and we had a barbecue and we saw the fireworks and that was really, really also a very nice experience.

DAEPP: I think there’s nothing that makes me happier than when I see all of you the next morning and you say, oh, we all had dinner with all of the interns last night. Because I know for me, with my graduate school sort of research experiences, those connections to other interns are the things that have lasted for, you know, that now I look for those people at conferences. I cite their papers. Those are such, such wonderful connections. And so it’s been really cool to see all of you starting to build that sort of relationship, and I’m hopeful that that will, you know, be sort of a real source of value going forward. So I am curious for both of you, based on your experience, would you recommend this internship to other folks considering Microsoft Research?

SCURRELL: I mean, yes, of course! No, definitely. It’s just so amazing with like who you can work with, what resources you have, the whole organization, all the events you can, you can attend. And it’s really … besides learning technical skills, but also I really learned a lot about myself. And I really also take a lot back to Switzerland not only for my PhD but also for my general life. I really learned a lot here.

DAEPP: Can you say what you mean by that?

SCURRELL: It’s just this open-mindedness. I really must say … so in the coffee talks, but also in … we have had so many brainstorming sessions with so many different groups and this, this respect for each other that you’re listening, you’re truly interested in discussing things with your colleagues, and you’re getting so much valuable input. We’re all in the boat together.

DAEPP: I’ve certainly also experienced this, a strong culture of respect for one another, like especially across disciplines.

SCURRELL: Yes.

DAEPP: But then it’s also … we had this session, this weekly session, called “Show Us Your New Clunky Prototypes” by the human-computer interaction team, where people would show very early-stage demos, and it wasn’t considered “demo-able” unless it broke, right. And, Alejandro, you had a, had a quite clunky prototype in exactly that sense. But that to me is exactly sort of this example of how, how much people gave actionable, useful feedback to very clunky, very early-stage ideas, really sort of thinking about what, what is this trying to be? What can this be? And emphasizing what was good about it. Is, is that your experience?

SCURRELL: Yeah. I also … I must second that it’s really this not having to worry to fail or putting something very rough out there. People still, you know, take you serious [LAUGHS] and want you to succeed. And first, I was really nervous at the beginning of my internship because, you know, all the very famous and like super-interesting and super-successful researchers you’re around, but I soon really relaxed because I understood that people really are interested in working together successfully and supporting each other. And that was really, really nice to experience.

DAEPP: Alejandro, what about you? You had looked forward to being at Microsoft Research really for years before you came here. Did it live up to your expectations?

CUEVAS: Yeah, I think I would recommend it to anybody who wants to work with a cohort of really, really talented individuals, anybody who wants to create some long-lasting friendships and professional connections, or anybody who wants to be part of a pretty solid network of alumni interns, which I think is pretty remarkable. Or anybody who wants to come and work in a fast-paced environment with a ton of really talented individuals across numerous, numerous disciplines. And if anything from that list, I think, appeals to anybody, I think Microsoft Research is a very good place to come and spend your summer at.

DAEPP: So last question for both of you. It’s a big question. What impact do you want your work to have going forward? I mean, you are both really exciting young scholars. I’m so grateful to be a part of your journey. Can you tell me just, just think a little bit forward. What impact do you want to have with your work, and what are you hoping this next stage of your career looks like?

CUEVAS: Something that I’ve been thinking a lot and wanting to venture as I head into some of my last years of PhD is the dissemination of my work, but not necessarily in the traditional academic sense where we find other academics who find our work interesting, but in the sense that I think I want to work on problems that interest and appeal to the broader public and something that is communicated to and accessible to people who are outside of academia. I want to position the work that I’m doing particularly, you know, around cryptocurrencies and around what I like to say like the back alleys of the internet … I want to have that work accessible and to create an appeal to a broader public. I think something that, when I was younger, I wish I knew was … or I wish I had more exposure to what research was and what a PhD was, particularly coming from Paraguay, right, a place that, I shared with you, when I was in high school, had one patent a year. A place where I went to my first science fair when I was a senior in high school. And I want my work to be not only accessible from a place of interest but accessible in a way that it can tell a story that people find so compelling that they feel inspired to work in something similar or inspired to, to go out and investigate phenomena that they think is fascinating. And I want to be able to transmit this excitement and transmit sort of this joy, I guess, of working on something that interests you with the broader public and hopefully appeal and again inspire somebody who may never have thought that this is a career that they could consider or a path that is potentially open to them.

DAEPP: Yeah, I really hope that you are the first of many brilliant Paraguayan computer scientists who come here for the internship!

CUEVAS: I hope so, too.

DAEPP: Jennifer, what about you?

SCURRELL: So for me actually, it would be important that my work—because I’m a political scientist, social scientist, I’m working specifically on sociotechnical problems—that my research also really, you know, gives something back to society and that I also really have, or that my work or our work, really has impact on a higher level to solve the big problems we have in this world, specifically, for example, with my work regarding how AI impacts political opinion formation; regarding, for example, next elections that I can give some input how we could do better with AI, with technology, to solve these societal problems. And especially what I also want to say, as a first-generation academic, I hope to inspire people that are like me to, yeah, reach high!

DAEPP: Well, thank you so much to both of you. I could not be more grateful for such a wonderful team and collaboration. And I really do think you’re going to make major impacts on important sociotechnical problems, on showing new methods with large language models, and I’m excited to advocate for your career going forward.

[OUTRO MUSIC]

Thank you for a wonderful summer.

CUEVAS: Thank you so much.

SCURRELL: Thank you.


Accelerate Foundation Models Research: Supporting a global academic research ecosystem for AI

October 3, 2023

by Alyssa Hughes Microsoft AI

The latest advances in artificial intelligence have sparked broad public interest and excitement, and the sciences are no exception. Increasingly capable foundation models are fuelling a fundamental shift in computing research, natural sciences, social sciences, and even computing education itself. As industry-led advances in AI continue to reach new heights, Microsoft Research believes that a vibrant and diverse research ecosystem is essential to realizing the promise of AI. This means ensuring that the academic research community, and especially researchers working outside computer science, can tap into these capabilities. Their depth and breadth of expertise across disciplines, cultures and languages can contribute meaningfully to our ability to use AI to address some of the world’s greatest technical, scientific, and societal challenges.

To this end, Microsoft Research has established Accelerate Foundation Models Research (AFMR), a new initiative that brings together an interdisciplinary research community to pursue three goals:

Aligning AI with shared human goals, values, and preferences via research on models, which enhances safety, robustness, sustainability, responsibility, and transparency, while also exploring new evaluation methods to measure the rapidly growing capabilities of new models.
Improving human interactions via sociotechnical research, which enables AI to extend human ingenuity, creativity and productivity, while also working to reduce inequities of access and working to ensure positive benefits for people and societies worldwide.
Accelerating scientific discovery in natural sciences through proactive knowledge discovery, hypothesis generation, and multiscale multimodal data generation.

AFMR is a global research network and a resource platform that enables researchers in computer science and many other disciplines to engage with some of the greatest technical and societal challenges of our time. This includes a grant program that provides access to state-of-the-art foundation models hosted through Microsoft Azure AI.

The goal is to foster more collaborations across disciplines, institutions, and sectors, and to unleash the full potential of AI for a wide range of research questions, applications, and societal contexts.

Following a successful pilot program and initial call for proposals (CFP), details of which are provided below, we are committed to continuing this work and expect to solicit additional proposals throughout the coming year. Visit the AFMR site to learn more about upcoming programs and events, read peer-reviewed work that has resulted from the program, and find resources to accelerate research and collaborations.

Inspiring research in the era of AI

When ChatGPT was released in the fall of 2022, it quickly became clear that this new technology and tool would play a central role in AI computing research and applications.

“As a natural language processing (NLP) researcher, I was excited at first by ChatGPT’s potential to stimulate an AI revolution,” said Evelyne Viegas, senior director of research engagement at Microsoft Research. “Soon, I became concerned about a potential lack of access to this resource outside of industry, which could delay important progress in academic settings.”

When Microsoft enabled access to OpenAI models (Embeddings series, GPT-3.5-Turbo series, and GPT-4 series) via the Azure AI services, it created an opportunity to engage with the academic community to learn about their needs and aspirations and start enabling them. A team at Microsoft Research conducted a pilot program offering model access to a small number of participants, and the success of this effort inspired a broader and more sustained program.

Research topics undertaken as part of the pilot reflect the ambitions of AI research at Microsoft in understanding general AI, driving model innovation, ensuring social benefit, transforming scientific discovery, and extending human capabilities across different domains (e.g., astronomy, education, health, law, society).

Although the research supported by this pilot is still underway, the examples below illustrate the possibilities of opening access to leading-edge models to a diverse group of researchers:

Integrating ChatGPT into English as a Foreign Language (EFL) Writing Education – Korea Advanced Institute of Science and Technology (KAIST)

This project explores how students can use generative AI for interactive revision in EFL writing. Because the majority of KAIST courses are given in English, the sooner non-English speakers can learn the language, the better they will be able to participate in their classes. While earlier chatbots have been used for EFL, language learners found them unengaging. With Azure OpenAI Service, the KAIST team is gathering data to show how the unique capabilities of a GPT-4-based chatbot are accelerating learning while making the learner’s experience more engaging.

Lightweight Adaptation of LLMs for Healthcare Applications – Stanford University

This work focuses on accelerating the task of report summarization for radiologists to improve workflow and decrease the time needed to generate an accurate report. It uses domain adaptation via pretraining on biomedical or clinical text, combined with discrete prompting or fine-tuning. Initial results are promising, showing the added value of using foundation models for some clinical tasks.

AI-Based Traffic Monitoring System using Physics-Informed Neural Networks and GPT Models – North Carolina A&T State University

Researchers are creating a traffic monitoring system using data collected from unmanned aerial vehicles (UAVs) to fine-tune foundation models for video analysis and traffic state estimation. This work can directly benefit transportation agencies and city planners, helping them understand traffic patterns, congestion, and safety hazards.

Forging New Horizons in Astronomy – Harvard University

This project seeks to enhance human interaction with astronomy literature using the capabilities of large language models (LLMs), particularly GPT-4. This work employs in-context prompting techniques to expose the model to astronomy papers and build an astronomy-focused chat application to engage the broader community.

Expanding AFMR

Much experimentation remains to be done with foundation models. The AFMR CFP invited the community to develop proposals focused on the goals and questions below:

Aligning AI systems with human goals and preferences
Advancing beneficial applications of AI
Accelerating scientific discovery in the natural and life sciences

The response to the AFMR Fall CFP has been phenomenal, with close to 400 proposals from 170 universities across 33 countries.

“Research undertaken by the principal investigators brings the promise to advance research across a greater breadth of research pursuits, application domains, and societal contexts than we could have imagined,” Viegas said. “It covers a vast range of scientific and sociotechnical topics: creativity, culture, economy, education, finance, health, causality, evaluation, augmentation and adaptation, multimodal, responsible AI, robotics, scientific discovery, software and society. It is inspiring to see experts from different countries with different cultures, languages, institutions, and departments, including computer science, social science, natural sciences, humanities, medicine, music, all come together to work on democratizing AI and work on solving some of the greatest technical and societal challenges of tomorrow.”


Research Focus: Week of September 25, 2023

September 27, 2023

by Brenda Potts Microsoft AI

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Research Focus 25 | Week of September 25, 2023

NEW RESEARCH

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

Large language model (LLM) inference consists of two distinct phases: a prefill phase, which processes the input prompt, and a decode phase, which generates output tokens autoregressively. While the prefill phase effectively saturates graphics processing unit (GPU) compute at small batch sizes, the decode phase results in low compute utilization as it generates one token at a time per request. The varying prefill and decode times also lead to imbalance across micro-batches when using pipeline parallelism, resulting in further inefficiency due to bubbles.

In a new paper: SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills, researchers from Microsoft present a solution to these challenges that yields significant improvements in inference performance across models and hardware. SARATHI employs chunked-prefills, which split a prefill request into equal-sized chunks, and decode-maximal batching, which constructs a batch using a single prefill chunk and populates the remaining slots with decodes. Chunked-prefills allow constructing multiple decode-maximal batches from a single prefill request, maximizing coverage of decodes that can piggyback. Furthermore, the uniform compute design of these batches ameliorates the imbalance between micro-batches, significantly reducing pipeline bubbles.
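The two ideas can be illustrated with a small scheduling sketch. This is not the paper's implementation: the names `chunk_prefill`, `Batch`, and `decode_maximal_batches` are hypothetical, and the sketch assumes that each decode contributes one token per batch, that one slot per batch is reserved for the prefill chunk, and that a final shorter chunk is allowed when the prompt length is not a multiple of the chunk size.

```python
from dataclasses import dataclass, field


def chunk_prefill(prompt_len: int, chunk_size: int) -> list[int]:
    """Split a prefill of prompt_len tokens into chunks of chunk_size tokens.

    Assumption: if prompt_len is not a multiple of chunk_size, the last
    chunk is simply shorter.
    """
    if prompt_len < 0 or chunk_size <= 0:
        raise ValueError("prompt_len must be >= 0 and chunk_size must be > 0")
    full, rest = divmod(prompt_len, chunk_size)
    return [chunk_size] * full + ([rest] if rest else [])


@dataclass
class Batch:
    prefill_tokens: int                      # size of the single prefill chunk
    decode_request_ids: list[int] = field(default_factory=list)

    @property
    def total_tokens(self) -> int:
        # Each piggybacked decode contributes exactly one token.
        return self.prefill_tokens + len(self.decode_request_ids)


def decode_maximal_batches(
    prompt_len: int,
    chunk_size: int,
    decode_request_ids: list[int],
    batch_slots: int,
) -> list[Batch]:
    """Build one batch per prefill chunk and fill the other slots with decodes."""
    free_slots = batch_slots - 1             # one slot is taken by the prefill chunk
    if free_slots < 0:
        raise ValueError("batch_slots must be at least 1")
    piggybacked = list(decode_request_ids[:free_slots])
    return [Batch(chunk, list(piggybacked)) for chunk in chunk_prefill(prompt_len, chunk_size)]


if __name__ == "__main__":
    for batch in decode_maximal_batches(1024, 256, [1, 2, 3, 4, 5], batch_slots=4):
        print(batch.prefill_tokens, batch.decode_request_ids, batch.total_tokens)
```

In this sketch, every batch carries the same amount of work (one chunk plus the same number of decode tokens), which is the property the paragraph above credits with reducing imbalance between micro-batches.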

Read the paper

NEW RESEARCH

DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

Controllable video generation has gained significant attention in recent years. However, two main limitations persist: Firstly, most existing works focus on either text, image, or trajectory-based control, leading to an inability to achieve fine-grained control in videos. Secondly, trajectory control research is still in its early stages, with most experiments being conducted on simple datasets like Human3.6M (opens in new tab). This constraint limits the models’ capability to process open-domain images and effectively handle complex curved trajectories.

In a new paper: DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory, researchers from Microsoft propose an open-domain diffusion-based video generation model. To tackle the issue of insufficient control granularity in existing works, DragNUWA simultaneously introduces text, image, and trajectory information to provide fine-grained control over video content from semantic, spatial, and temporal perspectives. To resolve the problem of limited open-domain trajectory control in current research, the researchers propose trajectory modeling with three aspects: a trajectory sampler (TS) to enable open-domain control of arbitrary trajectories, a multiscale fusion (MF) to control trajectories in different granularities, and an adaptive training (AT) strategy to generate consistent videos following trajectories. Their experiments demonstrate DragNUWA’s superior performance in fine-grained control in video generation.

DragNUWA is purely a research project and there are no current plans to incorporate DragNUWA into a product. Any further research will continue to follow Microsoft AI principles.

NEW RESEARCH

Seeing through the Brain: Image Reconstruction of Visual Perception from Human Brain Signals

Understanding cortical responses to human visual perception has emerged as a research hotspot. Yet, the underlying mechanism of how human visual perceptions are intertwined with our cognitions is still a mystery. Thanks to recent advances in both neuroscience and artificial intelligence, researchers have been able to record the visually evoked brain activities and mimic the visual perception ability through computational approaches.

In a new paper: Seeing through the Brain: Image Reconstruction of Visual Perception from Human Brain Signals, researchers from Microsoft reconstruct observed images based on portably accessible brain signals, i.e., electroencephalography (EEG) data. Since EEG signals are dynamic in the time-series format and are notoriously noisy, processing and extracting useful information requires more dedicated efforts. The researchers propose a comprehensive pipeline, named NeuroImagen, to incorporate a novel multi-level perceptual information decoding to draw multi-grained and heterogeneous outputs from the given EEG data. A pretrained latent diffusion model then leverages the extracted semantic information to reconstruct the high-resolution visual stimuli images. The experimental results illustrate the effectiveness of image reconstruction and superior quantitative performance of the proposed method.

Read the paper

The post Research Focus: Week of September 25, 2023 appeared first on Microsoft Research.

AutoGen: Enabling next-generation large language model applications

September 25, 2023

by Alyssa Hughes Microsoft AI

“Capabilities like AutoGen are poised to fundamentally transform and extend what large language models are capable of. This is one of the most exciting developments I have seen in AI recently.”

Doug Burger, Technical Fellow, Microsoft

Figure 1 shows three shaded boxes, each containing symbols that represent AutoGen agents and the large language models, tools, and humans that comprise them, and illustrates how AutoGen agents can converse to solve tasks. — Figure 1. AutoGen enables complex LLM-based workflows using multi-agent conversations. (Left) AutoGen agents are customizable and can be based on LLMs, tools, humans, and even a combination of them. (Top-right) Agents can converse to solve tasks. (Bottom-right) The framework supports many additional complex conversation patterns.

It requires a lot of effort and expertise to design, implement, and optimize a workflow that can leverage the full potential of large language models (LLMs). Automating these workflows has tremendous value. As developers begin to create increasingly complex LLM-based applications, workflows will inevitably grow more intricate. The potential design space for such workflows could be vast and complex, thereby heightening the challenge of orchestrating an optimal workflow with robust performance.

AutoGen is a framework for simplifying the orchestration, optimization, and automation of LLM workflows. It offers customizable and conversable agents that leverage the strongest capabilities of the most advanced LLMs, like GPT-4, while addressing their limitations by integrating with humans and tools and having conversations between multiple agents via automated chat.

With AutoGen, building a complex multi-agent conversation system boils down to:

Defining a set of agents with specialized capabilities and roles.
Defining the interaction behavior between agents, i.e., what to reply when an agent receives messages from another agent.

Both steps are intuitive and modular, making these agents reusable and composable. For example, to build a system for code-based question answering, one can design the agents and their interactions as in Figure 2. Such a system is shown to reduce the number of manual interactions needed by 3x to 10x in applications like supply-chain optimization. Using AutoGen leads to more than a 4x reduction in coding effort.

Figure 2 illustrates an example workflow with dotted-line relationships between three AutoGen agents—Commander, Writer, and Safeguard—and how the agents work together to answer code-based questions from users. — Figure 2. An example workflow to address code-based question answering in supply-chain optimization (opens in new tab). The Commander receives user questions and coordinates with the Writer and Safeguard. The Writer crafts the code and interpretation, the Safeguard ensures safety, and the Commander executes the code. If issues arise, the process can repeat until resolved. Shaded circles represent steps that may be repeated multiple times.
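The two steps can be illustrated with a small, library-free sketch of the Commander, Writer, and Safeguard workflow. This deliberately does not use the AutoGen API: `ToyAgent`, the reply functions, and the string-matching safety check are hypothetical stand-ins, and a real system would back the Writer and Safeguard with LLM calls.

```python
class ToyAgent:
    """Step 1: an agent is a name (its role) plus a capability (reply_fn)."""

    def __init__(self, name, reply_fn):
        self.name = name
        self.reply_fn = reply_fn

    def reply(self, message):
        return self.reply_fn(message)


def writer_reply(question):
    # Stand-in for LLM-generated code that answers the question.
    return f"print({question!r})"


def safeguard_reply(code):
    # Stand-in safety check: flag code that imports the os module.
    return "DANGER" if "import os" in code else "SAFE"


def commander_run(question, writer, safeguard, max_rounds=3):
    """Step 2: the interaction behavior between the three roles."""
    for _ in range(max_rounds):
        code = writer.reply(question)
        if safeguard.reply(code) == "SAFE":
            return code  # a real Commander would execute the code here
    return None  # no safe answer was produced within max_rounds


writer = ToyAgent("writer", writer_reply)
safeguard = ToyAgent("safeguard", safeguard_reply)
print(commander_run("2+2", writer, safeguard))
```

Because the agents and the interaction loop are defined separately, either one can be swapped out without touching the other, which is the reuse property described above.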

Capable, conversable, and customizable agents – integrating LLMs, humans, and tools

AutoGen agents have capabilities enabled by LLMs, humans, tools, or a mix of those elements. For example:

One can easily configure the usage and roles of LLMs in an agent (automated complex task solving by group chat) with advanced inference features (e.g., optimize performance with inference parameter tuning).
Human intelligence and oversight can be achieved through a proxy agent with different involvement levels and patterns (e.g., automated task solving with GPT-4 + multiple human users (opens in new tab)).
The agents have native support for LLM-driven code/function execution (e.g., automated task solving with code generation, execution and debugging (opens in new tab), use provided tools as functions (opens in new tab)).

One straightforward way of using built-in agents from AutoGen is to invoke automated chat between an assistant agent and a user proxy agent. As an example (Figure 3), one can easily build an enhanced version of ChatGPT + Code Interpreter + plugins, with a customizable degree of automation, usable in a custom environment and embeddable in a bigger system. It is also easy to extend their behavior to support diverse application scenarios, such as adding personalization and adaptability based on past interactions (e.g., automated continual learning (opens in new tab), teach agents new skills (opens in new tab)).

Figure 3 shows the details of a chat between an assistant agent and a user proxy agent to illustrate how AutoGen automates such chats, while seamlessly engaging humans or using tools as needed to complete complex tasks. — Figure 3. A user proxy agent and assistant agent from AutoGen can be used to build an enhanced version of ChatGPT + Code Interpreter + plugins. The assistant agent plays the role of an AI assistant like Bing Chat. The user proxy agent plays the role of a user and simulates users’ behavior such as code execution. AutoGen automates the chat between the two agents, while allowing human feedback or intervention. The user proxy seamlessly engages humans and uses tools when appropriate.

The agent conversation-centric design has numerous benefits, including that it:

Naturally handles ambiguity, feedback, progress, and collaboration.
Enables effective coding-related tasks, like tool use with back-and-forth troubleshooting.
Allows users to seamlessly opt in or opt out via an agent in the chat.
Achieves a collective goal with the cooperation of multiple specialists.

AutoGen supports automated chat and diverse communication patterns, making it easy to orchestrate a complex, dynamic workflow and experiment with versatility. Figure 4 illustrates a new game, conversational chess (opens in new tab), enabled by AutoGen. Figure 5 illustrates how AutoGen supports group chats (opens in new tab) between multiple agents using another special agent called the “GroupChatManager”.

Figure 4 displays two small chessboards side-by-side, with black and white chess pieces in various positions on each board showing a game in progress, plus a chat between two users, to illustrate how AI, human, or hybrid users can play conversational chess. — Figure 4. An example of a new application enabled by AutoGen: conversational chess (opens in new tab). It can support various scenarios, as each player can be an LLM-empowered AI, a human, or a hybrid of the two. It allows players to express their moves creatively, such as using jokes, meme references, and character-playing, making chess games more entertaining to players as well as observers.

Figure 5 shows three shaded boxes, each containing symbols that represent various agents, to illustrate how AutoGen enables dynamic group chats. Each box represents a different step in the three-step process. — Figure 5. Overview of how AutoGen enables dynamic group chats (opens in new tab) to solve tasks: We use a special agent called the Manager that repeats the following three steps—select a single speaker (in this case Bob), ask the speaker to respond, and broadcast the selected speaker’s message to all the other agents.
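The three-step loop in Figure 5 can be sketched without any library. This is an assumption-laden toy rather than AutoGen's GroupChatManager: the `ChatAgent` class is hypothetical, and the round-robin speaker selection is a simple placeholder for whatever selection policy the real manager uses.

```python
class ChatAgent:
    """A hypothetical agent that records every message broadcast to it."""

    def __init__(self, name):
        self.name = name
        self.inbox = []

    def respond(self):
        # Stand-in for an LLM, tool, or human reply.
        return f"{self.name} saw {len(self.inbox)} message(s)"

    def receive(self, sender, message):
        self.inbox.append((sender, message))


def run_group_chat(agents, rounds):
    transcript = []
    for turn in range(rounds):
        speaker = agents[turn % len(agents)]   # 1. select a single speaker
        message = speaker.respond()            # 2. ask the speaker to respond
        for other in agents:                   # 3. broadcast to the other agents
            if other is not speaker:
                other.receive(speaker.name, message)
        transcript.append((speaker.name, message))
    return transcript


agents = [ChatAgent("Alice"), ChatAgent("Bob"), ChatAgent("Carol")]
transcript = run_group_chat(agents, rounds=3)
for name, message in transcript:
    print(f"{name}: {message}")
```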

Getting started

AutoGen (opens in new tab) (in preview) is freely available as a Python package. To install it, run

pip install pyautogen

You can quickly enable a powerful experience with just a few lines of code:

import autogen
assistant = autogen.AssistantAgent("assistant")
user_proxy = autogen.UserProxyAgent("user_proxy")
user_proxy.initiate_chat(assistant, message="Show me the YTD gain of 10 largest technology companies as of today.")
# This triggers automated chat to solve the task

Check examples for a wide variety of tasks: https://microsoft.github.io/autogen/docs/Examples/AutoGen-AgentChat (opens in new tab).

Next steps:

Use AutoGen in your LLM applications and provide feedback on Discord (opens in new tab)
Read about the research:
- AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework
- Cost-Effective Hyperparameter Optimization for Large Language Model Generation Inference

AutoGen is an open-source, community-driven project under active development (as a spinoff from FLAML (opens in new tab), a fast library for automated machine learning and tuning), which encourages contributions from individuals of all backgrounds. Many Microsoft Research collaborators have made great contributions to this project, including academic contributors like Pennsylvania State University and the University of Washington, and product teams like Microsoft Fabric and ML.NET. AutoGen aims to provide an effective and easy-to-use framework for developers to build next-generation applications, and already demonstrates promising opportunities to build creative applications and provide a large space for innovation.

Names of Microsoft contributors:

Chi Wang, Gagan Bansal, Eric Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, Ahmed Awadallah, Ryen White, Doug Burger, Robin Moeur, Victor Dibia, Adam Fourney, Piali Choudhury, Saleema Amershi, Ricky Loynd, Hamed Khanpour.

The post AutoGen: Enabling next-generation large language model applications appeared first on Microsoft Research.

Neural Graphical Models

September 20, 2023

by Brenda Potts Microsoft AI

This research paper was presented at the 17th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty, a premier forum for advances in the theory and practice of reasoning under uncertainty.


In the field of reasoning under uncertainty, probabilistic graphical models (PGMs) stand out as a powerful tool for analyzing data. They can represent relationships between features and learn underlying distributions that model functional dependencies between them. Learning, inference, and sampling are operations that make graphical models useful for domain exploration.

In a broad sense, learning involves fitting the distribution function parameters from data, and inference is the procedure of answering queries in the form of conditional distributions with one or more observed variables. Sampling entails the ability to extract samples from the underlying distribution as defined by the graphical model. A common challenge with graphical model representations lies in the high computational complexity of one or more of these operations.

Various graphical models impose restrictions on the set of distributions or types of variables in the domain. Some graphical models work with continuous variables only (or categorical variables only) or place restrictions on the graph structure, for example, the constraint that continuous variables cannot be parents of categorical variables in a directed acyclic graph (DAG). Other restrictions affect the set of distributions the models can represent, for example, only multivariate Gaussian distributions.

In our paper, “Neural Graphical Models (opens in new tab),” presented at ECSQARU 2023 (opens in new tab), we propose Neural Graphical Models (NGMs), a new type of PGM that learns to represent the probability function over the domain using a deep neural network. The parameterization of such a network can be learned from data efficiently, with a loss function that jointly optimizes adherence to the dependency structure, given as input in the form of a directed or undirected graph, and fit to the data. Probability functions represented by NGMs are unrestricted by any of the common restrictions inherent in other PGMs. NGMs can handle various input types: categorical, continuous, images and embedding representations. They also support efficient inference and sampling.

Figure 1 - The image on the left shows an undirected network graph with five variables: x1, x2, x3, x4 and x5. The variable x3 is connected to all other variables, and x1 is directly connected to x3 and x4 only. The annotation next to the nodes indicates that the value of each variable is a function of the values of its neighbors. For example, the value of x1 is a function of x3 and x4, the value of x2 is a function of x3, and so on. On the right, we see a table representing the adjacency matrix for the same graph, with both rows and columns labeled with variable names from x1 to x5. The cells show either ones or zeros. The ones indicate the presence of an edge, for example in the cell at the intersection of the row labeled x1 and the column labeled x3. — Figure 1: Graphical view of NGMs: The input graph G (undirected) for given input data X. Each feature \(x_i=f_i(\text{Nbrs}(x_i))\) is a function of the neighboring features. For a DAG, the functions between features will be defined by the Markov Blanket relationship \(x_i=f_i(\text{MB}(x_i))\). On the right, the adjacency matrix represents the associated dependency structure S.

Figure 2 - The image shows a neural network. The input layer has five variables: x1, x2, …, x5, and the corresponding output layer has the same five variables. Between the input and output layers there is one hidden layer with six nodes. Some of the units in the input layer are connected to the units in the hidden layer, and some of the units in the hidden layer are connected to the units in the output layer. A careful examination shows that there is a path from a unit xi in the input layer to a unit xj in the output layer whenever there is an edge from the xi node to the xj node in the graph in Figure 1. Note that there are no self-paths, that is, paths from xi in the input layer to xi in the output layer. Some of the remaining neural network connections representing zeroed-out weights are shown in dashed black lines. — Figure 2: Neural view of NGMs: This is a neural network as a multitask learning architecture capturing nonlinear dependencies for the features of the undirected graph in Figure 1. The presence of a path from the input to the output features indicates a dependency between them. The dependency matrix between the input and output of the NN reduces to matrix product operation \(S_{nn}=\prod_i|W_i|=|W_1|\times|W_2|\). Note that not all the zeroed-out weights of the MLP (in black-dashed lines) are shown for the sake of clarity.
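The path-based dependency matrix from the Figure 2 caption can be checked numerically on a toy example. The sketch below uses a hypothetical three-feature graph (edges x1–x2 and x2–x3), hand-picked weights, and a two-unit hidden layer; none of these values come from the paper. It computes the product of the absolute weight matrices and confirms that nonzero entries appear exactly where the graph has an edge, with no self-paths.

```python
def abs_matmul(w1, w2):
    """Return |W1| x |W2| for matrices given as lists of rows."""
    rows, inner, cols = len(w1), len(w2), len(w2[0])
    return [[sum(abs(w1[i][k]) * abs(w2[k][j]) for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]


def matches_graph(s_nn, adjacency):
    """True if S_nn has a nonzero entry exactly where the graph has an edge."""
    n = len(adjacency)
    return all((s_nn[i][j] > 0) == (adjacency[i][j] == 1)
               for i in range(n) for j in range(n))


# Hypothetical undirected graph over (x1, x2, x3): x1 - x2 - x3.
adjacency = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]

# Hand-picked weights (rows = inputs, columns = hidden units, then outputs).
w1 = [[0.0, 0.5],    # x1 feeds hidden unit h2
      [-1.0, 0.0],   # x2 feeds hidden unit h1
      [0.0, 2.0]]    # x3 feeds hidden unit h2
w2 = [[0.3, 0.0, -0.7],  # h1 drives outputs x1 and x3
      [0.0, 1.5, 0.0]]   # h2 drives output x2

s_nn = abs_matmul(w1, w2)
print(s_nn)
print(matches_graph(s_nn, adjacency))
```

Taking absolute values before multiplying matters: it prevents positive and negative weights from cancelling, so a zero entry in the product means that no path exists at all.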

Experimental validations for NGMs

In our paper (opens in new tab), we evaluate NGMs’ performance, inference accuracy, sensitivity to the input graph, and ability to recover the input dependency structure when trained on both real and synthetic data: Infant mortality data (opens in new tab) from the Centers for Disease Control and Prevention (CDC), synthetic Gaussian Graphical model data, and lung cancer data from Kaggle.

The infant mortality dataset (opens in new tab) describes pregnancy and birth variables for all live births in the US and, in instances of infant death before the first birthday, the cause of death. We used the latest available data, which includes information about 3,988,733 live births in the US during 2015. It was particularly challenging to evaluate the inference accuracy of NGMs using this dataset due to the (thankfully) rare occurrence of infant deaths during the first year of life, making queries concerning such low probability events hard to accurately estimate.

We used the CDC data to evaluate the NGMs’ inference accuracy. We compared their predictions for four variables of various types: gestational age (ordinal, expressed in weeks), birth weight (continuous, specified in grams), survival until the first birthday (binary) and the cause of death. We used the categories of “alive,” the 10 most common causes of death, or “other” for the less common causes. Here, “alive” was indicated for 99.48% of infants. We also compared the performance of logistic regression, Bayesian networks, Explainable Boosting Machines (EBM), and NGMs. In the case of NGMs, we trained two models: one using the Bayesian network graph and one using the uGLAD graph.
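The category scheme for the cause-of-death variable (“alive,” the most common causes, and “other”) can be sketched in a few lines. This is only an illustration of the bucketing idea, not the authors' preprocessing code; the function name and the toy labels are made up.

```python
from collections import Counter


def bucket_causes(labels, top_k=10):
    """Keep 'alive' and the top_k most common causes; map the rest to 'other'."""
    counts = Counter(label for label in labels if label != "alive")
    keep = {cause for cause, _ in counts.most_common(top_k)}
    return [label if label == "alive" or label in keep else "other"
            for label in labels]


# Toy labels; with top_k=1 only the most frequent cause keeps its own category.
labels = ["alive", "alive", "cause_a", "cause_a", "cause_b", "cause_c"]
print(bucket_causes(labels, top_k=1))
```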

Our results demonstrate that NGMs are significantly more accurate than logistic regression, more accurate than Bayesian networks, and on par with EBM models for categorical and ordinal variables. They particularly shine when predicting very low probability categories for the multi-valued variable cause of death, where most models (both PGMs and classification models) typically struggle. Note that while we need to train a separate LR and EBM model for each outcome variable evaluated, all variables can be predicted within one trained NGM model. Interestingly, the two NGM models show similar accuracy results despite the differences in the two dependency structures used in training.

We believe that NGMs are an interesting amalgam of deep learning architectures’ expressivity and PGMs’ representation capabilities, and that they can be applied in many domains, given that they place no restrictions on input types and distributions. We encourage you to explore NGMs and take advantage of the ability to work with a wider range of distributions and inputs. You can access the code for Neural Graphical Models on GitHub.

The post Neural Graphical Models appeared first on Microsoft Research.

Announcing the DeepSpeed4Science Initiative: Enabling large-scale scientific discovery through sophisticated AI system technologies

September 19, 2023

by Brenda Potts Microsoft AI


Introduction

In the next decade, deep learning may revolutionize the natural sciences, enhancing our capacity to model and predict natural occurrences. This could herald a new era of scientific exploration, bringing significant advancements across sectors from drug development to renewable energy. In line with Microsoft’s mission to empower every person and every organization on the planet to achieve more, the DeepSpeed (opens in new tab) team at Microsoft is responding to this opportunity by launching a new initiative called DeepSpeed4Science (opens in new tab), aiming to build unique capabilities through AI system technology innovations to help domain experts to unlock today’s biggest science mysteries.

The DeepSpeed (opens in new tab) system is an industry leading open-source AI system framework, developed by Microsoft, that enables unprecedented scale and speed for deep learning training and inference on a wide range of AI hardware. Figure 1 demonstrates our basic approach to this new initiative. By leveraging DeepSpeed’s current technology pillars (training, inference and compression) as base technology enablers, DeepSpeed4Science will create a new set of AI system technologies tailored for accelerating scientific discoveries by addressing their unique complexity beyond the common technical approaches used for accelerating generic large language models (LLMs). We work closely with internal and external teams who own AI-driven science models that represent key science missions, to identify and address general domain-specific AI system challenges. This includes climate science, drug design, biological understanding, molecular dynamics simulation, cancer diagnosis and surveillance, catalyst/material discovery, and other domains.

Figure 1: A three-tier diagram that describes, from bottom to top, our basic approach for executing the DeepSpeed4Science initiative. The bottom tier represents the current three pillars of the DeepSpeed framework: training, inference, and compression. The middle tier, which is the focus of this blog, is a new set of AI system technologies that go beyond generic large language model support, tailored for accelerating scientific discoveries and addressing their complexity. The top tier represents general AI-driven science models across different domains, which can be supported by DeepSpeed4Science software. — Figure 1: DeepSpeed4Science approach: developing a new set of AI system technologies that are beyond generic large language model support, tailored for accelerating scientific discoveries and addressing their complexity.

Our long-term vision is to develop DeepSpeed4Science into a new platform and a unified repository for sharing advanced AI system technologies that support scientific discoveries. DeepSpeed4Science is designed to be inclusive, echoing Microsoft’s AI for Good commitment. That is reflected in the initiative’s support for a diverse group of signature science models, representing some of the most critical AI for science investments. In this blog, we showcase how DeepSpeed4Science helps address two of their critical system challenges in structural biology research: (1) eliminating memory explosion problems for scaling Evoformer-centric protein-structure prediction models, and (2) enabling very-long sequence support for better understanding the evolutionary landscape of pandemic-causing viruses.

Our launch and key collaborators

The new system technologies enabled by DeepSpeed4Science can empower AI-driven scientific discoveries using signature models that represent a wide spectrum of efforts pushing the boundaries of science. Currently, DeepSpeed4Science is honored to support several key science models from Microsoft Research AI4Science (opens in new tab), Microsoft WebXT/Bing (opens in new tab) and U.S. DoE National Labs (opens in new tab).

Current Microsoft internal partnerships

Scientific Foundation Model (SFM), Microsoft Research AI4Science

Figure 2: This figure contains two pieces. The top piece represents the general methodology of building this scientific foundation model (SFM). The bottom section is a GIF that illustrates one important approach that has been developed by Microsoft on protein structure prediction through Distributional Graphormer. Unlike the other protein prediction methods on the market, Distributional Graphormer claims that molecules are not rigid; rather, they are dynamic and can adopt different structures with different probabilities at equilibrium. Distributional Graphormer is the first computational method that can predict equilibrium distribution of molecules by advanced generative AI technology.

Scientific foundation model (SFM) aims to create a unified large-scale foundation model to empower natural scientific discovery by supporting diverse inputs, multiple scientific domains (e.g., drugs, materials, biology, health, etc.) and computational tasks. The DeepSpeed4Science partnership will provide new training and inference technologies to empower the SFM team’s continuous research on projects like Microsoft’s new generative AI methods, such as Distributional Graphormer.

ClimaX, MSR AI4Science

Figure 3: The diagram of a foundation model for weather modeling is shown here. Our changing climate is producing more frequent extreme weather events. To mitigate the negative effects, it is increasingly important to predict where these events will occur. ClimaX is the first foundation model designed to perform a wide variety of weather and climate modeling tasks. It can absorb many different datasets with different variables and resolutions, potentially improving weather forecasting. — Figure 3: ClimaX is the first foundation model designed to perform a wide variety of weather and climate modeling tasks.

Our changing climate is producing more frequent extreme weather events. To mitigate the negative effects, it is increasingly important to predict where these events will occur. ClimaX is the first foundation model designed to perform a wide variety of weather and climate modeling tasks. It can absorb many different datasets with different variables and resolutions, potentially improving weather forecasting. DeepSpeed4Science is creating new system supports and acceleration strategies for ClimaX for efficiently pretraining/finetuning bigger foundation models while handling very large high-resolution image data (e.g., tens to hundreds of petabytes) with long sequences.

AI Powered Ab Initio Molecular Dynamics (AI²MD), MSR AI4Science

Figure 4: This animated figure illustrates one million steps of a molecular dynamics simulation, e.g., RBD-protein interacts with protein inhibitor. Simulations like this are efficient enough to generate trajectories long enough to observe chemically significant events. — Figure 4: One million steps of molecular dynamics simulation: RBD-protein interacts with protein inhibitor.

This project simulates the dynamics of large (million-atom) molecular systems with near ab initio accuracy using AI-powered force field models while maintaining the efficiency and scalability of classical molecular dynamics. The simulations are efficient enough to generate trajectories long enough to observe chemically significant events. Typically, millions or even billions of inference steps are required for this process. This poses a significant challenge in optimizing the inference speed of graph neural network (GNN) + LLM models, for which DeepSpeed4Science will provide new acceleration strategies.

Weather from Microsoft Start, Microsoft WebXT/Bing

Figure 5: This figure shows Microsoft Start precipitation nowcast application on Bing, i.e., every 4 minutes for the next 4 hours. Weather from Microsoft Start provides precise weather information to help users make better decisions for their lifestyles, health, jobs and activities – including accurate 10-day global weather forecasts updated multiple times every hour. — Figure 5: Microsoft Start precipitation nowcast (every 4 minutes for the next 4 hours).

Weather from Microsoft Start (opens in new tab) provides precise weather information to help users make better decisions for their lifestyles, health, jobs and activities (opens in new tab) – including accurate 10-day global weather forecasts updated multiple times every hour. Previously, Weather from Microsoft Start benefited from DeepSpeed technologies to accelerate their multi-GPU training environments. Currently, DeepSpeed4Science is working with the WebXT weather team to further enhance Microsoft Weather services with cutting-edge features and improvements.

Current external collaborators

DeepSpeed4Science’s journey started with two pioneering LLM-based AI models for structural biology research: OpenFold (opens in new tab) from Columbia University, an open-sourced high-fidelity protein structure prediction model; and GenSLMs (opens in new tab) from Argonne National Laboratory (opens in new tab), an award-winning genome-scale language model (opens in new tab) for learning the evolutionary landscape of SARS-CoV-2 (COVID-19) genomes. As the featured showcases for this release, they represent two common AI system challenges facing today’s AI-driven structural biology research. We will discuss how DeepSpeed4Science empowered their scientific discovery in the next section.

Additionally, DeepSpeed4Science has recently expanded its scope to support a more diverse range of science models. For example, in our work with Argonne on training a trillion-parameter science model on Aurora Exascale system (opens in new tab), DeepSpeed4Science technologies will help them reach the performance requirements and scalability needed for this critical mission. Furthermore, by collaborating with Oak Ridge National Lab (opens in new tab) and National Cancer Institute (NCI) (opens in new tab) on cancer surveillance, DeepSpeed4Science will help enable high-fidelity extraction and classification of information from unstructured clinical texts for the MOSSAIC project (opens in new tab). DeepSpeed4Science technologies will also be adopted by Brookhaven National Laboratory (opens in new tab) to support development of a large digital twin model for clean energy research by using LLMs to produce more realistic simulation data. You can find more detailed information about our external colleagues and their science missions at DeepSpeed4Science (opens in new tab).

Partnership showcases

Showcase (I): DeepSpeed4Science eliminates memory explosion problems for scaling Evoformer-centric structural biology models via DS4Sci_EvoformerAttention

Figure 6: The top figure illustrates the prediction demonstration from AlphaFold2 and OpenFold against the baseline experimental result. OpenFold is a community reproduction of DeepMind’s AlphaFold2 that makes it possible to train or finetune AlphaFold2 on new datasets. Researchers have used it to retrain AlphaFold2 from scratch to produce new sets of model parameters, studied the early training phase of AlphaFold2 (shown as the bottom figure), and developed new protein folding systems. The bottom figure demonstrates OpenFold's predictions for PDB chain 7B3A_A as the model trains.

OpenFold (opens in new tab) is a community reproduction of DeepMind’s AlphaFold2 (opens in new tab) that makes it possible to train or finetune AlphaFold2 on new datasets. Researchers have used it to retrain AlphaFold2 from scratch to produce new sets of model parameters, studied the early training phase of AlphaFold2 (Figure 6), and developed new protein folding systems.

Figure 7: Peak memory requirement for training variants of the multiple sequence alignment (MSA) attention kernels (with bias) with the maximum possible training sample dimension in OpenFold. (Left) The original OpenFold implementation with EvoformerAttention used in AlphaFold2. The memory explosion problems in training/inference for these types of protein structure prediction models are common. Particularly, state-of-the-art FlashAttention cannot effectively support such science attention variants. (Right) A new solution from DeepSpeed4Science called DS4Sci_EvoformerAttention significantly reduces OpenFold’s peak memory requirement for training by 13X without accuracy loss.

While OpenFold does include performance and memory optimizations using state-of-the-art system technologies, training AlphaFold2 from scratch is still computationally expensive. The model at the current stage is small in absolute terms, with just 93 million parameters, but it contains several custom attention variants that manifest unusually large activations. During the “finetuning” phase of a standard AlphaFold2 training run, the logit tensor produced in just one of these variants–one designed to attend over the deep protein MSAs fed to the model as input–is in excess of 12GB in half precision alone, dwarfing the peak memory requirements of comparably sized language models. Even with techniques like activation checkpointing and DeepSpeed ZeRO optimizations, this memory explosion problem heavily constrains the sequence lengths and MSA depths on which the model can be trained. Furthermore, approximation strategies can significantly affect the model accuracy and convergence, while still resulting in memory explosion, shown as the left bar (orange) in Figure 7.
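To make the "in excess of 12GB" figure concrete, the following back-of-the-envelope sketch computes the size of one materialized attention logit tensor. The shapes used (an MSA depth of 5,120 sequences, 8 attention heads, and a 384-residue crop) are assumptions chosen for illustration and are not stated in this post; the function name is hypothetical.

```python
def attention_logits_bytes(msa_depth, num_heads, num_residues, bytes_per_element=2):
    """Bytes needed to materialize one logit tensor of shape
    [msa_depth, num_heads, num_residues, num_residues].

    bytes_per_element=2 corresponds to half precision (fp16/bf16).
    """
    return msa_depth * num_heads * num_residues * num_residues * bytes_per_element


# Illustrative shapes only (assumed, not taken from the post).
size = attention_logits_bytes(msa_depth=5120, num_heads=8, num_residues=384)
print(f"{size / 1e9:.2f} GB")  # prints "12.08 GB"
```

Because the tensor grows linearly with MSA depth and quadratically with the number of residues, doubling the crop length under these assumptions would quadruple the footprint, which is why activation checkpointing alone does not remove the constraint.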

To address this common system challenge in structural biology research (e.g., protein structure prediction and equilibrium distribution prediction), DeepSpeed4Science is designing customized exact attention kernels for the attention variants (i.e., EvoformerAttention) that widely appear in this category of science models. Specifically, a set of highly memory-efficient DS4Sci_EvoformerAttention kernels, enabled by sophisticated fusion/tiling strategies and on-the-fly memory reduction methods, has been created for the broader community as high-quality machine learning primitives. Incorporated into OpenFold, they provide a substantial speedup during training and dramatically reduce the model’s peak memory requirement for training and inference. This allows OpenFold to be used to experiment with bigger and more complex models and longer sequences, and to be trained on a wider spectrum of hardware. Detailed information about this technology can be found at DeepSpeed4Science (opens in new tab).

Showcase (II): DeepSpeed4Science enables very-long sequence support via both systematic and algorithmic approaches for genome-scale foundation models (e.g., GenSLMs)

Figure 8: GenSLMs: 2022 ACM Gordon Bell Winning COVID Model (a 25B/33B dense model based on GPT-NeoX). It is used to learn the latent space that describes biologically meaningful properties for SARS-CoV-2 genomes. This GIF visualizes an important protein family, malate dehydrogenase, and shows a projection of the latent space colored by important features such as sequence length and GC content (the ratio of the nucleic acids guanine and cytosine to adenine and thymine, which measures the ability of a DNA strand to withstand heat).

GenSLMs (opens in new tab), a 2022 ACM Gordon Bell award (opens in new tab) winning genome-scale language model from Argonne National Lab, can learn the evolutionary landscape of SARS-CoV-2 (COVID-19) genomes by adapting large language models (LLMs) for genomic data. It is designed to transform how new and emergent variants of pandemic-causing viruses, especially SARS-CoV-2, are identified and classified. GenSLM represents one of the first whole genome-scale foundation models which can generalize to other prediction tasks. A good understanding of the latent space can help GenSLMs tackle new domains beyond just viral sequences and expand their ability to model bacterial pathogens and even eukaryotic organisms, e.g., to understand things such as function, pathway membership, and evolutionary relationships. To achieve this scientific goal, GenSLMs and similar models require very long sequence support for both training and inference that is beyond generic LLMs’ long-sequence strategies like FlashAttention (opens in new tab). Through DeepSpeed4Science’s new designs, scientists can now build and train models with significantly longer context windows, allowing them to explore relationships that were previously inaccessible.

Figure 9: Maximum sequence lengths of GenSLM models (25 billion parameters and 33 billion parameters) supported by different frameworks at different scales. The hardware profiled here is NVIDIA DGX nodes with eight 40G A100 GPUs per node.

Specifically, at the system level, we release the newest Megatron-DeepSpeed (opens in new tab) framework for very-long sequence support along with other new optimizations (opens in new tab). Scientists can now train their large science models like GenSLMs with much longer sequences via a synergetic combination of our newly added memory optimization techniques on attention mask and position embedding, tensor parallelism, pipeline parallelism, sequence parallelism, ZeRO-style data parallelism and model state offloading. Figure 9 demonstrates that our new release increases the maximum sequence length for GenSLMs’ 25B and 33B models by up to 12X and 14X, respectively, over the previous Megatron-DeepSpeed. In terms of supported sequence lengths, this new framework also significantly outperforms NVIDIA’s Megatron-LM by up to 9.8X and 9.1X for the 25B and 33B models, respectively. For example, GenSLMs’ 25B model can now be trained with a 512K sequence of nucleotides, compared to the Argonne team’s original 42K sequence length on 64 GPUs. This drastically improves model quality and scientific discovery scope with no accuracy loss. Additional support for domain scientists who prefer algorithmic strategies like relative position embedding techniques is also integrated in this new release (opens in new tab).
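As a rough illustration of why memory optimizations on the attention mask matter at these lengths, the sketch below estimates the size of a naive dense mask at the 42K and 512K sequence lengths mentioned above. It assumes one byte per mask element and that "K" means 1,024 tokens; neither assumption comes from the post, and the function name is hypothetical. It says nothing about how Megatron-DeepSpeed actually stores its mask.

```python
def dense_mask_bytes(seq_len, bytes_per_element=1):
    """Bytes for a naive dense [seq_len, seq_len] attention mask (assumed layout)."""
    return seq_len * seq_len * bytes_per_element


for tokens in (42 * 1024, 512 * 1024):
    gib = dense_mask_bytes(tokens) / 2**30
    print(f"{tokens:>7} tokens -> {gib:,.1f} GiB")

# Ratio of the two sequence lengths quoted in the text.
length_ratio = 512 / 42
print(f"sequence length ratio: {length_ratio:.1f}x")
```

Under these assumptions the 512K mask alone would need 256 GiB, far more than a single 40G A100 holds, while the sequence length itself grows by only about 12.2x. The quadratic growth is the point of the example.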

Summary and roadmap

We are very proud and excited to announce the DeepSpeed4Science initiative along with several R&D highlights and achievements. Starting today, we will host our new initiative at DeepSpeed4Science (opens in new tab), including information about our external colleagues, and current and future DeepSpeed4Science technology releases. One of our high-level goals is to generalize AI system technologies that broadly address the major system pain points for large-scale scientific discoveries. We hope scientists around the world will enjoy the new capabilities unlocked by DeepSpeed4Science through open-sourced software. We are looking forward to better understanding the AI system design challenges that block your discovery progress. We sincerely welcome your participation to help us build a promising AI4Science future. Please email us at deepspeed-info@microsoft.com (opens in new tab). We encourage you to report issues, contribute PRs, and join discussions on our DeepSpeed GitHub (opens in new tab) page.


Acknowledgements

Core DeepSpeed4Science Team:

Shuaiwen Leon Song (DeepSpeed4Science lead), Minjia Zhang, Conglong Li, Shiyang Chen, Chengming Zhang, Xiaoxia (Shirley) Wu, Masahiro Tanaka, Martin Cai, Adam Graham, Charlie Zhou, Yuxiong He (DeepSpeed team lead)

Our Founding Collaborators (in alphabetical order):

Argonne National Lab team: Rick Stevens, Cristina Negri, Rao Kotamarthi, Venkatram Vishwanath, Arvind Ramanathan, Sam Foreman, Kyle Hippe, Troy Arcomano, Romit Maulik, Maxim Zvyagin, Alexander Brace, Yuntian Deng, Bin Zhang, Cindy Orozco Bohorquez, Austin Clyde, Bharat Kale, Danilo Perez-Rivera, Heng Ma, Carla M. Mann, Michael Irvin, J. Gregory Pauloski, Logan Ward, Valerie Hayot, Murali Emani, Zhen Xie, Diangen Lin, Maulik Shukla, Weili Nie, Josh Romero, Christian Dallago, Arash Vahdat, Chaowei Xiao, Thomas Gibbs, Ian Foster, James J. Davis, Michael E. Papka, Thomas Brettin, Anima Anandkumar

AMD: Ivo Bolsen, Micheal Schulte, Bo Begole, Angela Dalton, Steve Reinhart, Ashwin Aji, Jalal Mahmud, Mahesh Balashibramanian

Brookhaven National Lab team: Adolfy Hoisie, Shinjae Yoo, Yihui Ren.

Columbia University OpenFold team: Mohammed AlQuraishi, Gustaf Ahdritz

Microsoft Research AI4Science team: Christopher Bishop, Bonnie Kruft, Max Welling, Tie-Yan Liu, Christian Bodnar, Johannes Brandsetter, Wessel Bruinsma, Chan Cao, Yuan-Jyue Chen, Peggy Dai, Patrick Garvan, Liang He, Elizabeth Heider, PiPi Hu, Peiran Jin, Fusong Ju, Yatao Li, Chang Liu, Renqian Luo, Qi Meng, Frank Noe, Tao Qin, Janwei Zhu, Bin Shao, Yu Shi, Wenlei Shi, Gregor Simm, Megan Stanley, Lixin Sun, Yue Wang, Tong Wang, Zun Wang, Lijun Wu, Yingce Xia, Leo Xia, Shufang Xie, Shuxin Zheng, Jianwei Zhu

Oak Ridge National Lab team: Prassana Balaprakash, Georgia Tourass

Princeton University: William Tang, Kyle Felker, Alexey Svyatkovskiy (Microsoft liaison)

Rutgers University: Hang Liu

WebXT Weather team: Pete Luferenko, Divya Kumar, Jonathan Weyn, Ruixiong Zhang, Sylwester Klocek, Volodymyr Vragov

The post Announcing the DeepSpeed4Science Initiative: Enabling large-scale scientific discovery through sophisticated AI system technologies appeared first on Microsoft Research.

Microsoft at ACM SIGCOMM 2023: Innovating the future of networking

September 14, 2023

by Brenda Potts Microsoft AI

Modern applications heavily rely on robust network infrastructure, requiring continuous innovation. In this evolving landscape, Microsoft is at the forefront, spearheading innovation efforts in networking and strengthening the foundational network infrastructure that underpins the cloud ecosystem. By investing in and enhancing this critical infrastructure, Microsoft not only ensures the resilience and scalability of cloud services but also lays the groundwork for the sophisticated and transformative applications that will continue to define the technological landscape.

ACM SIGCOMM (opens in new tab), the premier annual conference of the Association for Computing Machinery’s special interest group on data communication (opens in new tab) (SIGCOMM), is dedicated to the study of communication and computer networks. Microsoft was proud to be a Gold Sponsor of this year’s conference, publishing 10 papers and participating in the organizing committee. Dave Maltz (opens in new tab), technical fellow and corporate vice president of Azure Networking, served as one of the program committee chairs, helping to oversee the conference’s technical program. Additionally, we are proud to acknowledge the significant achievement of one of our youngest researchers, Siva Kakarla (opens in new tab), recognized as the ACM SIGCOMM Dissertation Award (opens in new tab) runner-up for his thesis, “Formal Methods for a Robust Domain Name System (opens in new tab).”

Microsoft also had a booth showcasing some of our latest technologies, including hollow core fiber-based connectivity, SONiC on smart switches, container networking, technologies for L3/L4-based DDoS protection, and technologies that we are building to extend the cloud into space, for both earth observation and satellite communication.

Paper highlights

The papers Microsoft published at SIGCOMM 2023 span a wide spectrum of networking domains, ranging from 5G and wide area networks (WAN) to enterprise networks. They also explore various aspects of networking, including traffic engineering, network offload strategies, and specialized network designs tailored for applications like gaming, video conferencing, and financial services.  

Here are some of the highlights:

Switchboard: Efficient Resource Management for Conferencing Services

Efficient resource management is crucial for conferencing services, such as Microsoft Teams, to balance user experience and cost-effectiveness. This involves optimizing the allocation of media processing servers, responsible for handling media streams during calls. Rahul Bothra, Rohan Gandhi, Ranjita Bhagwan, Venkat Padmanabhan, Rui Liang, Steve Carlson, Vinayaka Kamath, Sreangsu Acharyya, Ken Sueda, Somesh Chaturmohta, and Harsha Sharm introduce Switchboard, a significant advancement in resource management controllers. Switchboard is peak-aware, recognizing that resource costs vary with peak usage times and across time zones, allowing servers to serve calls during peak times and act as backups during off-peak hours. Additionally, it enhances efficiency by coordinating network and compute provisioning and application-aware resource allocation. Evaluation using Microsoft Teams data demonstrates that Switchboard reduces provisioning costs by up to 51 percent while maintaining or improving latency compared to existing solutions.

Resilient Baseband Processing in Virtualized RANs with Slingshot

In the realm of cellular networks, virtualized radio access networks (vRANs) are gaining prominence, replacing traditional specialized hardware with software on commodity servers. However, current vRAN setups lack resilience, making it challenging to implement failover mechanisms and upgrades without prolonged service interruptions. Nikita Lazarev, Tao Ji, Anuj Kalia, Daehyeok Kim, Ilias Marinos, Francis Y. Yan, Christina Delimitrou, Zhiru Zhang, and Aditya Akella propose Slingshot, an innovative system designed to seamlessly introduce resilience to the most critical layer of vRANs, the physical layer (PHY). Slingshot accomplishes this by employing novel techniques for real-time workload migration, incorporating fast RAN protocol middleboxes, and implementing real-time RAN failure detection. A key breakthrough in Slingshot’s design is its approach of treating transient disruptions from resilience events as akin to regular wireless signal impairments, using the inherent resilience of cellular networks to these occurrences. Experiments conducted on a cutting-edge 5G vRAN testbed demonstrate Slingshot’s capability to manage PHY failover without interrupting video conferencing, while causing under 110 microseconds of disruption to a TCP connection. Furthermore, it enables seamless zero-downtime upgrades in vRAN deployments.

DBO: Response Time Fairness for Cloud-Hosted Financial Exchanges

When hosting financial exchanges in cloud environments, ensuring equal and predictable latency for all market participants is critical, especially in tasks like high-speed trading. Existing cloud deployments often struggle to maintain such fairness due to factors like congestion and varying network paths. In this paper, Prateesh Goyal, Eashan Gupta, Ilias Marinos, Chenxingyu Zhao, Radhika Mittal, and I (Ranveer Chandra) tackle the issue arising from the lack of determinism in cloud networks, showing that achieving predictable or bounded latency isn’t a necessity to ensure fairness. Inspired by the concept of logical clocks in distributed systems, the paper introduces Delivery Based Ordering (DBO) as a novel approach to rectifying latency discrepancies among participants, helping ensure fairness. The evaluation of DBO, conducted both in a hardware testbed and a public cloud environment, demonstrates its feasibility in achieving guaranteed fairness and sustaining sub-100 microsecond latency, even at high transaction rates.

For the complete list of accepted publications by Microsoft researchers, please see the publications list on Microsoft at SIGCOMM 2023.


Learn about opportunities

Microsoft welcomes talented individuals across various roles at Microsoft Research, Azure Networking, and other departments. Whether you’re a networking partner or researcher, we welcome your collaboration and exploration to advance computer networking and invite you to be part of the team crafting cutting-edge solutions for industry challenges. Review our open positions at the Microsoft Research website.

The post Microsoft at ACM SIGCOMM 2023: Innovating the future of networking appeared first on Microsoft Research.

AI Frontiers: The future of scale with Ahmed Awadallah and Ashley Llorens

September 14, 2023

by Brenda Potts Microsoft AI

MSR Podcast | AI Frontiers | Ahmed Awadallah

Episode 149 | Sept. 14, 2023

Powerful large-scale AI models like GPT-4 are showing dramatic improvements in reasoning, problem-solving, and language capabilities. This marks a phase change for artificial intelligence—and a signal of accelerating progress to come.

In this Microsoft Research Podcast series, AI scientist and engineer Ashley Llorens hosts conversations with his collaborators and colleagues about what these models—and the models that will come next—mean for our approach to creating, understanding, and deploying AI, its applications in areas such as healthcare and education, and its potential to benefit humanity.

This episode features Senior Principal Research Manager Ahmed H. Awadallah, whose work improving the efficiency of large-scale AI models and efforts to help move advancements in the space from research to practice have put him at the forefront of this new era of AI. Awadallah discusses the shift in dynamics between model size and amount—and quality—of data when it comes to model training; the recently published paper “Orca: Progressive Learning from Complex Explanation Traces of GPT-4,” which further explores the use of large-scale AI models to improve the performance of smaller, less powerful ones; and the need for better evaluation strategies, particularly as we move into a future in which Awadallah hopes to see gains in these models’ ability to continually learn.

Learn more:

Orca: Progressive Learning from Complex Explanation Traces of GPT-4
Publication, June 2023
Textbooks Are All You Need II: phi-1.5 technical report
Publication, September 2023
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework
Publication, August 2023
LIDA: Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models
Publication, March 2023
AI Explainer: Foundation models and the next era of AI
Microsoft Research blog and video, March 2023
AI and Microsoft Research
Learn more about the breadth of AI research at Microsoft

Transcript

[MUSIC PLAYS]

ASHLEY LLORENS: I’m Ashley Llorens with Microsoft Research. I’ve spent the last 20 years working in AI and machine learning, but I’ve never felt more inspired to work in the field than right now. The release of GPT-4 was a watershed moment in the pursuit of artificial intelligence, and yet progress continues to accelerate. The latest large-scale AI models and the systems they power are continuing to exhibit improvements in reasoning, problem-solving, and translation across languages and domains. In this podcast series, I’m sharing conversations with fellow researchers about the latest developments in large-scale AI models, the work we’re doing to understand their capabilities and limitations, and ultimately how innovations like these can have the greatest benefit for humanity. Welcome to AI Frontiers.

Today, I’ll speak with Ahmed Awadallah. Ahmed is a Senior Principal Researcher at Microsoft Research in Redmond. Much of his work focuses on machine learning, helping to create foundation models that excel at key tasks while using less compute and energy. His work has been at the leading edge of recent progress in AI and gives him a unique perspective on where it will go next.

[MUSIC FADES]

All right, Ahmed, let’s dive right in. Among other things, I find that people are hungry to understand the drivers of the progress we’re seeing in AI. Over these last few years when people like you or I have tried to explain this, we’ve often pointed to some measure of scale. You know, I know many times as I’ve given talks in AI, I’ve shown plots that feature some kind of up-and-to-the-right trend in scale over time—the increasing size of the AI models we’re training, the increasing size of the datasets we’re using to train them on, or even the corresponding increase in the overall compute budget. But when you double-click into this general notion of scale related to large AI models, what gets exposed is really a rapidly evolving frontier of experimental science. So, Ahmed, I’m going to start with a big question and then we can kind of decompose it from there. As someone at the forefront of all of this, how has your understanding of what’s driving progress in AI changed over this last year?

AHMED AWADALLAH: Thanks, Ashley. That’s a very good question. And the short answer is it’s changed a lot. I think I have never been learning as much as I have been throughout my career. Things are moving really, really fast. The progress is amazing to witness, and we’re just learning more and more every day. To your point, for quite some time, we were thinking of scale as the main driver of progress, and scale is clearly very important and necessary. But over the last year, we have been also seeing many different things. Maybe the most prominent one is the importance of data being used for training these models. And that’s not very separate from scale, because when we think about scale, what really matters is how much compute we are spending in training these models. And you can choose to spend that compute in making the model bigger or in training it on more and more data, training it for longer. And it has been over the past few years a lot of iterations in trying to understand that. But it has been very clear over the last year that we were, in a sense, underestimating the value of data in different ways: number one, in having more data but even more important, the quality of the data, having cleaner data, having more representative data, and also the distribution or the mixing of the data that we are using. Like, for example, one of the very interesting things we have witnessed maybe over the last year to year and a half is that a lot of the language models are being trained on text and code. And surprisingly, the training on code is actually helping the model a lot—not just in coding tasks but in normal other tasks that do not really involve coding. 
More importantly, I think one of the big shifts last year in particular—it has been happening for quite some time but we have been seeing a lot of value for it last year—is that there are now like two stages of training these models: the pretraining stage, where you are actually training the language model in an autoregressive manner to predict the next word. And that just makes it a very good language model. But then the post-training stage with the instruction tuning and RLHF (reinforcement learning from human feedback) and reward models, using a very different form of data; this is not self-supervised, freely available data on the internet anymore. This is human-generated, human-curated, maybe a mixture of model- and human-curated data that’s trying to get the model to be better at very specific elements like being more helpful or being harmless.

LLORENS: There’s so much to unpack even in that, in that short answer. So let’s, let’s dig in to some of these core concepts here. You, you teed up this notion of ways to spend compute, you know, ways to spend a compute budget. And one of the things you said was, you know, one of the things we can do is make the model bigger. And I think to really illustrate this concept, we need to, we need to dig in to what that means. One, one concept that gets obfuscated there a little bit is the architecture of the model. So what does it mean to make the model bigger? Maybe you can tell us something about, you know, how to think about parameters in the model and how important is architecture in that, in that conversation.

AWADALLAH: So most of the progress, especially in language and other domains, as well, have been using the transformer model. And the transformer model have been actually very robust to change over the years. I don’t … I think a lot … I’ve asked a lot of experts over the years whether they had expected the transformer model to be still around five, six years later, and most of them thought we would have something very different. But it has been very robust and very universal, and, yes, there have been improvements and changes, but the core idea has still been the same. And with dense transformer models, the size of the model tends to be around the number of layers that you have in the model and then the number of parameters that you have in each layer, which is basically the depths and the widths of the model. And we have been seeing very steady exponential increase in that. It’s very, it’s very interesting to think that just like five years ago when BERT came up, the large model was like 300-something million parameters and the smaller one was 100 million parameters. And we consider these to be really large ones. Now that’s a very, very small scale. So things have been moving and moving really fast in making these models bigger. But over the time, there started to be an understanding being developed of how big should the model be. If I were to invest a certain amount of compute, what should I do with that in terms of the model size and especially on how it relates to the data side? And, perhaps, one of the most significant efforts there was the OpenAI scaling laws, which came up in 2020, late 2020, I think. And it was basically saying that if you are … if you have 10x more compute to spend, then you should dedicate maybe five of that … 5x of that to making the model bigger—more layers, more width—and maybe 2x to making the data bigger. 
And that translated to … for, like say, GPT-3-like model being trained on almost 300 billion tokens, and for quite some time, the 300 billion tokens was stuck, like it became the standard, and a lot of people were using that. But then fast-forward less than two years later came the second iteration of the scaling laws, the Chinchilla paper, where the, the recommendation was slightly different. It was like we were not paying enough attention to the size of the data. Actually, you should now think of the data and the size as equally … and the size of the model … as equally important. So if you were to invest in X more, you should just split them evenly between bigger models and more data. And that was quite a change, and it actually got all the people to pay more attention to the data. But then fast-forward one more year, in 2023—and maybe pioneered mostly with the Llama work from Meta and then many, many others followed suit—we started finding out that we don’t have to operate at this optimal point. We can actually push for more data and the model will continue to improve. And that’s interesting because when you are thinking about the training versus the deployment or the inference parts of the life cycle of the model, they are actually very different. When you are training the model, you would like the model to learn to generalize as best as possible. When you are actually using the model, the size of the model becomes a huge difference. I actually recall an interesting quote from a 2015 paper by Geoff Hinton and others. That’s the paper that introduced the idea of distillation for neural networks. Distillation was there before from the work of, of Rich Caruana, our colleague here at Microsoft, and others. 
But in 2015, there was this paper specifically discussing distilling models for neural network models, and one of the motivating sentences at the very beginning of the paper was basically talking about insects and how insects would have different forms throughout their life cycles. At the beginning of their life, they are optimized for extracting energy and nutrients from the environment, and then later on, in their adult form, they have very different forms as optimized for flying and traveling and reproduction and so on and so forth. So that, that analogy is very interesting here because like you can think about the same not just in the context of distillation, as this paper was describing, but just for pretraining the models in general. Yes, the optimal point might have been to equally split your compute between the data and the size, but actually going more towards having more and more data actually is beneficial. As long as the model is getting better, it will give you a lot more benefit because you have a smaller model to use during the inference time. And we would see that with the latest iteration of the Llama models, we are now seeing models as small as 7 billion parameters being trained on 1 to 2 trillion tokens of data, which was unheard before.
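The compute-allocation arithmetic discussed above can be sketched in a few lines. This is a minimal illustration, not a reproduction of either paper: it assumes training compute is proportional to parameters times tokens, uses a model-size exponent of roughly 0.73 for the 2020 scaling laws and 0.5 for Chinchilla (both treated here as assumptions), and the helper name `split_compute` is hypothetical.

```python
def split_compute(multiplier, model_exponent):
    """Split a compute multiplier into (model growth, data growth).

    Assumes compute ~ parameters * tokens, so the two growth factors
    multiply back to `multiplier`.
    """
    model_growth = multiplier ** model_exponent
    data_growth = multiplier ** (1.0 - model_exponent)
    return model_growth, data_growth


# 10x more compute under the two allocation rules (exponents are assumptions).
kaplan = split_compute(10, 0.73)      # favors a bigger model
chinchilla = split_compute(10, 0.5)   # splits evenly between model and data

print(f"2020 scaling laws: model x{kaplan[0]:.1f}, data x{kaplan[1]:.1f}")
print(f"Chinchilla:        model x{chinchilla[0]:.1f}, data x{chinchilla[1]:.1f}")

# A 7-billion-parameter model trained on 1 trillion tokens, as mentioned above.
tokens_per_param = 1e12 / 7e9
print(f"tokens per parameter: {tokens_per_param:.0f}")
```

With these exponents, 10x compute becomes roughly 5.4x model and 1.9x data under the earlier rule, matching the "5x and 2x" recollection, and about 3.2x each under the even split. The last line shows how far past an even split the small, heavily trained models go: well over a hundred tokens per parameter.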

LLORENS: Let’s talk a bit more about evaluating performance. Of course, the neural scaling laws that you referenced earlier really predict how the performance of a model on the task of next word prediction will improve with the size of the model or the size of the data. But of course, that’s not what we really care about. What we’re really after is better performance on any number of downstream tasks like reasoning, document summarization, or even writing fiction. How do we predict and measure performance in that broader sense?

AWADALLAH: Yeah, that’s a very good question. And that’s another area where our understanding of evaluating generative models in general has been challenged quite a bit over the last year in particular. And I think one of the areas that I would recommend to spend a lot of time working on right now is figuring out a better strategy around evaluating generative language models. We … this field has been very benchmark driven for many, many years, and we have been seeing a lot of very well-established benchmarks that have been helping the community in general make a lot of progress. We have seen leaderboards like GLUE and SuperGLUE, and many, many others play a very important role in the development of pretrained models. But over the last year, there has been a lot of changes. One is that these benchmarks are being saturated really, really quickly. There was … this paper that I was reading a few, reading a few months back talking about how we went from times where benchmarks like Switchboard and MNIST for speech and image processing lasted for 10 to 20 years before they get saturated to times where things like SQuAD and GLUE and SuperGLUE are getting saturated in a year or two to now where many of the benchmarks just get like maybe two or three submissions and that’s it. It gets saturated very quickly after that. BIG-Bench is a prime example of that, where it was like a collaborative effort, over 400 people coming together from many different institutions designing a benchmark to challenge language models. And then came GPT-4, and we’re seeing that it’s doing really, really, really well, even in like zero-shot and, and, and few-shot settings, where the tasks are completely new to the models. So the model out of the box is basically solving a lot of the benchmarks that we have. That’s an artifact of the significant progress that we have been seeing and the speed of that progress, but it’s actually making that, that answer to that question even harder.
But there’s another thing that’s making it even harder is that the benchmarks are giving us a much more limited view of the actual capabilities of these models compared to what they can actually do, especially models like GPT-4. The, the breadth of capabilities of the model is beyond what we had benchmarks to measure it with. And we have seen once it was released, then once people started interacting with it, there are so many experiences and so many efforts just thinking about what can we do with that model. Now we figured out that it can do this new task; it can do that new task. I can use it in this way that I didn’t think about before. So that expansion in the surface of capabilities of the model is making the question of evaluating them even, even harder and, and moving forward, I think this would be one of the most interesting areas to really spend time on.

LLORENS: Why don’t we talk a bit about a paper that you recently published with some Microsoft Research colleagues called “Orca: Progressive Learning from Complex Explanation Traces of GPT-4.” And there’s a couple of, of concepts that we’ve been talking about that I want to pull through to, to a discussion around, around this work. One is the idea of quality of data. And so it would be great to hear, you know, some of the intuitions around … yeah, what, what drove you to focus on data quality versus, you know, number of parameters or number of tokens? And then we can also come back to this notion of benchmarks, because to publish, you have to pick some benchmarks, right? [LAUGHS] So, so first, why don’t we talk about the intuitions behind this paper and what you did there, and then I’d love to understand how you thought through the process of picking benchmarks to evaluate these models.

AWADALLAH: Yeah, so, so in this paper, we were basically thinking about like … there has been a lot of work actually on thinking about how do we have a very powerful model and use it to improve a less powerful model. This is not a new concept. It has been there forever, and I mentioned the Hinton et al. paper on distillation, one of the pioneer papers applying that to neural networks. And over time, this field actually continued getting better and better. And the way the large, more powerful models were used just continued evolving. So people were using the logits generated by the model and then maybe looking at intermediate layers and their output, maybe looking at attention maps and trying to map that between the models and coming up with more and more complex ways of distilling information from the powerful model to improve a less powerful model. But with models like GPT-4, we were thinking that GPT-4 is so good that you can actually start thinking about different ways of having a model teaching another model. And in that particular case, the idea was, can we actually have the powerful model explain in step by step how to do the task, and can we actually have a smaller model learn from that? And how far can this actually help the smaller one? A big part of this has to do with the data quality but also with the teacher model quality. You wouldn’t be able to … and this gets us into the whole notion of synthesized data and the role of synthesized data can play in making models better. Models like GPT-4, the level of capability where you could actually generate a lot of synthetic data at a very high quality comparable in some cases to what you’d get from a human, better in some cases than what you could get from a human. 
And even more than that, when you are working with a model like GPT-4, there has been a lot of work over the last few months demonstrating that you can even get the model to be a lot better by having the model reflect on what it’s doing, having the model critique what it’s doing and try to come up with even corrections and improvements to its own generation. And once you have this going, you see that you can actually create very high-quality synthetic data in so many ways, mostly because of the quality of the model but also because of like these different ways of generating the data on top of the model. And then it was really an experiment of how far can another model learn from these models. And by the way—and there is … we’re seeing some work like that, as well—it doesn’t even have to be a different model. It can be the same model improving itself. It can be the same model giving feedback to itself. That coincided with actually us having, having … we have been spending a lot of time thinking about this idea of learning from feedback or like continual improvement. How can we take a language model and continue to improve it based on interaction, based on feedback? So we started connecting these two concepts and basically thinking of it like the powerful model is just giving feedback to our much less powerful model and trying to help it improve across certain dimensions. And that’s where that line of work started. And what we were finding out is that you can actually have the more powerful model teach a smaller model. It would have definitely much narrower capabilities than the bigger model because like by virtue of this training cycle, you are just focused on teaching it particular concepts. You cannot teach it everything that the larger model can do. But also because this is another example of this like post-training step, like this model has already been pretrained language model and it’s always limited by the basic capabilities that it has. 
So, yes, the large language model can teach it a little bit more, but it will always be limited by that.

LLORENS: Now you mentioned … you’ve sketched out now the idea of using a powerful general-purpose model through some process of distillation to train a, a smaller, more special, more specialized model. And in the paper, you, you and your colleagues offer a number of case studies. So can you, can you pick one? Give, give us, you know, give us an example of a specialized domain and the way that you utilize GPT-4 to accomplish this training and what the performance outcome was.

AWADALLAH: Yeah, actually, when we were working on this paper, the team was thinking that what capability should we try to focus on to, to demonstrate that the small model can improve from, from the guidance of the much more powerful model. And we were thinking it would be very cool if we can demonstrate that the small model can get better at reasoning, because reasoning has been one of the capabilities that have been clearly emerging with larger and larger models, and models like GPT-4 demonstrate the level of reasoning that we have never seen with any of our systems before. So we were thinking can we … can, can GPT-4 help actually get the smaller model to be better at reasoning. And that had a lot of implications on the selection of what datasets to use for, for creating the synthetic data. In this particular paper, by the way, we’re not, we’re not using GPT-4 to answer the questions. We already have the questions and the answers. We are just asking GPT-4 to explain it in step by step. This is similar to what we have been seeing with chain-of-thought reasoning, chain-of-thought prompting, and other different prompting techniques showing that if you actually push the language model to go step by step, it can actually do a lot better. So we are basically saying, can we have these explanations and step-by-step traces and have them help the smaller language model learn to reason a little bit better. And because of that, actually—and this goes back to your earlier questions about benchmarks—in this particular paper, we chose two main benchmarks. There were more than two, but like the two main benchmarks were BIG-Bench Hard and AGIEval. BIG-Bench Hard is a 23-task subset of BIG-Bench that we were just talking about earlier, and a lot of the tasks are very heavy on reasoning. AGIEval is a set of questions that are SAT-, LSAT-, GRE-, and GMAT-type of questions. They are also very heavy on reasoning.
The benchmarks were selected to highlight the reasoning improvement and the reasoning capability of the model. And we had, we had a bunch of use cases there, and you would see one of the common themes there is that there is actually … even before the use cases, if you look at the, the results, the reasoning ability as measured by these two benchmarks at least of the base model significantly improved. Still far behind the teacher. The teacher is much, much more powerful and there’s no real comparison, but still the fact that collecting synthetic data from a model like GPT-4 explaining reasoning steps could help a much smaller model get better at reasoning and get better by that magnitude was a very interesting finding. We were, we were quite a bit surprised, actually, by the results. We thought that it will improve the model reasoning abilities, but it actually improved it beyond what we expected. And again, this goes back to like imagine if we were … if we wanted to do that without a model like GPT-4, that would entail having humans generate explanations for a very large number of tasks and make sure that these explanations remain faithful and align with the answers of the question. It would have been a very hard task, and the type of annotator that you would like to recruit in order to do that, it would have been … even made it harder and slower. But having, having the capabilities of a model like GPT-4 is really what made it possible to do that.
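To make the data setup described above concrete, here is a minimal sketch of how explanation-trace training records might be assembled when the questions and answers already exist and the teacher is only asked to explain. Everything named here is an assumption for illustration: the prompt wording, the `QAExample` type, the record layout, and `stub_teacher`, which stands in for a call to a teacher model such as GPT-4. It is not the pipeline used in the Orca paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical instruction text; the wording used in the paper is not given in this conversation.
SYSTEM_PROMPT = "Explain step by step how to arrive at the given answer."


@dataclass
class QAExample:
    question: str
    answer: str


def build_teacher_prompt(example: QAExample) -> str:
    """The teacher sees both the question and the known answer and only explains."""
    return (
        f"{SYSTEM_PROMPT}\n\n"
        f"Question: {example.question}\n"
        f"Answer: {example.answer}\n\n"
        "Explanation:"
    )


def build_training_record(example: QAExample,
                          teacher: Callable[[str], str]) -> Dict[str, str]:
    """Pair the original question with the teacher's trace plus the known answer."""
    explanation = teacher(build_teacher_prompt(example)).strip()
    return {
        "prompt": example.question,
        "target": f"{explanation}\nFinal answer: {example.answer}",
    }


def stub_teacher(prompt: str) -> str:
    # Placeholder for a real teacher-model call (assumption, returns canned text).
    return "Step 1: Add the two numbers. Step 2: 2 + 3 = 5."


example = QAExample(question="What is 2 + 3?", answer="5")
record = build_training_record(example, stub_teacher)
```

The smaller model would then be fine-tuned on `record["prompt"]` and `record["target"]` pairs, so it learns the step-by-step trace rather than only the final answer.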

LLORENS: You’ve, you’ve outlined now, you know, your experiments around using GPT-4 to train a smaller model, but earlier, you also alluded to a pretty compelling idea that maybe even a large, powerful model could, I guess, self-improve by generate, you know, performing a generation, critiquing itself, and then somehow guiding, you know, the parameter weights in a way that, that was informed by the critique. Is that, was that part of these experiments, or what … or, or is that … does that work? [LAUGHS] Have, have we … do we have experimental evidence of that?

AWADALLAH: Yeah, I think, I think that’s a very good question. That was really how we started. That was really what we were aiming and still trying to do. The value … we started off by asking that question: can we actually have a model self-improve, self-improve itself? From an experimental perspective, it was much easier to have a powerful model help a smaller model improve. But self-improvement is really what we, what got us excited about this direction from the beginning. There has been evidence from other work showing up over the last short period actually showing that this is actually a very promising direction, too. For example, one of the very interesting findings about these powerful models—I think that the term frontier models is being used to refer to them now—is that they have a very good ability at critiquing and verifying output. And sometimes that’s even better than their ability at solving the task. So you can basically go to GPT-4 and ask it to solve a coding question. Write a Python function to do something. And then you can go again to GPT-4 and ask it to look back at that code and see if there are any bugs in there. And surprisingly, it would identify bugs in its own generation with a very high quality. And then you can go back to GPT-4 again and ask it to improve its own generation and fix the bugs. And it does that. So we actually have a couple of experiments with that. One of them in a toolkit called LIDA that one of my colleagues here, Victor [Dibia], has been working on for some time. LIDA is a tool for visualizations, and you basically go there and submit a query. The query would be, say, create a graph that shows the trends of stocks over the last year. And it’ll actually go to the data basically, engineer Python code. The Python code, when compiled and executed, would generate a visualization. But then we were finding out that we don’t have to stop there. 
We can actually ask GPT-4 again to go back to that visualization and critique it, and it doesn’t have to be open critique. We can define the dimensions that we would like to improve on and ask GPT-4 to critique and provide feedback across these dimensions. Like it could be the readability of the chart. It could be, is the type of chart the best fit for the data? And surprisingly it does that quite well. And then that opens the door to so many interesting experiences where you can, after coming up with the initial answer, you can actually suggest some of these improvements to a human. Or maybe if you are confident enough, you just go ahead and apply them even without involving the human in the loop and you actually get a lot better. There was another experiment like that where another colleague of mine has been working on a library called AutoGen, which basically helps with these iterative loops on top of language models, as well as figuring out values of hyperparameters and so on and so forth. And the experiments were very similar. There was a notion there of like having a separate agent that the team refers to as a user proxy agent, and that agent basically has a criteria of what the user is trying to do. And it keeps asking GPT-4 to critique the output and improve the output up until this criteria is met. And we see that we get much, much better value with using GPT-4 this way. That cycle is expensive, though, because you have to iterate and go back multiple times. The whole idea of self-improvement is basically, can we literally distill that cycle into the model itself again so that as the model is being used and being asked to maybe critique and provide feedback or maybe also getting some critique and feedback from the human user, can we use that data to continue to improve the model itself?
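The generate, critique, and improve cycle described above can be sketched as a small loop. This is a toy illustration under stated assumptions: `refine`, the `toy_*` functions, and the dictionary-based chart draft are all hypothetical stand-ins for model calls, and none of this is the LIDA or AutoGen API. The critique is restricted to named dimensions (readability and chart type), and the loop stops once no issues remain or a round limit is reached.

```python
from typing import Callable, Dict, List, Tuple

Draft = Dict[str, str]


def refine(task: str,
           generate: Callable[[str], Draft],
           critique: Callable[[str, Draft], List[str]],
           improve: Callable[[str, Draft, List[str]], Draft],
           max_rounds: int = 3) -> Tuple[Draft, int]:
    """Generate a draft, then alternate critique and improvement until the
    critique finds no issues or max_rounds improvements have been applied."""
    draft = generate(task)
    rounds = 0
    while rounds < max_rounds:
        issues = critique(task, draft)
        if not issues:
            break
        draft = improve(task, draft, issues)
        rounds += 1
    return draft, rounds


# Toy stand-ins for model calls (assumptions, not real model output).
def toy_generate(task: str) -> Draft:
    return {"chart": "pie", "title": ""}


def toy_critique(task: str, draft: Draft) -> List[str]:
    issues = []
    if not draft["title"]:
        issues.append("readability: missing title")
    if draft["chart"] != "line":
        issues.append("chart type: a trend over time reads better as a line chart")
    return issues


def toy_improve(task: str, draft: Draft, issues: List[str]) -> Draft:
    fixed = dict(draft)
    for issue in issues:
        if issue.startswith("readability"):
            fixed["title"] = task
        elif issue.startswith("chart type"):
            fixed["chart"] = "line"
    return fixed


final, rounds = refine("Stock trends over the last year",
                       toy_generate, toy_critique, toy_improve)
```

Each round costs extra model calls, which is the expense mentioned above; the round limit keeps that cost bounded even when the critique never comes back clean.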

LLORENS: It is pretty fascinating that these models can be better at evaluating a candidate solution to a task than generating a novel solution to the task. On the other hand, maybe it’s not so surprising. One of the things that’s hard about or one of the things that can be challenging is this idea of, you know, prompt engineering, by which I’m trying to specify a task for the, for the model to solve or for the AI system to solve. But if you think about it, the best I can do at specifying the task is to actually try my best to complete the task. I’ve now specified the task to the greatest extent that I possibly can. So the machine kind of has my best task specification. With that, that information, now it becomes a kind of maybe even in some cases a superhuman evaluator. It’s doing better than I can at evaluating my own work. So that’s kind of an interesting twist there. Back, you know, back to the Orca paper, one of the things that you wouldn’t have seen … you know, earlier in the talk, you, you harkened back to say a decade ago, when benchmarks lasted a long, a longer time, one of the things that we would not necessarily have seen in a paper from that era, you know, say the CNN era of AI, is, is, a, is a safety evaluation, you know, for a specialized object recognition model. But in the Orca paper, we do have a safety evaluation. Can you, you talk a little bit about the thought process behind the particular evaluations that you did conduct and, and why these are necessary in the first place in this era of AI?

AWADALLAH: Yeah, I think in this era of AI, this is one of the most important parts of the development cycle of any LLM—large or small. And as we were just describing, we are discovering abilities of these models as we go. So just as there will be a lot of emerging capabilities that are surprising and useful and interesting, this would also open the door to a lot of misuse. And safety evaluation is at least … is the least we can do in order to make sure that we understand how, how can this model be used and what are some of the possible harms or the possible misuses that can come from using these models? So I think, I think this is, this is now definitely should be a standard for any work on language models. And here we are not, we’re not really training a language model from scratch. This is more of like a post-training or a fine-tuning of an existing language model. But even for, for, for research like that, I think safety evaluation should be a critical component of that. And, yes, we did some, and we, we, we actually have a couple of paragraphs in the paper where we say we need to do a lot more, and we are doing a lot more of that right now. I think … what we did in the paper that … we focused on only two dimensions: truthfulness and toxicity. And we were basically trying to make sure that we are trying to see the additional fine-tuning and training that we do, is it improving the model across these dimensions or is it not? And the good news that it was actually improving it in both dimensions, at least with the benchmarks that we have tried. I, I think it was interesting that actually on the, on the toxicity aspect in particular, we found that this particular type of post-training is actually improving the base model in terms of its tendency to generate toxic or biased content. 
But I think a big part of that is that we, we’re using Azure APIs in part of the data cleaning and data processing, and Azure has invested a lot of time and effort in making sure that we have a lot of tools and classifiers for identifying unsafe content, so the training data, the post-training data, benefited from that, which ended up helping the model, as well. But to your point, I think this is a critical component that should go into any work related to pretraining or post-training or even fine-tuning in many cases. And we did some in the paper, but I think, I think there’s a lot more to be done there.

LLORENS: Can you talk a little bit more about post-training as distinct from pretraining? How that, how that process has evolved, and, and where you see it going from here?

AWADALLAH: I, I, I see a ton of potential and, and opportunity there actually. And pretraining is the traditional language model training as we have always done it. Surprisingly, actually, if you go back to … like I, I was … in, in one of the talks, I was showing like a 20-years-ago paper by Bengio et al. doing the language model training with neural networks, and we’re still training neural networks the same way, autoregressive next word prediction. Very different architecture, a lot of detail that goes into the training process, but we are still training them as a language model to predict the next word. In a big departure from that—and it started with the InstructGPT paper and then a lot of other work had followed—there was this introduction of other steps of the language model training process. The first step is instruction tuning, which is showing the model prompts and responses and asking it to … and training the model on these prompts and responses. Often these responses are originated by a human. So you are not just training the model to learn the language model criteria only anymore, you are actually training it to respond to a way the human would want it to respond. And this was very interesting because you could see that the language models are really very good text-completion engines. And at some time actually, a lot of folks were working on framing the task such that it looks like this text completion. So if you are doing classification, you would basically list your input and then ask a question where the completion of that question would be the class that you are looking for. But then the community started figuring out that you can actually introduce this additional step of instruction tuning, where now out of all the possible ways of completing a sentence like if I’m asking a question, maybe listing other similar questions is a very good way of completion. 
Maybe repeating that question with more details is another way of completion, or answering the question is a third way of completion, and all of them could be highly probable. The instruction tuning is basically teaching the model the way to respond, and a big part of that has to do with safety, as well, because you could demonstrate how we want the model to be helpful, how we want the model to be harmless, in this instruction-tuning step. But the instruction tuning step is only showing the model what to do. It’s not showing it what not to do. And this is where the RLHF step came in, the reinforcement learning from human feedback. What’s happening really is that instead of showing the model a single answer, we’re showing them a little more than one answer. And we are basically showing them only a preference. We’re basically telling the model Answer A is better than Answer B. It could be better for many reasons. We are just encoding our criteria of better into these annotations, and we are training a reward model first that basically its job is, given any response, would assign a scalar value to it on how good it is. And then we are doing the RLHF training loop, where the reward model is used to update the original model such that it learns what are better responses or not or worse responses and tries to align more with the better responses. The post-training is, as a concept, is very related and, and sometimes referred to also as alignment, because the way post-training has been mostly used is to align the model to human values, whether this be being helpful or being harmless.
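As a rough illustration of how an "Answer A is better than Answer B" annotation can train a reward model that outputs a scalar, here is a sketch of the pairwise preference loss commonly used for this purpose, the negative log-sigmoid of the reward difference. The conversation does not say which loss is used, so this choice is an assumption, and `preference_loss` is a hypothetical helper name.

```python
import math


def preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(r_preferred - r_rejected)).

    The loss is small when the reward model scores the preferred answer
    well above the rejected one, and large when the order is reversed.
    """
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); this form avoids overflow
    # for large negative margins.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Scalar scores a reward model might assign to two answers (made-up numbers).
loss_correct_order = preference_loss(reward_preferred=2.0, reward_rejected=0.0)
loss_wrong_order = preference_loss(reward_preferred=0.0, reward_rejected=2.0)
```

Minimizing this loss over many annotated pairs pushes the scalar for preferred answers above the scalar for rejected ones, and those scalars are what the later reinforcement learning step optimizes against.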

LLORENS: Ahmed, as we, as we wrap up here, typically, I would ask something like, you know, what’s next for your research, and maybe you can tell us a little bit about what’s next for your research. [LAUGHS] But, but before you do that, I’d love to understand what, what key limitation you see in the current era of AI that you would … would be on your wish list, right, as something that maybe you and your team or maybe the broader field has accomplished in the next five years. What, what new capabilities would be on your wish list for AI over the next five years?

AWADALLAH: Yeah, given, given the progress, I would say even much shorter than five years.

LLORENS: Five months. [LAUGHS]

AWADALLAH: But I would say … actually the answer to the two questions are, are very similar. Actually, I think where we are with these models right now is much better than many people anticipated, and we are able to solve problems that we didn’t think we could solve before. One of the key capabilities that I would like to see getting better over the next few months to a few years—hopefully more toward few months—is the ability of the model to continue to learn. This like continual learning loop where the model is learning as it interacts with the humans. The model is reflecting on past experiences and getting better as we use it, and maybe also getting better in an adaptive way. Like we sometimes use this term adaptive alignment, where we are basically saying we want the model actually to continue to align and continue to align in the way it behaves across multiple dimensions. Like maybe the model will get more personal as I use it, and it will start acting more and, and behaving more in a way I want it to be. Or maybe I am developing a particular application, and for that application, I want the model to be a lot more creative or I want the model to be a lot more grounded. We can do some of that with prompting right now, but I think having more progress along this notion of continual learning, lifelong learning … this has been a heavily studied subject in machine learning in general and has been the holy grail of machine learning for many, many, many years. Having a model that’s able to continue to learn, continue to adapt, gets better every time you use it, so just when I use it today and I interact with it and it could learn about my preferences, and next time along, I don’t have to state these preferences again. Or maybe when it makes a mistake and I provide a feedback, next time along, it already knows that it had made that mistake and it already gives me a better solution.

LLORENS: That should have been the last question. But I think I have one more. That is, how will we know that the models are getting better at that, right? That’s a metric that’s sort of driven by interaction versus, you know, static evaluation. So how do you, how do you measure progress in adaptive alignment that way?

AWADALLAH: I think, I think that’s a very interesting point. And this actually ties this back with two concepts that we brought up earlier: the evaluation side and the safety side. Because from the evaluation perspective, I do think we need to move beyond static benchmark evaluation to a more dynamic human-in-the-loop evaluation, and there’s already been attempts and progress at that just over the past few months, and there is still a lot more to do there. The evaluation criteria will not also be universal. Like there will be a lot … like a lot of people talk about the, let’s say, fabrications—the models making up information, facts. Well, if I am using the model to help me write fictional stories, like this becomes a feature; it’s not a bug. But if I’m using the model to ask questions, especially in the high-stakes scenario, it becomes a very big problem. So having a way of evaluating these models that are dynamic, that are human-in-the-loop, that are adaptive, that aligns with objectives of how we are using the models will be a very important research area, and that ties back to the safety angles, as well, because if I … if we are barely … we’re, we’re … everybody is working really hard to try to understand the safety of the models after the models are being trained and they are fixed. But what if the models continue to improve? What if it’s continuing to learn? What if it’s learning things from me that are different than what it’s learning from you? Then that notion of alignment and safety and evaluation of that becomes also a very open and interesting question.

LLORENS: Well, look, I love the ambition there, Ahmed, and thanks for a fascinating discussion.

AWADALLAH: Thank you so much, Ashley.

The post AI Frontiers: The future of scale with Ahmed Awadallah and Ashley Llorens appeared first on Microsoft Research.
