GeForce NOW Brings the Heat With ‘World of Warcraft’

World of Warcraft comes to the cloud as part of the 17 games joining the GeForce NOW library, with seven available to stream this week.

Plus, it’s time to get rewarded. Get a free in-game mount in The Elder Scrolls Online starting today by opting into GeForce NOW’s Rewards program.

Heroes Rise to the Cloud

Dive into the immersive realms of World of Warcraft, including the latest expansion Dragonflight, the nostalgic journey of World of Warcraft Classic and the recently launched World of Warcraft Cataclysm Classic. These popular, massively multiplayer, online role-playing experiences from Blizzard Entertainment immerse players in legendary battles.

World of Warcraft: Dragonflight on GeForce NOW
Dragonriders fly best in the cloud.

Embark on a journey of endless adventure in the rich and dynamic universe of Azeroth in the latest modern expansion, World of Warcraft: Dragonflight. The expansive landscapes of the Dragon Isles are available to explore — even on the back of a fearsome dragon. The newly awakened Dracthyr Evokers, World of Warcraft’s first-ever playable race-and-class combo, are also available. GeForce NOW Priority and Ultimate members can get immersed in the cinematic gameplay with support for RTX ON.

World of Warcraft Cataclysm Classic on GeForce NOW
Witness the return of Deathwing.

Face the return of Deathwing the Destroyer, whose violent emergence shatters and reshapes the continent of Azeroth. Journey into an era of fire and destruction in World of Warcraft Cataclysm Classic and usher in a new era for Azeroth. The updated game brings new dungeons and raids, fresh race and class combinations, and more.

World of Warcraft Classic on GeForce NOW
Azeroth awaits.

Whether a seasoned adventurer or a newcomer to the game, head to the Azeroth of yesteryear in World of Warcraft Classic and relive the experience of the game as it was upon its initial launch, with a few new upgrades. Explore the Eastern Kingdoms and Kalimdor, venture into iconic dungeons or engage in legendary player-vs-player battles.

Experience it all with a GeForce NOW membership, which means no waiting for downloads or games to update, even for the upcoming World of Warcraft expansion The War Within.

Mount Up

GeForce NOW members get access to rewards that enhance the gaming experience. This week, The Elder Scrolls Online’s 10-year celebration continues with an in-game reward for members.

New member reward on GeForce NOW
Manes flow freely in the cloud.

Mounts offer a great way to travel the world and provide a completely different experience to traveling on foot. This new free reward provides members with a trusty companion beyond the starter option. The mount has a sunny disposition, matching its vibrant, multihued coat. It’s an excellent horse for a new rider or one who regularly ventures into treacherous situations.

Members can claim the free mount by opting into rewards and checking their email for instructions on how to redeem. Ultimate and Priority members can redeem starting today, while free members will be able to claim it starting May 31. It’s available until June 30, first come, first served.

New Games, Assemble!

Capes on GeForce NOW
Turn-based strategy with a superhero twist.

Build a team of heroes and fight to take back the city in Capes, a turn-based strategy game from Daedalic Entertainment. Recruit, train and deploy heroes to free the city from the villains that hold it hostage. Level up heroes to gain access to new abilities and powerful upgrades — plus, each hero gains a unique team-up ability from each of their allies.

Check out the full list of new games this week:

  • The Rogue Prince of Persia (New release on Steam, May 27)
  • Capes (New release on Steam, May 29)
  • Lords of the Fallen (New release on Xbox, available on PC Game Pass, May 30)
  • Soulmask (New release on Steam, May 31)
  • Path of Exile (Steam)
  • World of Warcraft: Dragonflight (Battle.net)
  • World of Warcraft Classic (Battle.net)
  • World of Warcraft Cataclysm Classic (Battle.net)

And members can look for the following later this month:

  • Autopsy Simulator (New release on Steam, June 6)
  • Chornobyl Liquidators (New release on Steam, June 6)
  • SunnySide (New release on Steam, June 14)
  • Still Wakes the Deep (New release on Steam and Xbox, available on PC Game Pass, June 18)
  • Disney Speedstorm (Steam and Xbox, available on PC Game Pass)
  • Farm Together 2 (Steam)
  • Resident Evil Village (Steam)
  • Star Traders: Frontiers (Steam)
  • Street Fighter 6 (Steam)
  • Torque Drift 2 (Epic Games Store)

More to May

In addition to the 24 games announced last month, four more joined the GeForce NOW library:

  • Senua’s Saga: Hellblade II (New release on Steam and Xbox, available on PC Game Pass, May 21)
  • Serum (New release on Steam, May 23)
  • Palworld (Steam and Xbox, available on PC Game Pass)
  • Tomb Raider: Definitive Edition (Xbox, available on PC Game Pass)

Gestalt, Norland and Sunnyside have delayed their launch dates to later this year. Stay tuned to GFN Thursday for updates.

From Tamriel to Teyvat, Night City to Sanctuary, GeForce NOW brings the world of PC gaming to nearly any device. Share your favorite gaming destinations all month long using #GreetingsFromGFN for a chance to be featured on the @NVIDIAGFN channels.

What are you planning to play this weekend? Let us know on X or in the comments below.

https://x.com/NVIDIAGFN/status/1795847572793274591

 

What’s Your Story: Weishung Liu

In the Microsoft Research Podcast series What’s Your Story, Johannes Gehrke explores the who behind the technical and scientific advancements helping to reshape the world. A systems expert whose 10 years with Microsoft spans research and product, Gehrke talks to members of the company’s research community about what motivates their work and how they got where they are today.

In this episode, Gehrke is joined by Principal PM Manager Weishung Liu. Liu brings product development and management expertise honed at companies such as Disney, Fluke, and SpaceX to her role at Microsoft, where she helped develop the real-time video analytics platform Watch For and today empowers teams within Microsoft Research to maximize their reach. She talks about how being more homebound as a child cultivated the love of people and stories that underlies her professional pursuits and how she landed in tech despite efforts to “rebel” against the expectations that come with growing up in Silicon Valley.

Photos of Weishung Liu, Principal PM Manager, throughout her life.

Transcript

[SPOT]

WEISHUNG LIU: Hey, listeners. I’m Weishung Liu, principal PM manager with Microsoft Research and today’s podcast guest. Before we get started, I want to tell you about Microsoft Research Forum. It’s a series of discussions and talks examining how the rapid advances in AI are impacting science and technology research. The next episode is June 4, and colleagues of mine from around Microsoft Research are participating. I highly recommend checking it out. You can learn more and register now at aka.ms/MyResearchForum. All right, here’s today’s show …

[END OF SPOT] [TEASER] 

[MUSIC PLAYS UNDER DIALOGUE] 

WEISHUNG LIU: I’ve always felt like I want the things that I work on to create joy in people. … The fact that I can still be here and create impact and do meaningful work and, you know, work on things that create joy and positively impact society, it speaks to me like stories speak to me.

[TEASER ENDS]

JOHANNES GEHRKE: Microsoft Research works at the cutting edge. But how much do we know about the people behind the science and technology that we create? This is What’s Your Story, and I’m Johannes Gehrke. In my 10 years with Microsoft, across product and research, I’ve been continuously excited and inspired by the people I work with, and I’m curious about how they became the talented and passionate people they are today. So I sat down with some of them. Now, I’m sharing their stories with you. In this podcast series, you’ll hear from them about how they grew up, the critical choices that shaped their lives, and their advice to others looking to carve a similar path.

[MUSIC FADES]


In this episode, I’m talking with Principal PM Manager Weishung Liu. Wei has used her love of storytelling and interest in people and their motivations to deliver meaningful products and customer experiences. This includes the creation of a successful line of Disney plush toys and contributions to the satellite internet system Starlink. With Microsoft, she helped develop Watch For, a real-time video analytics platform that has gone on to enhance gaming via streaming highlights and to support content moderation in products such as Xbox. Today, she’s facilitating connections and devising strategies to empower teams within Microsoft Research to maximize their reach. Here’s my conversation with Wei, beginning with her childhood in Silicon Valley.

JOHANNES GEHRKE: Hi, Wei. Welcome to What’s Your Story. You’re our principal PM manager here in the lab, and we’ll talk in a little while about, you know, what you’re doing here right now, but maybe let’s start with, how did you actually end up in tech? Where did you grow up?

WEISHUNG LIU: Oh, wow. OK. So this is a very long, long and, like, nonlinear story about how I got into tech. So I grew up in Silicon Valley, which one would assume means just, like, oh, yes, you grew up in Silicon Valley; therefore, you must be in the STEM field, and therefore, you will be in tech for the rest of your life.

GEHRKE: Yep, that’s, sort of, a too familiar a story.

LIU: That’s a very linear story. And I totally actually wanted to rebel against that whole notion of going into tech. So I grew up in Silicon Valley and thought, like, man, I want to not do STEM.

GEHRKE: So did your parents want you to be either a doctor or engineer? Is that the … ?

LIU: Absolutely. It was either a doctor, engineer, or lawyer. So thankfully my sister went the PhD in psychology route, so she, kind of, checked that box for us. And so I was a little bit more free to pursue my very, very, very wide variety of interests. So a little bit of personal information about me. So I grew up a very sick child, and so I was hospitalized a lot. I was in the ER a lot. But that actually afforded me a lot of opportunities to be, sort of, an indoor-only child of reading and playing video games and all sorts of things that I would say, like, expanded my worldview. Like, it was just all sorts of different stories. Like, reading has stories; video games have stories.

GEHRKE: Tell us a story about reading and a story about video games. What …

LIU: Oh my goodness …

GEHRKE: … were your favorite set of books?

LIU: I was really interested in, like, historical fiction at the time. One book that I remember reading about—oh my gosh, it’s a very famous book, and I don’t remember the name anymore. However, it was about a young girl’s perspective of being, living in an internment camp, the Japanese internment camps, back during World War II, I believe, after Pearl Harbor.[1] And it was just kind of her diary and her perspective. It was almost like Diary of Anne Frank but from a Japanese American girl’s perspective instead. And I just loved, kind of, reading about different viewpoints and different eras and trying to understand, like, where do we overlap, how do things change over time, how does history repeat itself in some ways? And, and I love that. And then video games. So I was really into Japanese RPGs back in the day. So it’s funny. I started … my first console was a Mattel Intellivision II, and then it gradually went up to like Nintendo, Super Nintendo, all those, all those consoles. But I had a friend who I used to play RPGs with …

GEHRKE: So these were network RPGs or individual RPGs?

LIU: These were individual RPGs. This is, you know, when I was around 10, the internet appeared, so it probably dates me a little bit. Every time a new RPG came out like by—the company is now called Square Enix but back then it was called SquareSoft—or Nintendo like Zelda, he and I would immediately go out and buy the game or, you know, convince our parents at the time to buy the game, and then we would compete. So, like, this is not couch co-op; he was actually in Texas.

GEHRKE: Like long-distance co-op?

LIU: This is long-distance, long-distance gaming where we would compete to see who would beat the game first.

GEHRKE: Wow.

LIU: No, you’re not allowed to use walkthroughs. And he almost always beat me.

GEHRKE: But these games are like 60-hour, 80-hour games?

LIU: Yeah, like 60- or 80-hour games, but, like, you know, we got so good at them that, well, you had to figure out like how do you, kind of, bypass and get through the main quest as fast as possible. So that was always—

GEHRKE: So any of the side quests and things like that just … ?

LIU: Yeah, oh, yeah, no. So I’m actually a huge completionist, though, so I’d always go back after and do all the side quests to get, you know, we’ll just say “100 percent” achievement. I’m a little bit of an achievement machine that way. But so, like, that kind of stuff was always super fun for me. And so I spent so much of my time then—because I was, kind of, more homebound a lot—just exploring and being curious about things. And, and that got me into art and into design, and I thought, man, I’m going to be an architect someday because I love designing experiences, like spaces for people.

GEHRKE: You thought at that point in time like a real, like a building architect or an architect for like virtual worlds or so … ?

LIU: No, real, like a real physical space that people inhabit and experience. And so, like, I avoided as much STEM as I could in school. I couldn’t, just due to where I lived and grew up and the high school requirements that I had. But the minute I went to college, which happened to be at the University of Washington, which has a great architecture program, I was like, I’m never going to take another STEM class in my life.

GEHRKE: So you enrolled as an architecture major?

LIU: I enrolled as an architecture major, and I was like, I will do what we would call the “natural world” credits, which is kind of the STEM-like things. But I would intentionally find things that were not, like, hard science because I’m like, I’m never going to do this again. I’m never going to be in tech. All these people that are so obsessed with tech who, you know, went to MIT and Stanford, and I’m like, no, no, no, I’m going to be an architecture major.

GEHRKE: So you took, like, the physics for poets class or so …?

LIU: Stuff like that, right. [LAUGHS] Very, very similar. But I ended up just loving learning at school, which is very unsurprising. You know, I took, like, an Arabic poetry class. I took a French fairy tales class. And I just, kind of, explored college and all the things that it had to offer in terms of academics so much that I actually ended up deciding to get two degrees: one in industrial design, which is not too far away from architecture. Architecture is like with large spaces, like you build one building or design one building that lasts maybe 100 years. Industrial design, I, kind of, joke about it. It’s, you know, you design smaller form factors that sometimes, if they’re manufactured with plastics, last millions of years, [LAUGHS] and you build millions of them. But then I also ended up getting a degree in comparative religion, as well. Which it meant that, like, my schooling and my class schedules are always a little bit odd because I’d go from, you know, like, the industrial design shop down in our design building and like making things with my hands and working at the bandsaw, and then I’d, you know, rush to this other class where we have like very fascinating philosophical debates about various things in, sort of, the comparative religion space. And I’d write, you know, 10-page essays and … about all sorts of things. And, you know, there’s, like, the study of death is a great example and how different cultures react to death. But, you know, that was as far away from STEM [LAUGHS] as I could have possibly gone.

GEHRKE: Right. I was just thinking, can you maybe explain to our listeners a little bit who may come a little bit more from the STEM field traditionally, what do you study in comparative [religion], and what is the field like?

LIU: So for me, it was really just, like, I took a lot of classes just trying to understand people. I really … and it sounds, kind of, silly to say it that way, but religion is really formed and shaped by people. And so for me, like, the types of classes that I took were, sort of, like studying Western religion, studying Eastern religion, studying the philosophy of religion, like or even—and this still, I still think about it from time to time—how do you define religion? And just even … there’s still so many scholarly debates about how to define, like, what is a “pure” definition of religion, and nobody can really still identify that yet. Is it, you know, because then there’s this distinction of spiritualism and being religious versus something else or just completely made-up, you know, pseudoscience, whatever, right. People have this wide spectrum of things that they describe. But it’s really around learning about the different foundations of religion. And then people tend to specialize. You know, they might specialize in a particular area like Hinduism or, you know, broadly speaking, Eastern religions, or people will, you know, start focusing on Western religions. Or sometimes I think about a specific topic like the intersection of, for example, religion and death or religion and art or even, you know, religion and violence. And there’s a broad spectrum of things that people start specializing in. And it’s very, it’s, sort of, very much in the mind but very much in the heart of how you understand that.

GEHRKE: Yeah, I can see how it even connects to industrial design because there you also want to capture the heart …

LIU: Yes.

GEHRKE: … the hearts of people, right.

LIU: Yep. And that’s kind of how I, how I describe, you know, when people are like, why did you major in that? Like, what do you even do with that? Did you even think about what career you would have with that? I’m like, no, I just really wanted to learn, and I really wanted to understand people. And I felt like religion is one way to understand, sort of, like, sociologically how people think and get into that deep, like, that deep feeling of faith and where does it come from and how does it manifest and how does it motivate people to do things in life. And to your point, it’s very similar to industrial design because you’re, you know, we talk about design thinking and you have to really deeply understand the user and the people that you’re designing for in order to create something that really lasts, that matters to them. So that’s, kind of, my, at least my undergrad experience. And in a very, very brief way, I’ll just kind of walk through or at least tell you the very nonlinear path that I took to get to where I am here now at Microsoft Research. So like the day after I graduated from the University of Washington, I moved to Florida.

GEHRKE: And just as a question: so you graduated from the University of Washington—did you have like a plan, you know, this is like the career I want to have?

LIU: Oh no! So here’s the funny thing about design, and I hope that, you know, my other, the designers who might be watching or listening [LAUGHS] to this might not get upset—hopefully don’t get upset with me about this—is I love the design thinking aspect of design, like understanding why people do the things they do, what types of habits can you build with the products—physical products? I was very obsessed with physical, tangible things at the time. And then I learned through, like, internships and talking to other designers who were, you know, already in the field that that’s not what they do. That they don’t go and like, oh, let’s go talk to people and understand deeply what they do. Like, there’s other people that do that. OK, well, what do you do? Well, I work in, you know, CAD, or I work on SolidWorks, or I do Rhino, and I do surfacing. I’m like, OK, what else? Who decides what gets made? Oh, that’s like, you know, a product manager or product—oh, what’s that? Who? What? What does that even mean? Like, tell me more about that.

GEHRKE: So it’s like the dichotomy that you see even here in the company where the engineers have to, sort of, build the things, but the product managers are …

LIU: But someone else is …

GEHRKE: … in the middle

LIU: … someone else is, kind of, interpreting what the market and the users are saying, what the business is saying. And I was like, I like doing that because that’s more about understanding people and the business and the reason—the why. And so …

GEHRKE: Just before you go to your career, I mean, I must … I have to ask, what are some of the favorite things that you built during your undergrad? Because you said you really like to build physical things.

LIU: Oh my gosh!

GEHRKE: Maybe one or two things that you actually built …

LIU: Yeah …

GEHRKE: … that was, sort of, so fun.

LIU: So one of my projects was actually a Microsoft-sponsored project for one quarter, and all they showed up with—his name’s Steve Kaneko. He retired not too long ago from here. Steve showed up and said, I want you all to design a memory-sharing device.

GEHRKE: Interesting …

LIU: And that was it.

GEHRKE: So what is memory sharing? He didn’t define what that means?

LIU: He didn’t define it because as designers, that was our way of interpret—we had to interpret and understand what that meant for ourselves. And it was a very, very free-form exploration. And I thought … the place that I started from was … at the time, I was like, there’s like 6 or 7 billion people in the world. How many of them do I actually know? And then how many of them do I actually want to know or maybe I want to know better?

GEHRKE: To share a memory with …

LIU: To share my memories with, to share a part of me. Like, memories are …

GEHRKE: Pretty personal.

LIU: … who we are—or not who we are but parts of who we are—and drive who we become in some ways. And so I thought, you know, what would be cool is if you had a bracelet, and the bracelet were individual links, and each individual link was a photo, like a digital photo, very tiny digital photo, of something that you chose to share. And so, you know, I designed something at the time … like, the story I told was, like, well, you know, this woman who’s young decided to go to, you know, she’s taking the bus, and she put on her, like, “I wish to go to Paris” kind of theme, right. So she had a bunch of Parisian-looking things or something in that vein, right. And, you know, she gets on the bus and her bracelet vibrates. There’s, like, a haptic reaction from this bracelet. And that means that there’s someone else on the bus with this, you know, with a bracelet with their memories. It’s kind of an indicator that people want to share their stories with someone else. And, you know, wouldn’t it be great if, you know, this woman now sits down on the bus, because she sits next to the person who’s wearing it. Turns out to be an elderly woman who’s wearing, coincidentally, you know, her Paris bracelet, but it’s of her honeymoon of her deceased husband from many years ago. And, you know, like, think of the power of the stories that they could share with each other. That, you know, this woman, elderly woman, can share with, you know, this younger woman, who has aspirations to go, and the memories and the relationship that they can build from that. And so that was, kind of, my memory-sharing device at the time.

GEHRKE: I mean, it’s super interesting because, I mean, the way I think about this is that we have memory-sharing applications now like Facebook and Instagram and TikTok and so on, but they, the algorithm decides really …

LIU: Yes …

GEHRKE: … who to share it with and where and why to share it. Whereas here, it’s proximity, right? It somehow leads to this physical and personal connection afterwards, right? The connection is not like, OK, suddenly on my bracelet, her stories show up …

LIU: Yes …

GEHRKE: … but, you know, maybe we sit next to each other on the bus, and it vibrates, and then we start a conversation.

LIU: Exactly. It’s you own, you know, whatever content is on that you choose to have on your physical person, but you’re sharing yourself in a different way, and you’re sharing your memories and you’re sharing a moment. And it might just be a moment in time, right. It doesn’t have to be a long-lasting thing. That, you know, this elderly woman can say, hey, there’s this really great bistro that we tried on, you know, this particular street, and I hope it’s still there, because if you go, ask for this person or try this thing out and, like, what an incredible opportunity it is for this other woman, who, you know, maybe she does someday go to Paris and she does find it. And she thinks of that time, like, how grateful she was to have met, you know, this woman on the bus. And just for that brief whatever bus … however long that bus ride was, to have that connection, to learn something new about someone else, to share and receive a part of somebody else who you may never have known otherwise. And then that was, that was what I was thinking of, you know, in terms of a memory-sharing device was memory creates connections or it reinforces connections. So I guess very similarly to my people thing and being fascinated by people, like, this was my way of trying to connect people in a different way, in the space that they inhabit and not necessarily on their devices.

GEHRKE: And then what did Microsoft say to that? Was there like an end-of-quarter presentation?

LIU: Oh, yeah! There was a, there was a, you know, big old presentation. I can’t even remember which building we were at, but I think everybody was just like, wow, this is great. And that was it. [LAUGHTER]

GEHRKE: And that was it. It sounds like a really fascinating device.

LIU: Yeah, it was. And lots of people came up with all sorts of really cool things because everybody interpreted the, I’ll just say, the prompt differently, right.

GEHRKE: Right …

LIU: … And that was my interpretation of the prompt at the time.

GEHRKE: Well, super interesting.

LIU: Yeah.

GEHRKE: Coming back to, so OK, so you’ve done just a bunch of really amazing projects. You, sort of, it seems like you literally lived the notion of liberal education.

LIU: I did. I, like, even now I just love learning. I get my hands on all sorts of weird things. I picked up whittling as a random example.

GEHRKE: What is whittling? Do I even know what that is? [LAUGHS]

LIU: So whittling is basically carving shapes into wood. So … I’m also very accident prone, so there’s, like, lots of gloves I had to wear to protect my hands. But, you know, it was like, oh, I really just want to pick up whittling. And I literally did, you know. You can grab a stick and you can actually buy balsa wood that’s in a, in decent shape. But you can just start carving away at whatever … whatever you would like to form that piece of wood into, it can become that. So I made a cat, and then I made what I jokingly refer to as my fidget toy at home. It’s just a very smooth object. [LAUGHS]

GEHRKE: That you can hold and …

LIU: I just made it very round and smooth and you can just, kind of, like, rub it, and yeah, it’s …

GEHRKE: Super interesting.

LIU: … it’s … I pick up a lot of random things because it’s just fascinating to me. I learned a bunch of languages when I was in school. I learned Coptic when I was in school for no other reason than, hey, that sounds cool; you can read the Dead Sea Scrolls [LAUGHS] when you learn Coptic—OK!

GEHRKE: Wow. And so much, so important in today’s world, right, which is moving so fast, is a love for learning. And then especially directed in some areas.

LIU: Yeah.

GEHRKE: You know, that’s just really an awesome skill.

LIU: Yeah.

GEHRKE: And so you just graduated. You said you moved to Florida.

LIU: Oh, yes, yes. Yes. So, so about a month before this happened, right—it didn’t just spontaneously happen. A month before, I had a good friend from the architecture program who had said, hey, Wei, you know, I’m applying for this role in guest services at Disney. I was like, really? You can do that? And she’s like, yeah, yeah, yeah. So I was like, that sounds really cool. And I, you know, went to, like, the Disney careers site. I’m like one month or two months away from graduating. Still, like, not sure what I’m totally going to do because at that point, I’m like, I don’t think I want to be a designer because I don’t—the part that I love about it, the part that I have passion about, is not in the actual design of the object, but it’s about the understanding of why it needs to exist.

GEHRKE: The interconnection between the people and the design.

LIU: The people and the design, exactly. And so when I found, I found this, like, product development internship opportunity, and I was like, what does that even mean? That sounds cool. I get to …

GEHRKE: At Disney?

LIU: At Disney. And it was, like—and Disney’s tagline, the theme park merchandise’s tagline, was “creating tangible memories.” I was like, oh boy, this just checks all the boxes. So I applied, I interviewed, did a phone interview, and they hired me within 24 hours. They were like, we would like you to come. And I was like, I would absolutely love to move to Florida and work there. So, yeah, the day after I graduated from U-Dub, I drove all the way across the country from Seattle.

GEHRKE: You drove?

LIU: From Seattle with two cats.

GEHRKE: That must have been an interesting adventure by itself.

LIU: Oh, yes. With two cats in the car, let me tell you, it was fascinating. All the way to Florida, Orlando, Florida. And the day that I got there or, no, two days after I got there, I found out that I was going to be working in the toys area. So plush and dolls, which is, like, you can imagine just absolutely amazing. Making, like, stuffed toys that then—because my office was a mile down the road from Disney’s Animal Kingdom and therefore a couple miles away from Magic Kingdom or Hollywood Studios or EPCOT—I could actually go see, I’ll just say, the “fruits of my labor” instantly and not only that. See it bring joy to children.

GEHRKE: So what is the path? So you would design something, and how quickly would it then actually end up in the park? Or how did you, I mean, how did you start the job?

LIU: What did I do there? Yeah, yeah …

GEHRKE: Well, what’s the interface between the people and the design here?

LIU: Yeah … so, so, really, I didn’t actually do any design. There was an entire group called Disney Design Group that does all the designing there. And so what I did was I understood, what do we need to make and why? What memories are we—what tangible memories do we want to create for people? Why does it matter to them? In many ways, it’s, sort of, like, it’s still a business, right. You’re creating tangible memories to generate revenue and increase the bottom line for the company. But … so my role was to understand what trends were happening: what were the opportunities? What were guests doing in the parks? What types of things are guests looking for? What are we missing in our SKU lineup, or stock-keeping-unit lineup, and then in which merchandising areas do they need to happen? And so I, actually, as part of my internship, my manager said, hey, I let every intern every time they’re here come up with any idea they want, and you just have to see it from start to execution—in addition to all the other stuff that I worked on. I was like, sounds good. And I came up with this idea that I was like, you know, it would be cool … Uglydolls was really popular at the time. Designer toys were getting really popular from Kidrobot, which was kind of, like, there was this vinyl thing and you can—it was just decorative of all different art styles on the same canvas. And I was like, you know, what if we did that with Mickey, and then, you know, what if the story that we’re telling is, you know, just for the parks—Walt Disney World and Disneyland—that there were aliens or monsters coming to visit the park, but they wanted to blend in and fit in? Well, how would they do that? Well, they clearly see Mickey heads everywhere, and Mickey is very popular here clearly, and so they try to dress up like Mickey, but they don’t do it quite well. So they got the shape right, but everything else about them is a little bit different, and they all have their own unique personalities and …

GEHRKE: You can tell a story around them …

LIU: You can tell a story—see, it’s all about stories. And then it … I got buy-in from everybody there, like, all the way up to the VP. I had to get brand because I was messing with the brand icon. But, you know, it became an entire line called Mickey Monsters at Disney. I still have them all. There were two—then it went from plush; it became consumables, which are like edible things. It went into key chains. It went, it was super … it was … I probably went a little bit too hard, or I took the, I think, I took the assignment very seriously. [LAUGHS]

GEHRKE: Yep, yep. Well, it seemed to be a huge success, as well.

LIU: Yeah. It did really well in the time that it was there. We did a test, and I was really, really proud of it. But you know, my—what I did though is, you know, very concretely was I started with an idea. I, you know, convinced and aligned with lots of people in various disciplines that this is something that we should try and experiment on. You know, worked with the designers to really design what this could look like. You know, scoped out what types of fabrics because there’s all sorts of different textures out there. Working with, kind of, our sourcing team to understand, like, which vendors do we want to work with. And then typically, in the plush industry, manufacturing back in the day could happen—and in terms of supply chain, manufacturing, and then delivery of product—could take about six months.

GEHRKE: OK … 

LIU: And so when I was there, anything I worked on would, kind of, appear in six months, which is actually very cool. I mean, it’s not like software, where anything you work on is, you’re like boop, compile—oh look [there] it is. It depends on how fast your computer is. You know, it’s pretty instantaneous compared to six months to see the fruits of your labor. But it was a really, just such a great experience. And then seeing, you know, then going to the parks and seeing children with …

GEHRKE: Yeah, the stuff that you …

LIU: … the thing that I worked on, the thing that I had the idea on, and, like, them going like, Mom, I really want this.

GEHRKE: Right …

LIU: You know, we’re not really selling to the kids; we’re, kind of, selling to the parents.

GEHRKE: It’s a bit like this feeling that we can have here at Microsoft, right, if any of our ideas makes it into products …

LIU: Yup …

GEHRKE: … that are then used by 100 million people and hopefully bring them joy and connection.

LIU: Exactly. And that’s why, like, I just think Microsoft is great, because our portfolio is so broad, and so much of our work touches different parts of our lives. And I’ll even pick on, you know, like I have, you know, in my family, my daughter goes to school—clearly, obviously, she would go to school—but she used Flipgrid, now known as Flip, for a while. And I was like, hey, that’s cool. Like, she uses something that, you know, I don’t directly work on, but my company works on.

GEHRKE: Well, and you were involved with it through Watch For, right …

LIU: Yes, I was …

GEHRKE: … which did become the motivation for Flip.

LIU: Yep. Watch For, you know, helps to detect inappropriate content on Flip. And, you know, that’s super cool because now I’m like, oh, the work that I’m doing actually is directly impacting and helping people like my daughter and making a difference and, you know, keeping users safe from content that maybe we don’t want them to see. You know, other areas like Microsoft Word, I’m like, wow, this is a thing. Like, I’m at the company that makes the thing that I’ve used forever, and, you know, like, it’s just fascinating to see the types of things that we can touch here at Microsoft Research, for example. And how, you know, I, you know, Marie Kondo popularized the term “joy,” like, “sparking joy,” but …

GEHRKE: If you look at an item and if it doesn’t sparkle joy …

LIU: If it doesn’t spark joy, right …

GEHRKE: … then you know on which side it goes.

LIU: Exactly. But, but, you know, like, I’ve always felt like I want the things that I work on to create joy in people. And it was very obvious when you make toys that you see the joy on children’s faces with it. It’s a little bit different, but it’s so much more nuanced and rewarding when you also see, sort of, the products that, the types of things that we work on in research create joy. It’s, you know, it’s funny because I mentioned software is instantaneous in many ways, and then, you know, toys takes a little bit longer. But then, you know, in the types of research that we do, sometimes it takes a little bit longer than, a little bit longer [LAUGHS] …

GEHRKE: It takes years sometimes!

LIU: … than six months. Years to pay off. But, like, that return on that investment is so worth it. And, you know, I see that in, kind of, the work that lots of folks around MSR [Microsoft Research] do today. And knowing that even, sort of, the circles that I hang out in now do such crazy, cool, impactful things that help benefit the world. And, you know, it’s funny, like, never say never. I’m in tech and I love it, and I don’t have a STEM background. I didn’t get a STEM background. I didn’t get it, well, I don’t have a STEM degree. Like, I did not go—like, I can’t code my way out of a paper bag. But the fact that I can still be here and create impact and do meaningful work and, you know, work on things that create joy and positively impact society is, like, it speaks to me like stories speak to me.

GEHRKE: I mean, there’s so many elements that come together in what you’re saying. I mean, research is not a game of the person sitting in the lowly corner on her whiteboard, right? But it’s a team sport.

LIU: Yep.

GEHRKE: It requires many different people with many different skills, right? It requires the spark of ingenuity. It requires, you know, the deep scientific insight. It requires then the scaling and engineering. It requires the PM, right, to make actually the connection to the value, and the execution then requires the designer to actually create that joy with the user interface to seeing how it actually fits.

LIU: Exactly. And it’s fascinating that we sometimes talk about research being like a lonely journey. It can be, but it can also be such an empowering collaborative journey that you can build such incredible cool things when you bring people together—cross-disciplinary people together—to dream bigger and dream about new ideas and new ways of thinking. And, like, that’s why I also love talking to researchers here because they all have such unique perspectives and inner worlds and lives that are frankly so different from my own. And I think when they encounter me, they’re like, she’s very different from us, too.

GEHRKE: But I think these differences are our superpower, right, because …

LIU: Exactly. And that’s what brings us together.

GEHRKE: … they have to be bridged and that brings us together. Exactly. So how, I mean, if you think about Microsoft Research as over here. You’re here in Disney in Florida?

LIU: Yes, yes, yes. So …

GEHRKE: You had quite a few stops along the way.

LIU: I did have a lot of stops along the way.

GEHRKE: And very nonlinear also?

LIU: It was also very nonlinear. So Disney took me to the third, at the time, the third-largest toy company in the US, called JAKKS Pacific, where I worked on again, sort of, Disney-licensed and Mattel-licensed products, so “dress up and role play” toys is what we refer to them as. “Dress up” meaning, like, if you go to your local Target or Walmart or whatever, kind of, large store, they will have in their toy sections like dresses for Disney princesses, for example, or Disney fairies. Like, I worked on stuff like that, which is also very cool because, you know, usually around Halloween time here in the US is when I’m like, hey, I know that. And then that, kind of, took me to a video game accessory organization here in Woodinville.

GEHRKE: There’s the connection to tech starting to appear.

LIU: There’s a little bit connection of tech where I was like, I love video games! And I got to work on audio products there, as well, like headphones. And it was the first time I started working on things that, I’ll just say, had electrons running through them. So I had already worked on things that were, like, both soft lines—we refer to a soft line as bags and things that require, like, fabrics and textiles—and then I worked on hard lines, which were things that are more, things that are more physically rigid, like plastics. And so I was like, OK, well, I’ve worked on hard-lines-like stuff, and now I’m going to work on hard lines with electrons running through them. That’s kind of neat. And I learned all sorts of things about electricity. I was like, oh, this is weird and fascinating and circuits and … . And then I was like, well, this is cool, but … what else is there? And it took me to not a very well-known company in some circles, but a company called Fluke Corporation. Fluke is best known for its digital multimeters, and I worked there on their thermal imaging cameras. So it’s, for people who don’t know, it’s kind of like Predator vision. You can see what’s hot; you can see what’s not. It’s very cool. And Fluke spoke to me because their, you know, not only is their tagline “they keep your world up and running”; a lot of the things that Fluke does, especially when I heard stories from, like, electricians and technicians who use Fluke products, are like, this Fluke saved my life. I’m like, it did? What? And they’re like, you know, I was in a high-voltage situation, and I just wasn’t paying attention. I, you know, didn’t ground properly. And then there was an incident. But, you know, my multimeter survived, and more importantly, I survived. And you’re like, wow, like, that’s, that’s really cool. And so while I was at Fluke, they asked me if I wanted to work on a new IoT project. And I was like, I don’t even know what IoT is. “Internet of Things” … like, OK, well, you said “things” to me, and I like things. I like tangible things. Tell me more. And so that was, kind of, my first foray into things that had … of products with electrons on them with user interfaces and then also with software, like pure software, that were running on devices like your smartphones or your tablets or your computers. And so I started learning more about like, oh, what does software development look like? Oh, it’s a lot faster than hardware development. It’s kind of neat. And then that took me to SpaceX, of all places. It was super weird. Like, SpaceX was like, hey, do you want to come work in software here? I was like, but I’m not a rocket scientist. They’re like, you don’t need to be. I was like, huh, OK. And so I worked on Starlink before Starlink was a real thing. I worked on, kind of, the back-office systems for the ISP. I also worked on what we would refer to as our enterprise resource planning system that powers all of SpaceX. It’s called Warp Drive.

GEHRKE: That’s where you got all your software experience.

LIU: That’s where I learned all about software and working on complex systems, also monoliths and older systems, and how do you think about, you know, sometimes zero-fault tolerance systems and also, that also remain flexible for its users so they can move fast. And then from SpaceX, that took me to a startup called Likewise. It’s here in Bellevue. And then from the startup, I was like, I really like those people in Microsoft. I really want to work in research because they come up with all these cool ideas, and then they could do stuff with it. And I’m such an idea person, and maybe I’m pretty good at execution, but I love the idea side of things. And I discovered that over the course of my career, and that’s actually what brought me here to begin with.

GEHRKE: And that’s, sort of, your superpower that you bring now here. So if I think about a typical day, right, what do you do throughout, throughout your day? What is it, what is it to be a PM manager here at MSR?

LIU: So it’s funny because when I was just a PM and not a manager, I was more, kind of, figuring out, how do I make this product go? How do I make this product ship? How do I move things forward and empower organizations with the products that I—people and organizations on the planet to achieve more [with] what I’m working on? And now as a PM manager, I’m more empowering the people in my team to do that and thinking about uniquely like, who are they, what are their motivations, and then how do I help them grow, and then how do I help their products ship, and how do I help their teams cohere? And so really my day-to-day is so much less, like, being involved in the nitty-gritty details of any project at any point in time, but it’s really meeting with different people around Microsoft Research and just understanding, like, what’s going on and making sure that we’re executing on the impactful work that we want to move forward. You know, it’s boring to say it’s—it doesn’t sound very interesting. Like, mostly, it’s emails and meetings and talking, and, you know, talking to people one-on-one, occasionally writing documents and creating artifacts that matter. But more importantly, I would say it’s creating connections, helping uplift people, and making sure that they are moving and being empowered in the way that they feel that—to help them achieve more.

GEHRKE: That’s super interesting. Maybe in closing, do you have one piece of career advice for everybody, you know, anybody who’s listening? Because you have such an interesting nonlinear career, yet when you are at Disney you couldn’t probably … didn’t imagine that you would end up here at MSR, and you don’t know what, like, we had a little pre-discussion. You said you don’t know where you’re going to go next. So what’s your career advice for any listener?

LIU: I would say, you know, if you’re not sure, it’s OK to not be sure, and, you know, instead of asking yourself why, ask yourself why not. If you look at something and you’re like, hey, that job looks really cool, but I am so unqualified to do it for whatever reason you want to tell yourself, ask yourself why not. Even if it’s, you know, you’re going from toys to something in STEM, or, you know, I’m not a rocket scientist, but somehow, I can create value at SpaceX? Like, if you want to do it, ask yourself why not and try and see what happens. Because if you stop yourself at the start, before you even start trying, then you’re never going to find out what happens next.

[MUSIC]

GEHRKE: It’s just such an amazing note to end on. So thank you very much for the great conversation, Wei.

LIU: Yeah. Thanks, Johannes.

GEHRKE: To learn more about Wei or to see photos of her work and of her childhood in Silicon Valley, visit aka.ms/ResearcherStories.

[MUSIC FADES]


[1] Liu notes the book was Journey to Topaz by Yoshiko Uchida and the subsequent book Journey Home.

CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement

Contrastive language image pretraining (CLIP) is a standard method for training vision-language models. While CLIP is scalable, promptable, and robust to distribution shifts on image classification tasks, it lacks object localization capabilities. This paper studies the following question: Can we augment CLIP training with task-specific vision models from model zoos to improve its visual representations? Towards this end, we leverage open-source task-specific vision models to generate pseudo-labels for an uncurated and noisy image-text dataset. Subsequently, we train CLIP models on these…

Apple Machine Learning Research

TinyAgent: Function Calling at the Edge

The ability of LLMs to execute commands through plain language (e.g., English) has enabled agentic systems that can complete a user query by orchestrating the right set of tools (e.g., ToolFormer, Gorilla). This, along with recent multi-modal efforts such as the GPT-4o or Gemini-1.5 models, has expanded the realm of possibilities with AI agents. While this is quite exciting, the large model size and computational requirements of these models often require their inference to be performed on the cloud. This can create several challenges for their widespread adoption. First and foremost, uploading data such as video, audio, or text documents to a third-party vendor in the cloud can result in privacy issues. Second, this requires cloud/Wi-Fi connectivity, which is not always possible. For instance, a robot deployed in the real world may not always have a stable connection. Besides that, latency could also be an issue, as uploading large amounts of data to the cloud and waiting for the response could slow down response time, resulting in unacceptable time-to-solution. These challenges could be solved if we deploy LLMs locally at the edge.

Riding the Wayve of AV 2.0, Driven by Generative AI

Generative AI is propelling AV 2.0, a new era in autonomous vehicle technology characterized by large, unified, end-to-end AI models capable of managing various aspects of the vehicle stack, including perception, planning and control.

London-based startup Wayve is pioneering this new era, developing autonomous driving technologies that can be built on NVIDIA DRIVE Orin and its successor NVIDIA DRIVE Thor, which uses the NVIDIA Blackwell GPU architecture designed for transformer, large language model (LLM) and generative AI workloads.

In contrast to AV 1.0’s focus on refining a vehicle’s perception capabilities using multiple deep neural networks, AV 2.0 calls for comprehensive in-vehicle intelligence to drive decision-making in dynamic, real-world environments.

Wayve, a member of the NVIDIA Inception program for cutting-edge startups, specializes in developing AI foundation models for autonomous driving, equipping vehicles with a “robot brain” that can learn from and interact with their surroundings.

“NVIDIA has been the oxygen of everything that allows us to train AI,” said Alex Kendall, cofounder and CEO of Wayve. “We train on NVIDIA GPUs, and the software ecosystem NVIDIA provides allows us to iterate quickly — this is what enables us to build billion-parameter models trained on petabytes of data.”

Generative AI also plays a key role in Wayve’s development process, enabling synthetic data generation so AV makers can use a model’s previous experiences to create and simulate novel driving scenarios.

The company is building embodied AI, a set of technologies that integrate advanced AI into vehicles and robots to transform how they respond to and learn from human behavior, enhancing safety.

Wayve recently announced its Series C investment round — with participation from NVIDIA — that will support the development and launch of the first embodied AI products for production vehicles. As Wayve’s core AI model advances, these products will enable manufacturers to efficiently upgrade cars to higher levels of driving automation, from L2+ assisted driving to L4 automated driving.

As part of its embodied AI development, Wayve launched GAIA-1, a generative AI model for autonomy that creates realistic driving videos using video, text and action inputs. It also launched LINGO-2, a driving model that links vision, language and action inputs to explain and determine driving behavior.

“One of the neat things about generative AI is that it allows you to combine different modes of data seamlessly,” Kendall said. “You can bring in the knowledge of all the texts, the general purpose reasoning and capabilities that we get from LLMs and apply that reasoning to driving — this is one of the more promising approaches that we know of to be able to get to true generalized autonomy and eventually L5 capabilities on the road.”

Enhance image search experiences with Amazon Personalize, Amazon OpenSearch Service, and Amazon Titan Multimodal Embeddings in Amazon Bedrock

A variety of different techniques have been used for returning images relevant to search queries. Historically, the idea of creating a joint embedding space to facilitate image captioning or text-to-image search has been of interest to machine learning (ML) practitioners and businesses for quite a while. Contrastive Language–Image Pre-training (CLIP) and Bootstrapping Language-Image Pre-training (BLIP) were the first two open source models that achieved near-human results on the task. More recently, however, there has been a trend to use the same techniques used to train powerful generative models to create multimodal models that map text and images to the same embedding space to achieve state-of-the-art results.

In this post, we show how to use Amazon Personalize in combination with Amazon OpenSearch Service and Amazon Titan Multimodal Embeddings from Amazon Bedrock to enhance a user’s image search experience by using learned user preferences to further personalize image searches in accordance with a user’s individual style.

Solution overview

Multimodal models are being used in text-to-image searches across a variety of industries. However, one area where these models fall short is in incorporating individual user preferences into their responses. A user searching for images of a bird, for example, could have many different desired results.

Six example bird images: bird 1, bird 2, bird 3, bird 4, bird 5, bird 6.

In an ideal world, we can learn a user’s preferences from their previous interactions with images they either viewed, favorited, or downloaded, and use that to return contextually relevant images in line with their recent interactions and style preferences.

Implementing the proposed solution includes the following high-level steps:

  1. Create embeddings for your images.
  2. Store embeddings in a data store.
  3. Create a cluster for the embeddings.
  4. Update the image interactions dataset with the image cluster.
  5. Create an Amazon Personalize personalized ranking solution.
  6. Serve user search requests.

Prerequisites

To implement the proposed solution, you should have the following:

  • An AWS account and familiarity with Amazon Personalize, Amazon SageMaker, OpenSearch Service, and Amazon Bedrock.
  • The Amazon Titan Multimodal Embeddings model enabled in Amazon Bedrock. You can confirm it’s enabled on the Model access page of the Amazon Bedrock console. If Amazon Titan Multimodal Embeddings is enabled, the access status will show as Access granted, as shown in the following screenshot. You can enable access to the model by choosing Manage model access, selecting Amazon Titan Multimodal Embeddings G1, and then choosing Save Changes.

Amazon Bedrock model access

Create embeddings for your images

Embeddings are a mathematical representation of a piece of information such as a text or an image. Specifically, they are a vector or ordered list of numbers. This representation helps capture the meaning of the image or text in such a way that you can use it to determine how similar images or text are to each other by taking their distance from each other in the embedding space.

bird → [-0.020802604, -0.009943095, 0.0012887075, -0….

As a first step, you can use the Amazon Titan Multimodal Embeddings model to generate embeddings for your images. The model accepts either an actual bird image or a piece of text like "bird" as input, and the resulting embeddings will be close to each other when the distance is measured by an appropriate distance metric in a vector database.

The following code snippet shows how to generate embeddings for an image or a piece of text using Amazon Titan Multimodal Embeddings:

import json

import boto3

# Amazon Bedrock runtime client used to invoke the embeddings model
bedrock_runtime = boto3.client("bedrock-runtime")

class EmbedError(Exception):
    """Raised when the embeddings model returns an error message."""

def generate_embeddings_with_titan(image=None, text=None):
    user_input = {}

    # The model accepts an image, a text, or both as input
    if image is not None:
        user_input["inputImage"] = image
    if text is not None:
        user_input["inputText"] = text

    if not user_input:
        raise ValueError("One user input of an image or a text is required")

    body = json.dumps(user_input)

    response = bedrock_runtime.invoke_model(
        body=body,
        modelId="amazon.titan-embed-image-v1",
        accept="application/json",
        contentType="application/json"
    )

    response_body = json.loads(response.get("body").read())

    embedding_error = response_body.get("message")

    if embedding_error is not None:
        raise EmbedError(f"Embeddings generation error: {embedding_error}")

    return response_body.get("embedding")

The image must be base64 encoded in order to create an embedding. For more information, see Amazon Titan Multimodal Embeddings G1. You can create this encoded version of your image for many image file types as follows:

import base64

with open(Image_Filepath + "/" + image, "rb") as image_file:
    input_image = base64.b64encode(image_file.read()).decode('utf8')

In this case, input_image can be passed directly to the embedding function defined earlier.
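
To confirm that related image and text inputs land close together in the embedding space, you can compare their embeddings with a simple similarity measure. The following is a minimal sketch using NumPy; the cosine_similarity helper and the bird.png file name are illustrative, not part of the original solution.

import base64

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# bird.png is a placeholder image file
with open("bird.png", "rb") as image_file:
    image_b64 = base64.b64encode(image_file.read()).decode("utf8")

image_embedding = generate_embeddings_with_titan(image=image_b64)
text_embedding = generate_embeddings_with_titan(text="bird")

# Related image-text pairs should score noticeably higher than unrelated ones
print(cosine_similarity(image_embedding, text_embedding))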

Create a cluster for the embeddings

As a result of the previous step, a vector representation for each image has been created by the Amazon Titan Multimodal Embeddings model. Because the goal is to create a more personalized image search influenced by the user’s previous interactions, you create a cluster out of the image embeddings to group similar images together. This is useful because it will force the downstream re-ranker, in this case an Amazon Personalize personalized ranking model, to learn user preferences for specific image styles as opposed to their preference for individual images.

In this post, to create our image clusters, we use an algorithm made available through the fully managed ML service SageMaker, specifically the K-Means clustering algorithm. You can use any clustering algorithm that you are familiar with. K-Means clustering is a widely used method for clustering where the aim is to partition a set of objects into K clusters in such a way that the sum of the squared distances between the objects and their assigned cluster mean is minimized. The appropriate value of K depends on the data structure and the problem being solved. Make sure to choose the right value of K, because a small value can result in under-clustered data, and a large value can cause over-clustering.

The following code snippet is an example of how to create and train a K-Means cluster for image embeddings. In this example, the choice of 100 clusters is arbitrary—you should experiment to find a number that is best for your use case. The instance type represents the Amazon Elastic Compute Cloud (Amazon EC2) compute instance that runs the SageMaker K-Means training job. For detailed information on which instance types fit your use case, and their performance capabilities, see Amazon Elastic Compute Cloud instance types. For information about pricing for these instance types, see Amazon EC2 Pricing. For information about available SageMaker notebook instance types, see CreateNotebookInstance.

For most experimentation, you should use an ml.t3.medium instance. This is the default instance type for CPU-based SageMaker images, and is available as part of the AWS Free Tier.

import numpy as np
import sagemaker
from sagemaker import KMeans

# SageMaker execution role used by the training job
role = sagemaker.get_execution_role()

num_clusters = 100

kmeans = KMeans(
    role=role,
    instance_count=1,
    instance_type="ml.t3.medium",
    output_path="s3://your_unique_s3bucket_name/",
    k=num_clusters,
    num_trials=num_clusters,
    epochs=10
)

# image_embeddings_list is the list of embeddings generated in the previous step
kmeans.fit(kmeans.record_set(np.asarray(image_embeddings_list, dtype=np.float32)))
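
After training completes, each image embedding needs a cluster assignment so it can be stored alongside the embedding in the next step. One way to do this with the built-in SageMaker K-Means algorithm is to deploy the trained model to an endpoint and run the embeddings through it; the following is a sketch under that assumption, with an illustrative endpoint instance type.

# Deploy the trained K-Means model to a real-time endpoint
kmeans_predictor = kmeans.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large"
)

# Each returned record carries the closest cluster for the corresponding embedding
results = kmeans_predictor.predict(np.asarray(image_embeddings_list, dtype=np.float32))
image_clusters = [int(r.label["closest_cluster"].float32_tensor.values[0]) for r in results]

# Delete the endpoint once the cluster assignments have been collected
kmeans_predictor.delete_endpoint()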

Store embeddings and their clusters in a data store

As a result of the previous step, a vector representation for each image has been created and assigned to an image cluster by our clustering model. Now, you need to store this vector such that the other vectors that are nearest to it can be returned in a timely manner. This allows you to input a text such as “bird” and retrieve images that prominently feature birds.

Vector databases provide the ability to store and retrieve vectors as high-dimensional points. They add additional capabilities for efficient and fast lookup of nearest neighbors in the N-dimensional space. They are typically powered by nearest neighbor indexes and built with algorithms like the Hierarchical Navigable Small World (HNSW) and Inverted File Index (IVF) algorithms. Vector databases provide additional capabilities like data management, fault tolerance, authentication and access control, and a query engine.

AWS offers many services for your vector database requirements. OpenSearch Service is one example; it makes it straightforward for you to perform interactive log analytics, real-time application monitoring, website search, and more. For information about using OpenSearch Service as a vector database, see k-Nearest Neighbor (k-NN) search in OpenSearch Service.

For this post, we use OpenSearch Service as a vector database to store the embeddings. To do this, you need to create an OpenSearch Service cluster or use OpenSearch Serverless. Regardless of which approach you use for the cluster, you need to create a vector index. Indexing is the method by which search engines organize data for fast retrieval. To use a k-NN vector index for OpenSearch Service, you need to add the index.knn setting and add one or more fields of the knn_vector data type. This lets you search for points in a vector space and find the nearest neighbors for those points by Euclidean distance or cosine similarity, either of which is acceptable for Amazon Titan Multimodal Embeddings.

The following code snippet shows how to create an OpenSearch Service index with k-NN enabled to serve as a vector datastore for your embeddings:

def create_index(opensearch_client, index_name, vector_field_name):
    settings = {
      "settings": {
        "index": {
          "knn": True
        }
      },
      "mappings": {
        "properties": {
            vector_field_name: {
              "type": "knn_vector",
              "dimension": 1024,
              "method": {
                "name": "hnsw",
                "space_type": "l2",
                "engine": "faiss",
                "parameters": {
                  "m": 32
                }
              }
            }
        }
      }
    }
    response = opensearch_client.indices.create(index=index_name, body=settings)
    return bool(response['acknowledged'])

The following code snippet shows how to store an image embedding into the OpenSearch Service index you just created:

# Document containing the embedding plus the metadata used later for re-ranking
embedding_vector = {
    "name": image_name,
    "type": "Image",
    "embedding": image_embedding,
    "cluster": image_cluster
}

# opensearch_client is your Amazon OpenSearch Service cluster client;
# index is a unique identifier for this image document
opensearch_client.index(
    index=index_name,
    body=embedding_vector,
    id=str(index),
    refresh=True
)
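
With the embeddings indexed, you can retrieve the nearest images for any query embedding. The following sketch runs a k-NN query for a text embedding such as the one generated for “bird” earlier; the search_similar_images helper name and the value of k are illustrative.

def search_similar_images(opensearch_client, index_name, query_embedding, k=10):
    # k-NN query against the knn_vector field defined in the index mapping
    query = {
        "size": k,
        "query": {
            "knn": {
                "embedding": {
                    "vector": query_embedding,
                    "k": k
                }
            }
        }
    }
    response = opensearch_client.search(index=index_name, body=query)
    # Each hit carries the stored metadata (name, cluster) and a relevance score
    return [
        {
            "name": hit["_source"]["name"],
            "cluster": hit["_source"]["cluster"],
            "score": hit["_score"],
        }
        for hit in response["hits"]["hits"]
    ]

query_embedding = generate_embeddings_with_titan(text="bird")
candidates = search_similar_images(opensearch_client, index_name, query_embedding)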

Update the image interactions dataset with the image cluster

When creating an Amazon Personalize re-ranker, the item interactions dataset represents the user interaction history with your items. Here, the images represent the items and the interactions could consist of a variety of events, such as a user downloading an image, favoriting it, or even viewing a higher resolution version of it. For our use case, we train our recommender on the image clusters instead of the individual images. This gives the model the opportunity to recommend based on the cluster-level interactions and understand the user’s overall stylistic preferences as opposed to preferences for an individual image in the moment.

To do so, update the interactions dataset to include the image cluster instead of the image ID, and store the file in an Amazon Simple Storage Service (Amazon S3) bucket, at which point it can be imported into Amazon Personalize.
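
The following is a minimal sketch of this step using pandas, assuming an interactions file with an ITEM_ID column and an image_to_cluster mapping produced by the clustering model; the file and bucket names are placeholders.

import boto3
import pandas as pd

# image_to_cluster maps each image ID to the cluster assigned in the previous step
interactions_df = pd.read_csv("image_interactions.csv")
interactions_df["ITEM_ID"] = interactions_df["ITEM_ID"].map(
    lambda image_id: str(image_to_cluster[image_id])
)
interactions_df.to_csv("cluster_interactions.csv", index=False)

# Upload the updated dataset so it can be imported into Amazon Personalize
s3 = boto3.client("s3")
s3.upload_file("cluster_interactions.csv", "your_unique_s3bucket_name", "cluster_interactions.csv")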

Create an Amazon Personalize personalized ranking campaign

The Personalized-Ranking recipe generates personalized rankings of items. A personalized ranking is a list of recommended items that are re-ranked for a specific user. This is useful if you have a collection of ordered items, such as search results, promotions, or curated lists, and you want to provide a personalized re-ranking for each of your users. Refer to the following example available on GitHub for complete step-by-step instructions on how to create an Amazon Personalize recipe. The high-level steps are as follows:

  1. Create a dataset group.
  2. Prepare and import data.
  3. Create recommenders or custom resources.
  4. Get recommendations.

We create and deploy a personalized ranking campaign. First, you need to create a personalized ranking solution. A solution is a combination of a dataset group and a recipe, which is basically a set of instructions for Amazon Personalize to prepare a model to solve a specific type of business use case. Then you train a solution version and deploy it as a campaign.

The following code snippet shows how to create a Personalized-Ranking solution resource:

import boto3

# Amazon Personalize control-plane client
personalize_client = boto3.client("personalize")

personalized_ranking_create_solution_response = personalize_client.create_solution(
    name = "personalized-image-reranker",
    datasetGroupArn = dataset_group_arn,
    recipeArn = personalized_ranking_recipe_arn
)
personalized_ranking_solution_arn = personalized_ranking_create_solution_response['solutionArn']

The following code snippet shows how to create a Personalized-Ranking solution version resource:

personalized_ranking_create_solution_version_response = personalize_client.create_solution_version(
    solutionArn = personalized_ranking_solution_arn
)

personalized_ranking_solution_version_arn = personalized_ranking_create_solution_version_response['solutionVersionArn']

The following code snippet shows how to create a Personalized-Ranking campaign resource:

create_campaign_response = personalize_client.create_campaign(
        name = "personalized-image-reranker-campaign",
        solutionVersionArn = personalized_ranking_solution_version_arn,
        minProvisionedTPS = 1
        )

personalized_ranking_campaign_arn = create_campaign_response['campaignArn']
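
Solution version training and campaign creation are asynchronous, so wait until the campaign reaches the ACTIVE status before sending ranking requests. A simple polling sketch follows (the 60-second interval is arbitrary):

import time

# Poll until the campaign finishes deploying; this can take several minutes
while True:
    status = personalize_client.describe_campaign(
        campaignArn=personalized_ranking_campaign_arn
    )["campaign"]["status"]
    if status in ("ACTIVE", "CREATE FAILED"):
        break
    time.sleep(60)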

Serve user search requests

Now our solution flow is ready to serve a user search request and provide personalized ranked results based on the user’s previous interactions. The search query will be processed as shown in the following diagram.

[Architecture diagram: personalized image search query flow]

To set up personalized multimodal search, complete the following steps:

  1. Multimodal embeddings are created for the image dataset.
  2. A clustering model is created in SageMaker, and each image is assigned to a cluster.
  3. The unique image IDs are replaced with cluster IDs in the image interactions dataset.
  4. An Amazon Personalize personalized ranking model is trained on the cluster interaction dataset.
  5. Separately, the image embeddings are added to an OpenSearch Service vector index.

The following workflow would be executed to process a user’s query:

  1. Amazon API Gateway calls an AWS Lambda function when the user enters a query.
  2. The Lambda function calls the same multimodal embedding function to generate an embedding of the query.
  3. A k-NN search is performed for the query embedding on the vector index.
  4. A personalized score for the cluster ID for each retrieved image is obtained from the Amazon Personalize personalized ranking model.
  5. The scores from OpenSearch Service and Amazon Personalize are combined through a weighted mean. The images are re-ranked and returned to the user.

The weights on each score can be tuned based on the available data, the desired outcomes, and the desired degree of personalization vs. contextual relevance.
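
The following sketch shows one way to combine the two signals at query time, using the k-NN candidates retrieved earlier and the Amazon Personalize GetPersonalizedRanking API; the helper name and the 0.7/0.3 weights are illustrative, and in practice you may want to normalize both scores to a common range first.

import boto3

personalize_runtime = boto3.client("personalize-runtime")

def personalized_rerank(user_id, candidates, campaign_arn,
                        w_search=0.7, w_personalize=0.3):
    # Rank the candidate clusters for this user with Amazon Personalize
    cluster_ids = list({str(c["cluster"]) for c in candidates})
    response = personalize_runtime.get_personalized_ranking(
        campaignArn=campaign_arn,
        userId=user_id,
        inputList=cluster_ids,
    )
    personalize_scores = {
        item["itemId"]: item["score"] for item in response["personalizedRanking"]
    }

    # Weighted mean of the OpenSearch relevance score and the Personalize cluster score
    for c in candidates:
        c["combined_score"] = (
            w_search * c["score"]
            + w_personalize * personalize_scores.get(str(c["cluster"]), 0.0)
        )
    return sorted(candidates, key=lambda c: c["combined_score"], reverse=True)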

[Diagram: weighted combination of the OpenSearch Service relevance score and the Amazon Personalize score]

To see what this looks like in practice, let’s explore a few examples. In our example dataset, all users would, in the absence of any personalization, receive the following images if they search for “cat”.

[Six generic cat images returned for all users without personalization]

However, a user who has a history of viewing the following images (let’s call them comic-art-user) clearly has a certain style preference that isn’t addressed by the majority of the previous images.

[Six comic-art-style images from the interaction history of comic-art-user]

By combining Amazon Personalize with the vector database capabilities of OpenSearch Service, we are able to return the following results for cats to our user:

[Six comic-art-style cat images returned for comic-art-user]

In the following example, a user has been viewing or downloading the following images (let’s call them neon-punk-user).

[Three neon-punk-style images from the interaction history of neon-punk-user]

They would receive the following personalized results instead of the mostly photorealistic cats that all users would receive absent any personalization.

[Three neon-punk-style cat images returned for neon-punk-user]

Finally, a user viewed or downloaded the following images (let’s call them origami-clay-user).

[Three origami- and clay-style images from the interaction history of origami-clay-user]

They would receive the following images as their personalized search results.

[Three origami- and clay-style cat images returned for origami-clay-user]

These examples illustrate how the search results have been influenced by the users’ previous interactions with other images. By combining the power of Amazon Titan Multimodal Embeddings, OpenSearch Service vector indexing, and Amazon Personalize personalization, we are able to deliver each user relevant search results in alignment with their style preferences as opposed to showing all of them the same generic search result.

Furthermore, because Amazon Personalize can update based on changes in a user’s style preferences in real time, these search results would update as those preferences change, for example when a designer at an ad agency switches mid-browsing session to a different project for a different brand.

Clean up

To avoid incurring future charges, delete the resources created while building this solution:

  1. Delete the OpenSearch Service domain or OpenSearch Serverless collection.
  2. Delete the SageMaker resources.
  3. Delete the Amazon Personalize resources.

Conclusion

By combining the power of Amazon Titan Multimodal Embeddings, OpenSearch Service vector indexing and search capabilities, and Amazon Personalize ML recommendations, you can boost the user experience with more relevant items in their search results by learning from their previous interactions and preferences.

For more details on Amazon Titan Multimodal Embeddings, refer to Amazon Titan Multimodal Embeddings G1 model. For more details on OpenSearch Service, refer to Getting started with Amazon OpenSearch Service. For more details on Amazon Personalize, refer to the Amazon Personalize Developer Guide.


About the Authors

Maysara Hamdan is a Partner Solutions Architect based in Atlanta, Georgia. Maysara has over 15 years of experience in building and architecting Software Applications and IoT Connected Products in Telecom and Automotive Industries. In AWS, Maysara helps partners in building their cloud practices and growing their businesses. Maysara is passionate about new technologies and is always looking for ways to help partners innovate and grow.

Eric Bolme is a Specialist Solution Architect with AWS based on the East Coast of the United States. He has 8 years of experience building out a variety of deep learning and other AI use cases and focuses on Personalization and Recommendation use cases with AWS.

Read More

End-to-end LLM training on instance clusters with over 100 nodes using AWS Trainium

End-to-end LLM training on instance clusters with over 100 nodes using AWS Trainium

Llama is Meta AI’s large language model (LLM), with variants ranging from 7 billion to 70 billion parameters. Llama uses a transformers-based decoder-only model architecture, which specializes at language token generation. To train a model from scratch, a dataset containing trillions of tokens is required. The Llama family is one of the most popular LLMs. However, training Llama models can be technically challenging, prolonged, and costly.

In this post, we show you how to accelerate the full pre-training of LLMs by scaling up to 128 trn1.32xlarge nodes, using a Llama 2-7B model as an example. We share best practices for training LLMs on AWS Trainium, scaling the training on a cluster with over 100 nodes, improving efficiency of recovery from system and hardware failures, improving training stability, and achieving convergence. We demonstrate that Llama 2-7B trained on Trainium is comparable in quality to the open source version on multiple tasks, ranging from multi-task language understanding and math reasoning to code generation. We also demonstrate the scaling benefits of Trainium.

What makes distributed training across over 100 nodes so challenging?

Training large-scale LLMs requires distributed training across over 100 nodes, and getting elastic access to large clusters of high-performance compute is difficult. Even if you manage to get the required accelerated compute capacity, it’s challenging to manage a cluster of over 100 nodes, maintain hardware stability, and achieve model training stability and convergence. Let’s look at these challenges one by one and how we address them with Trainium clusters during the end-to-end training:

  • Distributed training infrastructure efficiency and scalability – Training LLMs is both computation and memory intensive. In this post, we show you how to enable the different parallel training algorithms on Trainium and select the best hyperparameters to achieve the highest throughput of Llama 2-7B on the Trainium cluster. We also demonstrate the implementations of other memory and computation optimization techniques such as coalescing layers and data type selection on Trainium. Empirically, we have proven that Trainium clusters can reduce costs by up to 46% compared to comparable Amazon Elastic Compute Cloud (Amazon EC2) instances.
  • Efficient hardware and system recovery – End-to-end LLM training at this scale will inevitably encounter hardware or system failures. We demonstrate how to efficiently enable checkpoint saving and automatically recover using the NeuronX Distributed library. Empirically, we demonstrate that with automatic failure recovery, the effective utilization of hardware computing hours reaches 98.81% compared to 77.83% with a manual recovery method.
  • Training stability and convergence – Finally, frequent occurrence of spikes of loss functions in pre-training deep neural networks such as Llama 2 can lead to catastrophic divergence. Due to the large computation cost required for training LLMs, we want to reduce loss function spikes, improve training stability, and achieve convergence of training. We demonstrate best practices and implementation of techniques such as scaled initialization, gradient clipping, and cache management on Trainium clusters to achieve this. We also show how to monitor and debug for training stability.

Llama 2-7B pre-training setup

In this section, we discuss the steps for setting up Llama 2-7B pre-training.

Infrastructure

Setting up the Llama 2-7B infrastructure consists of the following components:

  • EC2 cluster – The training cluster includes 128 trn1.32xlarge instances (nodes), totaling 2048 Trainium accelerators. The instances are interconnected through 8×100 Gbps Elastic Fabric Adapter (EFA) network interfaces. We mounted 56 TB of Amazon FSx storage for immediate data storage and checkpoint saving and loading. The raw training data was saved in Amazon Simple Storage Service (Amazon S3) buckets.
  • Orchestration – We first trained the Llama 2-7B from scratch using a trn1.32xlarge cluster that is managed through Amazon Elastic Kubernetes Service (Amazon EKS). For details about the setup procedure, refer to Train Llama2 with AWS Trainium on Amazon EKS. We followed the same procedure but set up the cluster at a much larger scale with 128 trn1.32xlarge instances.
  • Container build – We used a custom Docker image that was built based on the following training containers and included the Llama 2-7B training source files. We stored the custom Docker image in an Amazon Elastic Container Registry (Amazon ECR) registry and deployed it in EKS pods. The following diagram shows the architecture of the cluster and container setup.

Data preparation

The original format of the training dataset contains a large number of compressed files. To use this dataset, we first converted them into a format compatible with the Hugging Face dataset package. We used the Apache Arrow format (the default storage format for datasets) to combine all data into a single file and a single block of a file. This method significantly reduces load times for TB-sized datasets compared to the default method of loading many separate files.

We first downloaded the preprocessed training dataset, a small subset of the full dataset that contains 12 trillion tokens, using a special EC2 instance with 20–30 TB of memory. The data download script is as follows:

    import os
     
    # Cache and tmpdir can be large. Make sure ~/ has enough disk space.
    os.environ["HF_DATASETS_CACHE"] = "~/dataset/cache"
    os.environ["TMPDIR"] = "~/dataset/tmpdir"
     
    import datasets
    from datasets import load_dataset
     
    save_path = "~/<data path>/arrow"
    save_path = os.path.expanduser(save_path)
    os.makedirs(save_path, exist_ok=True)
     
    raw_datasets = load_dataset("togethercomputer/<1T data file name>", 'default', num_proc=448)
    raw_datasets["train"].save_to_disk(
        save_path,
        num_shards=1,
        num_proc=448,
    )

The dataset is processed for optimized storage and access:

    import pyarrow as pa
    import time
     
    a = time.time()
    stream = pa.memory_map("~/<data path>/arrow/train.arrow")
    stream = pa.ipc.open_stream(stream)
    table = stream.read_all()
    print("completed step 1 in seconds: ", time.time() - a)
     
    ca = table["text"]
    l = ca.to_pylist()
    schema = pa.schema({"text": pa.large_string()})
    arr = pa.array(l, type=pa.large_string())
     
    with pa.OSFile("~/<data path>/arrow/train.arrow", "wb") as sink:
        with pa.ipc.new_stream(sink, schema=schema) as writer:
            batch = pa.record_batch([arr], schema=schema)
            writer.write(batch)
    print("completed step 2 in seconds: ", time.time() - a)

On the same instance, we cleaned up the dataset and uploaded the clean dataset to an S3 bucket. We then used a 128 trn1.32xlarge cluster to perform tokenization and packaging (such as dynamically filling sequences and applying masking mechanisms) online during training. Compared with offline packaging methods, this online method saves tremendous development time and computing resources, especially for multiple experiments that use different large datasets and tokenizers.

Model hyperparameters

We adopted the same training hyperparameters as Llama models. Specifically, we used a cosine learning rate scheduler with the same maximum learning rate of 3e-4 and the same minimum learning rate of 3e-5. We followed the same linear warmup of 2,000 steps. The following figure shows a plot of the overall learning rate scheduler.

We used the AdamW optimizer with beta1 = 0.9 and beta2 = 0.95. We used a weight decay value of 0.1 for all parameters, including normalization weights. For training stability, gradient-norm clipping of 1.0 was applied. For a different model setup, such as Llama 3, these parameters need to be tuned for optimal performance.
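
For reference, the shape of this schedule can be written down in a few lines. The following sketch computes the learning rate at a given step from the settings above (2,000 warmup steps, cosine decay over 480,000 steps from 3e-4 down to 3e-5); it illustrates the schedule, not the exact implementation used in the training scripts.

import math

def cosine_lr(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=2000, decay_steps=480000):
    # Linear warmup followed by cosine decay to the minimum learning rate
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = min((step - warmup_steps) / (decay_steps - warmup_steps), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))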

Distributed training infrastructure efficiency and scalability

During the training, we applied general optimization techniques, such as activation checkpointing, model and data parallelism, and computation and communication overlapping in Trainium through the Neuron SDK, as well as some unique enhancements such as BF16 with stochastic rounding. In this section, we list the key features and configurations used in our model pre-training to improve training efficiency.

Model and data parallelism

Neuron supports tensor parallelism (TP), pipeline parallelism (PP), sequence parallelism (SP), and data parallelism (DP). For the 7B model with 4,096 sequence length, we found that a TP degree of 8, PP degree of 1, SP degree of 8, and DP degree of 512 yields the highest training throughput. On a trn1.32xlarge instance cluster, this leads to having four model copies per instance.

We used a global batch size of 1,024 sequences with a maximum sequence length of 4,096 tokens. Each step covered about 4 million tokens. The gradient accumulation step is 2, which resulted in the actual batch size per Neuron core being 1. The following figure illustrates the data parallelism and tensor parallelism we applied in the training.
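
The parallelism degrees and batch sizes above are tied together by simple arithmetic; the following lines spell it out for this configuration (trn1.32xlarge provides 32 Neuron cores per node).

nodes, cores_per_node = 128, 32
tp, pp = 8, 1

model_copies_per_node = cores_per_node // (tp * pp)            # 4 model copies per instance
dp = nodes * model_copies_per_node                             # data parallel degree of 512

global_batch_size, seq_len = 1024, 4096
grad_accum_steps = 2
per_core_batch = global_batch_size // (dp * grad_accum_steps)  # 1 sequence per Neuron core
tokens_per_step = global_batch_size * seq_len                  # about 4.2 million tokens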

Neuron Distributed library

AWS Neuron is the SDK used to run deep learning workloads on AWS Inferentia and Trainium-based instances. It includes the compiler, runtime, and profiling tools. It supports a variety of data types, including FP32, BF16, FP16, and stochastic rounding. The Neuron SDK enables tensor parallelism, pipeline parallelism, and data parallelism distributed strategies through the NeuronX Distributed library. This allows trade-offs between preserving the high accuracy of trained models and training efficiency in throughput and memory consumption. We applied the following features in the training process:

  • Selective activation checkpointing – We used selective activation checkpointing to improve training efficiency. It has a slightly higher memory cost than full activation checkpointing, but increases the overall training throughput.
  • BF16 with stochastic rounding – We compared three precision settings: BF16, BF16 with SR, and mixed precision training. Empirically, we found that BF16 with SR showed the same convergence behavior as mixed precision training, with higher training throughput and lower memory footprint; whereas the training loss of BF16 diverged. Therefore, we chose BF16 with SR in our pre-training exercise.
  • Coalescing layers with the same inputs – We coalesced linear layers with the same inputs to reduce the communication in tensor and sequence parallelism, and improve the efficiency of matrix operations. Specifically, the Q, K, and V layers in an attention block are coalesced, and the two linear projection layers in SwiGLU are also coalesced. This optimization technique is generic to LLMs. The following are example code snippets:

The q_proj, k_proj, and v_proj layers are merged into qkv_proj:

            if not self.config.separate_qkv and self.num_heads == self.num_key_value_heads and self.config.kv_shared_group_size == 1:
                qkv_states = self.qkv_proj(hidden_states)
                query_states, key_states, value_states = qkv_states.split(self.split_size, dim=2)
            elif self.config.qkv_linear:
                query_states, key_states, value_states = self.qkv_proj(hidden_states)
            else:
                query_states = self.q_proj(hidden_states)
                key_states = self.k_proj(hidden_states)
                value_states = self.v_proj(hidden_states)

The gate_proj and up_proj layers are merged into gate_up_proj:

gate_proj, up_proj = self.gate_up_proj(x).split(self.split_size, dim=2)
  • Compiler optimization – We used the compiling flag --distribution-strategy=llm-training to enable the compiler to perform optimizations applicable to LLM training runs that shard parameters, gradients, and optimizer states across data parallel workers. We also used --model-type=transformer, which performs optimizations specific to transformer models. We set the Neuron environment variable NEURON_FUSE_SOFTMAX=1 to enable compiler optimizations on custom lowering for Softmax operation. Finally, we used NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3 to reduce training latency with asynchronous runs. This overlaps some runs of accelerators and host (CPU).
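
In a training script, these options are typically set through environment variables before the run starts. The following is a minimal sketch, assuming the standard NEURON_CC_FLAGS mechanism for passing compiler flags; confirm the exact flag names against the Neuron SDK documentation for your release.

import os

# Compiler optimizations for LLM training and transformer models (assumed NEURON_CC_FLAGS mechanism)
os.environ["NEURON_CC_FLAGS"] = "--distribution-strategy=llm-training --model-type=transformer"

# Custom lowering for Softmax and asynchronous execution to reduce training latency
os.environ["NEURON_FUSE_SOFTMAX"] = "1"
os.environ["NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS"] = "3"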

The following table summarizes all hyperparameters used in our pre-training exercise.

Category                  Parameter                       Trn – NxD
Optimization parameters   Seq_len                         4096
                          Precision                       bf16
                          GBS                             1024
                          Learning rate                   3.00E-04
                          min_lr                          3.00E-05
                          Weight decay                    0.1
                          grad_clip                       1
                          LR scheduler                    cosine
                          Warmup steps                    2000
                          Constant steps                  0
                          AdamW (beta1, beta2)            (0.9, 0.95)
                          AdamW eps                       1.00E-05
Distributed parameters    Number of nodes                 128
                          TP                              8
                          PP                              1
                          DP                              512
                          GBS                             1024
                          Per-Neuron-core batch size      1
                          Gradient accumulation steps     2
                          Sequence parallel               Yes
Steps                     LR decay steps                  480,000
                          Training steps                  500,000

Hardware and system recovery

Training a billion-parameter LLM often requires training on a cluster with over 100 nodes, running for multiple days or even weeks. The following are best practices for sanity checking the health of the cluster, monitoring it during training, and efficiently recovering from hardware and system failures:

  • Health sanity check and monitoring – It’s important to monitor the health of the computing nodes. In the initial setup, we first performed a thorough check using the Neuron standard test library to make sure the networking bandwidth performs as expected. During the training, the process can be interrupted due to hardware failures, communication timeouts, and so on. We used Amazon EKS settings to monitor the behavior of the computing nodes, which sends out a warning message if a node or the networking goes bad. After that, the cluster stops all the instances and restarts with the health sanity check.
  • Efficient recovery with Neuron automatic fault recovery – To improve the efficiency of fault recovery, NeuronX Distributed supports checkpoint saving and loading. In particular, it optimizes the checkpoint saving time by supporting asynchronous checkpoint saving. To reduce the overhead of manual intervention, NeuronX Distributed provides an API that automatically loads the latest saved checkpoint before failures and restarts the training. These APIs are important for achieving high system uptime and therefore finishing end-to-end training. With the automatic node failure recovery and resuming methods, the effective utilization of hardware computing hours reached 98.81% compared to 77.83% with the manual recovery method. The comparison was based on another experimental training run (over 600 billion tokens) without automatic fault recovery, in which we observed an average of 20% lower system uptime.

Training stability and convergence

During the training process, we found that the training convergence depends on initialization, weight normalization, and gradient synchronization, which can be constantly monitored during the training. The stability depends on reducing frequent distributed file system access. In this section, we discuss the best practices we exercised to improve numeric stability and achieve convergence of the model.

Initialization

We used a scaled initialization strategy for initializing model parameters. Specifically, the initial standard deviation of output layers in attention blocks and MLP layers was scaled by the square root of the number of layers. Similar to what is discussed in the HLAT whitepaper (see the Reference section), we found better numerical stability and convergence with smaller initial variance on deeper layers. Additionally, all parameters were initialized on CPU and offloaded to Trainium. The following figure shows that without the scaled initialization (plotted in green and black), the training loss diverged after 22,000–23,000 steps. In contrast, the training loss (plotted in yellow) converges after enabling the scaled initialization. The default initialization is replaced by this code:

import math
from functools import partial

scaled_init_method = partial(_init_normal,
    config.initializer_range / math.sqrt(2.0 * config.num_hidden_layers))

Gradient synchronization with all-reduce

The gradient all-reduce in torch/xla normalizes the global gradient by world_size instead of data parallelism degrees. When we applied hybrid parallelism including both model parallelism (tensor parallelism and pipeline parallelism) and data parallelism, the world_size was larger than the data parallelism degree. This led to divergence issues because of the incorrect gradient normalization. To fix this, we modified the gradient normalization with a bucket_allreduce_gradients based on data parallelism degrees in NeuronX Distributed. The recommended way is to use neuronx_distributed.parallel_layers.grads.bucket_allreduce_gradients.

Neuron persistent cache on a local worker

When we set up the training cluster, all nodes in the 128 trn1.32xlarge instances shared the same file system, using Amazon FSx for storing data, checkpoints, logs, and so on. Storing the Neuron persistent cache generated from the model compilation on Amazon FSx caused a communication bottleneck because those cached graphs are frequently checked by all Trainium devices in the cluster. Such bottlenecks led to a communication timeout and affected training stability. Therefore, we instead stored Neuron persistent caches (compiled graph binary) in the root volume of each local worker.

Training stability monitoring

During the training, we monitored the training loss, L2-norm of gradients, and L2-norm of parameters for debugging the training stability.

Monitoring the training loss curve gives us the first high-level stability signal. We used TensorBoard to monitor the training loss curve and validation loss curve, as shown in the following figure. The entire model was trained on 1.8 trillion tokens. We observed that the training loss decreases fast for the initial 250 billion tokens and enters a log-linear decrease afterwards.

Monitoring the gradient norm and parameter norms

We monitored the gradient norm as an early signal of divergence. Rapid growth of the gradient norm (more than three times growth from the lowest value) or persistent spikes (benign spikes should return to normal values within a few iterations) can lead to divergence issues. In our training, we observed a stable gradient norm trend even with BF16, as illustrated in the following figure.

The spikes in our gradient norm often last for a single step and don’t impact the overall training convergence. Specifically, we first tracked a running average (r) of the gradient norm over a window of 20 steps to smooth out the natural fluctuations due to batching. We defined a gradient spike as occurring when the current gradient norm is higher than r + 0.1. Next, we tracked the number of steps it took for the gradient norm to return below r + 0.1. In over 86% of cases, the spike deviates from the running average for only a single step, as shown in the following figure.
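
The bookkeeping behind these spike statistics is simple to implement. The following sketch tracks the 20-step running average and records how many steps each spike takes to fall back below r + 0.1; it mirrors the definition above rather than any specific library utility.

from collections import deque

class GradNormSpikeTracker:
    def __init__(self, window=20, margin=0.1):
        self.history = deque(maxlen=window)   # last `window` gradient norms
        self.margin = margin
        self.spike_lengths = []               # duration in steps of each completed spike
        self._current_spike = 0

    def update(self, grad_norm):
        if len(self.history) == self.history.maxlen:
            running_avg = sum(self.history) / len(self.history)
            if grad_norm > running_avg + self.margin:
                self._current_spike += 1      # still above the spike threshold
            elif self._current_spike > 0:
                self.spike_lengths.append(self._current_spike)
                self._current_spike = 0
        self.history.append(grad_norm)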

Finally, we also monitored the parameter norm. This metric is a good way to monitor convergence during the initialization stage. For this setup, the initial values are around 1,600, which is expected based on empirical training results from other hardware.

Training results

In this section, we present the results for model quality evaluation and throughput scalability.

Model quality evaluation

The whole training process takes a few weeks. With the saved pre-training model, we benchmarked the model quality based on different tasks and compared it with OpenLlama 2-7B. The following table benchmarks the accuracy over a variety of tasks: MMLU, BBH, common reasoning, world knowledge, reading comprehension, math, and code. For OpenLLaMA 2, we used the available pre-trained weights and evaluated using the same evaluation pipeline as our pre-trained model. Overall, the model trained on Trn1 shows better or comparable accuracy for all tasks except common reasoning.

Task                       Shots   Metric                  Llama2-7B on trn1   OpenLlama-2
MMLU (5 shot)              5       accuracy                41.318 (3.602)      41.075 (3.611)
BBH (3 shot)               3       multiple_choice_grade   36.565 (1.845)      35.502 (1.861)
Common Reasoning           0       accuracy                56.152 (1.194)      56.893 (1.195)
                                   accuracy_norm           59.455 (1.206)      61.262 (1.19)
World Knowledge (5 shot)           Average exact match     38.846 (0.534)      37.023 (0.52)
Reading Comprehension      0       accuracy                72.508 (0.781)      72.416 (0.782)
Math                       8       accuracy                9.401 (0.804)       5.231 (0.613)
Code                       0       pass@1                  7.62                9.06
                                   pass@10                 19.83               23.58
                                   pass@100                34.15               40.24

We also verified that the model accuracy keeps increasing by training more tokens in the dataset. For comparison, we tracked the model accuracy using saved intermediate checkpoints for different tasks, as shown in the following figures.

The first figure shows the model accuracy for world knowledge.

The following figure shows the model accuracy for common reasoning.

The following figure shows the model accuracy for math.

We observed that the accuracy increases with more training tokens for different tasks.

The model quality could be further improved by fine-tuning for specific tasks on a domain-specific dataset.

Throughput scalability

In addition to the model quality, we checked the training throughput scaling and got more than 90% scaling efficiency for Llama 2-70B for 64 instances, as shown in the following figure. The Llama 2-7B scaling efficiency is slightly lower because the model size is relatively small for a cluster at this scale.

Clean up

To clean up all the provisioned resources for this post, use the following code and the cleanup script described in Train Llama2 with AWS Trainium on Amazon EKS:

./cleanup.sh

Conclusion

This post showed an end-to-end training example for the Llama 2-7B model on a dataset of up to 1.8 trillion tokens, using a cluster of 128 trn1.32xlarge instances. We discussed best practices to overcome the challenges associated with this type of large model training: hardware stability and recovery, model training stability and convergence, and throughput optimization. The saved training model demonstrated good model quality for the general tasks and showed a great cost benefit from training on purpose-built Trainium AI accelerators. To learn more about the model architectures supported for training on Trainium and access tutorials, refer to Training Samples/Tutorials.

Reference

HLAT: High-quality Large Language Model Pre-trained on AWS Trainium, https://arxiv.org/pdf/2404.10630


About the Authors

Jianying Lang is a Principal Solutions Architect at AWS Worldwide Specialist Organization (WWSO). She has over 15 years of working experience in the HPC and AI field. At AWS, she focuses on helping customers deploy, optimize, and scale their AI/ML workloads on accelerated computing instances. She is passionate about combining the techniques in HPC and AI fields. Jianying holds a PhD in Computational Physics from the University of Colorado at Boulder.

Fei Chen has 15 years’ industry experiences of leading teams in developing and productizing AI/ML at internet scale. At AWS, she leads the worldwide solution teams in Advanced Compute, including AI accelerators, HPC, IoT, visual and spatial compute, and the emerging technology focusing on technical innovations (AI and generative AI) in the aforementioned domains.

Haozheng Fan is a software engineer at AWS. He is interested in large language models (LLMs) in production, including pre-training, fine-tuning, and evaluation. His works span from framework application level to hardware kernel level. He currently works on LLM training on novel hardware, with a focus on training efficiency and model quality.

Hao Zhou is a Research Scientist with Amazon SageMaker. Before that, he worked on developing machine learning methods for fraud detection for Amazon Fraud Detector. He is passionate about applying machine learning, optimization, and generative AI techniques to various real-world problems. He holds a PhD in Electrical Engineering from Northwestern University.

Yida Wang is a principal scientist in the AWS AI team of Amazon. His research interest is in systems, high-performance computing, and big data analytics. He currently works on deep learning systems, with a focus on compiling and optimizing deep learning models for efficient training and inference, especially large-scale foundation models. The mission is to bridge the high-level models from various frameworks and low-level hardware platforms including CPUs, GPUs, and AI accelerators, so that different models can run in high performance on different devices.

Jun (Luke) Huan is a Principal Scientist at AWS AI Labs. Dr. Huan works on AI and data science. He has published more than 160 peer-reviewed papers in leading conferences and journals and has graduated 11 PhD students. He was a recipient of the NSF Faculty Early Career Development Award in 2009. Before joining AWS, he worked at Baidu Research as a distinguished scientist and the head of Baidu Big Data Laboratory. He founded StylingAI Inc., an AI startup, and worked as the CEO and Chief Scientist from 2019–2021. Before joining the industry, he was the Charles E. and Mary Jane Spahr Professor in the EECS Department at the University of Kansas. From 2015–2018, he worked as a program director at the US NSF, in charge of its big data program.

Read More

Fine-tune large multimodal models using Amazon SageMaker

Fine-tune large multimodal models using Amazon SageMaker

Large multimodal models (LMMs) integrate multiple data types into a single model. By combining text data with images and other modalities during training, multimodal models such as Claude3, GPT-4V, and Gemini Pro Vision gain more comprehensive understanding and improved ability to process diverse data types. The multimodal approach allows models to handle a wider range of real-world tasks that involve both text and non-text inputs. In this way, multimodality helps overcome the restrictions of pure text models. LMMs have the potential to profoundly impact various industries, such as healthcare, business analysis, autonomous driving, and so on.

However, a general-purpose language model can only process relatively simple visual tasks such as answering basic questions about an image or generating short captions. This is primarily due to the lack of access to detailed pixel-level information, object segmentation data, and other granular annotations that would allow the model to precisely understand and reason about the various elements, relationships, and context within an image. Without this fine-grained visual understanding, the language model is constrained to more superficial, high-level analysis and generation capabilities related to images. Fine-tuning LMMs on domain-specific data can significantly improve their performance for targeted tasks. The prospect of fine-tuning open source multimodal models like LLaVA is highly appealing because of their cost effectiveness, scalability, and impressive performance on multimodal benchmarks. For those seeking flexible and economical solutions, the ability to use and customize these powerful models holds immense potential.

In this blog post, we demonstrate how to fine-tune and deploy the LLaVA model on Amazon SageMaker. The source code is available in this GitHub repository.

LLaVA overview

LLaVA is trained end-to-end to enable general-purpose understanding across both visual and textual data. In the LLaVA model architecture, pre-trained language models such as Vicuna or LLaMA are combined with visual models such as CLIP’s visual encoder. The integration converts the visual features from images into a format that matches the language model’s embeddings through a projection layer.

LLaVA training happens in two stages, as shown in Figure 1 that follows. The first stage is pre-training, which uses image-text pairs to align the visual features with the language model’s embeddings. In this stage, the visual encoder and language model weights are kept frozen, and only the projection matrix is trained. The second stage is fine-tuning the whole model end-to-end. Here, the visual encoder’s weights are frozen, while the projection layer and language model are updated.

Figure 1: LLaVA architecture

Prepare data

When it comes to fine-tuning the LLaVA model for specific tasks or domains, data preparation is of paramount importance because having high-quality, comprehensive annotations enables the model to learn rich representations and achieve human-level performance on complex visual reasoning challenges. In this post, we focus on preparing an instruction dataset.

Data annotation

The dataset should contain image-text pairs that involve reasoning to answer questions about images. To help the model gain comprehensive understanding during the training process, text data should be enriched with contextual nuances. For example, instead of simply asking the model to describe the image, ask specific questions about the image that relate to its content.

To demonstrate LLaVA’s capabilities, we created a small synthetic dataset focused on understanding and interpreting infographics and charts. We used Amazon Bedrock and Python for this task. Specifically, we employed the Amazon Bedrock LLaMA2-70B model to generate text descriptions and question-answer pairs based on those descriptions. Subsequently, we used Python to generate different types of visual presentation such as pie charts and funnel charts based on the text descriptions. If you already have an existing dataset, this method can be used as a data augmentation technique to expand your dataset and potentially enhance the fine-tuning outcome. By creating synthetic examples of text descriptions, question-answer pairs, and corresponding charts, you can augment your dataset with multimodal examples tailored to your specific use case.
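
As an illustration of the chart-generation half of this pipeline, the following sketch turns one generated description into a pie chart with matplotlib. The category values and file name are placeholders, and the Amazon Bedrock call that produces the descriptions and question-answer pairs is omitted.

import matplotlib.pyplot as plt

# Placeholder description produced by the text-generation step
screen_time_distribution = {
    "Less than 2 hours": 15,
    "2-4 hours": 35,
    "4-6 hours": 30,
    "More than 6 hours": 20,
}

fig, ax = plt.subplots(figsize=(6, 6))
ax.pie(
    screen_time_distribution.values(),
    labels=screen_time_distribution.keys(),
    autopct="%1.0f%%",
)
ax.set_title("Distribution of daily screen time")
fig.savefig("screen_time.png", bbox_inches="tight")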

The dataset we created consists of image-text pairs, with each image being an infographic, chart, or other data visualization. The corresponding text is a series of questions about the infographic along with ground truth answers, formatted in a question-answer style intended to resemble how a human might ask the model about the information contained in the image. Some examples of generated questions for images as shown in Figure 2 include:

  • What is the percentage of people who spend less than 2 hours a day on screen time?
  • What proportion of people do not exercise at all weekly?
  • How many people are teachers?

Figure 2: Example charts in the training dataset (left is a pie chart of distribution of daily screen time, right is a funnel chart of occupation)

Data structure

These image-text pairs must be formatted in JSON lines (.jsonl) format, where each line is a training sample. An example training sample follows. Specifically, the id field is the unique identifier of a training sample, the image field specifies the name of the image, and the conversations field provides a question-and-answer pair.

{
  "id": "1",
  "image": "screen_time.png",
  "conversations": [
    {
      "from": "human",
      "value": "What is the percentage of people who spend less than 2 hours a day on screen time?"
    },
    {
      "from": "gpt",
      "value": "15%"
    }
  ]
}
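
If the question-answer pairs are generated programmatically, the .jsonl file can be assembled with a few lines of Python; the samples list below is a placeholder for your own generated data.

import json

# Each element follows the structure shown above
samples = [
    {
        "id": "1",
        "image": "screen_time.png",
        "conversations": [
            {"from": "human", "value": "What is the percentage of people who spend less than 2 hours a day on screen time?"},
            {"from": "gpt", "value": "15%"},
        ],
    },
]

with open("dataset.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")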

By training the model to answer in-depth and analytical questions about infographics it hasn’t seen before, we aim to strengthen the model’s ability to generalize its understanding of data visualizations and draw accurate insights.

Fine tune the model

After the data is prepared, we upload it to Amazon Simple Storage Service (Amazon S3) as the SageMaker training input. In configuring the SageMaker training job, we use the TrainingInput object to specify the input data location in Amazon S3 and define how SageMaker should handle it during training. In this case, input_mode='FastFile' indicates the use of S3 fast file mode, which is ideal for scenarios where the dataset is stored as individual files in S3. S3 fast file mode is also advantageous when working with large datasets or when fast access to data is critical for training performance.

from sagemaker.inputs import TrainingInput

training_input = TrainingInput(
    s3_data_type="S3Prefix",  # Available Options: S3Prefix | ManifestFile | AugmentedManifestFile
    s3_data=s3uri,
    distribution="FullyReplicated",  # Available Options: FullyReplicated | ShardedByS3Key
    input_mode="FastFile",
)

We will reuse the training script from LLaVA, which uses DeepSpeed for training efficiency. DeepSpeed is a library that helps train very large deep learning models faster and more efficiently. ZeRO, short for Zero Redundancy Optimizer, is a memory optimization technique in DeepSpeed that reduces the required memory footprint for data parallelism by partitioning optimization states and gradients across data-parallel processes, enabling larger model sizes and batch sizes within limited GPU memory. This allows you to train much larger models on the same hardware. ZeRO Stage 2 reduces memory usage by partitioning the model’s optimizer states and gradients across multiple processes. Each process only stores a part of these, reducing the memory needed per process. If you run into CUDA memory errors with this configuration, try the Stage 3 configuration instead. Stage 3 additionally partitions the model parameters and can offload states to the CPU, which slows training but might solve the memory issue. The training command follows. See LLaVA: Large Language and Vision Assistant on GitHub for more details about the training parameters.

#!/bin/bash
# Set the prompt and model versions directly in the command
deepspeed /root/LLaVA/llava/train/train_mem.py \
    --deepspeed /root/LLaVA/scripts/zero2.json \
    --lora_enable True \
    --lora_r 128 \
    --lora_alpha 256 \
    --mm_projector_lr 2e-5 \
    --bits 4 \
    --model_name_or_path /root/LLaVA/llava/llava-v1.5-7b \
    --version llava_llama_2 \
    --data_path /root/dataset/train/dataset.json \
    --validation_data_path /root/dataset/validation/dataset.json \
    --image_folder /root/dataset/images/ \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir /root/LLaVA/llava/checkpoints/llama-2-7b-chat-task-qlora \
    --num_train_epochs 500 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 32 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "epoch" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb

LLaVA allows you to fine-tune all parameters of the base model or use LoRA to tune a smaller number of parameters. LoRA’s strategy keeps the original pre-trained model backbone unchanged and adds new, easier-to-train layers. This allows quick adaptation to new tasks without retraining the whole network. You can use the lora_enable parameter to specify the fine-tuning method. For full parameter fine-tuning, ml.p4d.24xlarge is recommended, while ml.g5.12xlarge is sufficient for LoRA fine-tuning if the LLaMA-13B language model is used.

The following code initializes a SageMaker Estimator using the Hugging Face SDK. It sets up a SageMaker training job to run the custom training script from LLaVA. This allows the script to be run within the SageMaker managed environment, benefiting from its scalability. Then we bring our own Docker container to run the SageMaker training job. You can download the Docker image from this code repo, where the dependencies for training the LLaVA model are installed. To learn more about how to adapt your own Docker container to work with SageMaker, see adapting your own training container.

from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point="finetune-lora-piechart-QA.sh",
    source_dir="./LLaVA",
    instance_type=instance_type,
    instance_count=instance_count,
    py_version=PYTHON_VERSION,
    image_uri=CONTAINER_URI,
    role=ROLE,
    metric_definitions=metric_definitions,
    environment=environment,
    use_spot_instances=use_spot_instances,
    max_run=max_run,
    max_wait=max_wait,
    output_path=output_uri,
    checkpoint_s3_uri=checkpoint_uri,
)

For logging purpose, you can use metric definitions to extract key metrics from the training script’s printed logs and send them to Amazon CloudWatch. The following is an example metric definition that logs training loss at each epoch, the model’s learning rate, and training throughput.

metric_definitions = [
    {"Name": "loss", "Regex": "'loss': ([0-9]+(.|e-)[0-9]+),?"},
    {"Name": "learning_rate", "Regex": "'learning_rate': ([0-9]+(.|e-)[0-9]+),?"},
    {"Name": "epoch", "Regex": "'epoch': ([0-9]+(.|e-)[0-9]+),?"},
    {"Name": "train_runtime", "Regex": "'train_runtime': ([0-9]+(.|e-)[0-9]+),?"},
    {"Name": "train_samples_per_second", "Regex": "'train_samples_per_second': ([0-9]+(.|e-)[0-9]+),?"},
    {"Name": "train_steps_per_second", "Regex": "'train_steps_per_second': ([0-9]+(.|e-)[0-9]+),?"},
    {"Name": "train_loss", "Regex": "'train_loss': ([0-9]+(.|e-)[0-9]+),?"},
]

Deploy and test

After the training job finishes, the fine-tuned model is uploaded to Amazon S3. You can then use the following code to deploy the model on SageMaker.

from sagemaker import get_execution_role
from sagemaker.huggingface import HuggingFaceModel

HF_TASK = "question-answering"
config = dict(HF_TASK=HF_TASK)

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data=s3_model_path,
    role=get_execution_role(),
    transformers_version=TRANSFORMERS_VERSION,
    pytorch_version=PYTORCH_VERSION,
    py_version=PYTHON_VERSION,
    model_server_workers=1,
    env=config,
)

# deploy the model to a SageMaker endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=instance_count, instance_type=instance_type
)

For testing, provide an image and question pair and make an inference call against the SageMaker endpoint as follows:

prompt = "what is this chart about?"
data = {
    "image": http_img_path,
    "question": prompt,
    "temperature": 0.1,
}
output = predictor.predict(data)

Conclusion

Our exploration into fine-tuning the LLaVA visual language model on SageMaker for a custom visual question answering task has shed light on the advancements made in bridging the gap between textual and visual comprehension. LLaVA represents a significant step forward in multimodal AI, demonstrating the ability to jointly understand and reason about textual and visual information in a unified model. By using large-scale pretraining on image-text pairs, LLaVA has acquired robust visiolinguistic representations that can be effectively adapted to downstream tasks through fine-tuning. This enables LLaVA to excel at tasks that require deep comprehension of both modalities, such as visual question answering, image captioning, and multimodal information retrieval. However, the fine-tuning mechanism has limitations. In particular, the adjustment of the projection layer and language model themselves while freezing the vision model presents a set of challenges, such as the requirement for a massive amount of data and the lack of capability in handling challenging vision tasks. Confronting these challenges directly allows us to unlock the full potential of multimodal models, paving the way for more sophisticated applications.

Acknowledgement

The authors extend their gratitude to Manoj Ravi, Jenny Vega, and Santhosh Kuriakose for their insightful feedback and review of the post.


About the Authors

Dr. Changsha Ma is an AI/ML Specialist at AWS. She is a technologist with a PhD in Computer Science, a master’s degree in Education Psychology, and years of experience in data science and independent consulting in AI/ML. She is passionate about researching methodological approaches for machine and human intelligence. Outside of work, she loves hiking, cooking, hunting food, and spending time with friends and families.

Jun Shi is a Senior Solutions Architect at Amazon Web Services (AWS). His current areas of focus are AI/ML infrastructure and applications. He has over a decade experience in the FinTech industry as software engineer.

Alfred Shen is a Senior AI/ML Specialist at AWS. He has been working in Silicon Valley, holding technical and managerial positions in diverse sectors including healthcare, finance, and high-tech. He is a dedicated applied AI/ML researcher, concentrating on CV, NLP, and multimodality. His work has been showcased in publications such as EMNLP, ICLR, and Public Health.

Read More

Research Focus: Week of May 27, 2024

Research Focus: Week of May 27, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Research Focus: May 27, 2024

Register now for Research Forum on June 4

Join us for Research Forum, an event series that explores recent research advances, bold new ideas, and important discussions with the global research community in the era of general AI.

In Episode 3, researchers at Microsoft will emphasize the importance of globally equitable AI, share novel use cases and transformative applications from industry to materials design, and provide updates on AutoGen and MatterGen.

Your registration includes access to our live chat with researchers on the event day. 

Episode 3 will air Tuesday, June 4 at 9:00 AM PT.

Generative AI and the Politics of Visibility

Generative AI tools have a remarkable capacity to produce complicated and lengthy texts, with just simple direction from users. AI proponents assert they can help writers, providing creative suggestions, completing half-written sentences or story fragments, and inventing character backstories. But this raises questions about the politics of visibility: what kinds of stories do these tools tend to generate, and what do they generally leave out? Do these tools fully represent diverse or marginalized populations and non-normative communities?

In a recent paper: Generative AI and the Politics of Visibility, a researcher from Microsoft tested three widely available generative AI tools (Bing Chat, ChatGPT, and Google’s Bard, now Gemini) with prompts designed to reveal their normative assumptions, prompting each tool multiple times to track the diversity of the outputs to the same query. His research demonstrates that, at least as currently designed and trained, generative AI tools tend to reproduce normative identities and narratives, rarely representing less common arrangements and perspectives unless specifically prompted. When they do generate variety, it is often narrow, maintaining deeper normative assumptions in what remains absent.



ACM MMSys 2024 Bandwidth Estimation in Real Time Communications Challenge

Videoconferencing has become indispensable for everything from global business operations to accessible education, transforming the way people communicate across physical barriers and geographical divides. The quality of experience (QoE) delivered by video conferencing systems depends in part on correctly estimating the capacity of the bottleneck link between the sender and the receiver over time. Bandwidth estimation for real-time communications (RTC) remains a significant challenge, primarily due to the continuously evolving heterogeneous network architectures and technologies. From the first bandwidth estimation challenge hosted by Microsoft at ACM MMSys 2021, researchers learned that bandwidth estimation models trained with reinforcement learning (RL) in simulations to maximize network-based reward functions may not be optimal, due to the sim-to-real gap and the difficulty of aligning network-based rewards with user-perceived QoE. In this year’s ACM MMSys 2024 Bandwidth Estimation in Real Time Communications Challenge, researchers from Microsoft aim to align reward maximization with user-perceived QoE optimization using offline RL and a real-world dataset released by Microsoft Teams. The challenge received enthusiastic participation from both academia and industry. All models submitted to the grand challenge underwent initial evaluation, and top models were further evaluated on a geographically distributed testbed. Challenge results show that by leveraging real-world data and integrating objective audio/video quality scores as rewards, offline RL can facilitate the development of competitive bandwidth estimators for RTC.


Player-Driven Emergence in LLM-Driven Game Narrative

Game creation is a labor-intensive process, with limited automation of non-graphic game elements related to dialogue and narrative structure. These elements are typically hand-coded and rigidly deterministic, with few options presented to the player. Large language models (LLMs) are beginning to show potential in the creation of richer and more expansive narrative spaces. 

In a recent paper: Player-Driven Emergence in LLM-Driven Game Narrative, accepted for presentation at the IEEE Conference on Games 2024, researchers from Microsoft in collaboration with members of the Xbox organization explore how interaction with LLMs can empower players to participate in the evolution of game narratives. As a testbed, they created a text-adventure game in which players attempt to solve a mystery under a fixed narrative premise but can freely interact with non-player characters generated by GPT-4, a state-of-the-art LLM. They recruited 28 gamers to play the game and used GPT-4 to automatically convert the game logs into a node graph representing the narrative in each player’s gameplay. Through their interactions with the non-deterministic behavior of the LLM, players were able to discover interesting new emergent nodes that were not part of the original narrative but have the potential to be fun and engaging. Players who created the most emergent nodes tended to be those who enjoy games that facilitate discovery, exploration, and experimentation.


Segmentation using large language models: A new typology of American neighborhoods

The U.S. Census Bureau’s American Community Survey (ACS) is the country’s primary source of social and economic data. But much of the data is low quality, especially at the highest levels of geographic detail (Block Groups). As one zooms in geographically on a map, the resolution of social and economic data decreases, which is counterintuitive. Typically, zooming in generates more detail, not less. Recent changes in the U.S. statistical system have amplified this geographic-demographic resolution trade-off.

In a recent paper: Segmentation using large language models: A new typology of American neighborhoods, researchers from Microsoft present a solution to this problem in the form of an AI-based open and reproducible geodemographic classification system using small area estimates from the ACS. They apply a partitioning clustering algorithm to a range of socio-economic, demographic, and built-environment variables. Using an open-source software pipeline ensures adaptability to future data updates. One key innovation is the integration of GPT-4 to generate intuitive cluster descriptions and names. This represents a novel application of natural language processing in geodemographic research and showcases the potential for human-AI collaboration within the geospatial domain.
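
As a rough illustration of this kind of clustering-then-labeling pipeline (not the authors' exact method; the file name and parameter choices below are placeholders), the ACS variables could be standardized, partitioned with k-means, and summarized per cluster before an LLM drafts descriptive names:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Placeholder table of small-area ACS estimates (socio-economic, demographic,
# and built-environment variables), indexed by geographic identifier.
acs_df = pd.read_csv("acs_block_groups.csv", index_col="geoid")

# Standardize the variables so no single one dominates the distance metric.
X = StandardScaler().fit_transform(acs_df)

# Partitioning clustering: assign each area to one of k geodemographic clusters.
kmeans = KMeans(n_clusters=10, random_state=0, n_init=10)
acs_df["cluster"] = kmeans.fit_predict(X)

# Per-cluster profiles (e.g., variable means) could then be passed to an LLM
# such as GPT-4 to draft intuitive cluster names and descriptions.
cluster_profiles = acs_df.groupby("cluster").mean()
print(cluster_profiles.head())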


From Local to Global: A Graph RAG Approach to Query-Focused Summarization 

The use of retrieval-augmented generation (RAG) to retrieve relevant information from an external knowledge source enables LLMs to answer questions over private and/or previously unseen document collections. However, RAG fails on global questions directed at an entire text corpus, such as: “What are the main themes in the dataset?”, since this is inherently a query-focused summarization (QFS) task, rather than an explicit retrieval task. Prior QFS methods fail to scale to the quantities of text indexed by typical RAG systems.

In a recent preprint: From Local to Global: A Graph RAG Approach to Query-Focused Summarization, researchers from Microsoft propose combining the strengths of these contrasting methods through a Graph RAG approach to question answering over private text corpora that scales with both the generality of user questions and the quantity of source text to be indexed. This approach uses an LLM to build a graph-based text index in two stages: first to derive an entity knowledge graph from the source documents, then to pre-generate community summaries for all groups of closely-related entities. Given a question, each community summary is used to generate a partial response, before all partial responses are again summarized in a final response to the user. For a class of global sensemaking questions over datasets in the 1 million token range, Graph RAG leads to substantial improvements over a naïve RAG baseline for both the comprehensiveness and diversity of generated answers.

Microsoft Research in the news


Microsoft Announces New Foundation Model For Digital Pathology, Diving Deeper Into Clinical Medicine  

Forbes | May 22, 2024

In partnership with Providence health system and the University of Washington, Microsoft has leveraged its significant work with generative AI to launch GigaPath, the first whole-slide foundation model for digital pathology that has been pre-trained with real-world data.


Spanish mini-satellites bring the internet to isolated areas (en español)  

La Razon | May 17, 2024

The Spanish company Fossa, with help from Microsoft Research, has successfully tested a small satellite weighing less than a kilogram that improves connectivity in places with little or no coverage, a potential boost for the internet of things (IoT).

The post Research Focus: Week of May 27, 2024 appeared first on Microsoft Research.

Read More

The Crossroads of Innovation and Privacy: Private Synthetic Data for Generative AI

The Crossroads of Innovation and Privacy: Private Synthetic Data for Generative AI

A flow chart with four successive blocks. Starting with a data owner, private data is provisioned to train a language model with differential privacy. The language model is subsequently prompted to generate novel synthetic data resembling the private data. This data can be used for down-stream applications such as machine learning, feedback analysis or statistical analysis.

Introduction

In today’s data-driven world, organizations strive to leverage data to train and adapt AI models. However, this pursuit often faces an important challenge: balancing the value of data with the need to safeguard individuals’ right to privacy and comply with data privacy regulations like the General Data Protection Regulation (GDPR) and the EU AI Act.

Synthetic data has emerged as a powerful solution to privacy and compliance challenges. It allows organizations to create realistic and useful datasets, tailored to specific use cases, without compromising individual privacy. This enables organizations to:

  • Train and adapt AI models: Synthetic data can be used to train and adapt models to specific domains and industries, even when real-world data is limited, or privacy concerns exist.
  • Comply with regulations: Since it doesn’t require user data, synthetic data generation helps organizations adhere to data privacy regulations.
  • Unlock new possibilities: Synthetic data opens doors to innovative AI applications that were previously limited by data availability or privacy constraints.

Microsoft’s Phi-3 small language model (SLM) is a good example of how synthetic data can contribute to responsible AI development, enabling the creation of powerful language models without compromising privacy. Phi-3 leverages a combination of “textbook quality” web data and LLM-generated synthetic content, creating a strategic approach that doesn’t need real-world personal data.

However, synthetic data carries limitations. It can be difficult to artificially generate realistic data that anticipates a wide range of use cases and individual scenarios. Furthermore, synthetic data generated by pre-trained large language models (LLMs) can sometimes reduce accuracy and increase bias on downstream tasks. So, how could we generate synthetic data that accurately captures the diversity and specificity of private data while maintaining strict privacy protections for data contributors?

Differential privacy: A bridge between innovation and privacy

Differentially private (DP) synthetic data generation is a promising solution. It allows developers to pursue innovations in machine learning while prioritizing privacy. The goal of synthetic data generation is to produce data statistically similar to real-world data sources. However, when the data is too similar, replicating uniquely identifying details of the source data, the promise of preserving privacy is compromised. This is where DP can help. DP is a mathematical framework for providing a guarantee that a particular computation is relatively invariant to the addition or removal of a single data contributor. Using DP techniques, researchers can generate synthetic datasets that retain the statistical properties of the original data while ensuring that information that could help identify data contributors remains obscured. 
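
For reference, the standard formulation of this guarantee is (ε, δ)-differential privacy: a randomized mechanism M satisfies it if, for any two datasets D and D′ that differ in the data of a single contributor and for any set of outputs S, Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ. Smaller values of ε and δ mean the output distribution changes very little when any one contributor's data is added or removed, which is the sense in which the computations described below are relatively invariant to individual contributions.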

This blog post explores recent advancements in private synthetic data generation. We examine four recently published research papers that propose innovative techniques for generating synthetic data with strong privacy guarantees, while maintaining its usefulness for analytics, training AI models, and other tasks.

In the remainder of this blog post, we describe each approach in more detail, and present experimental results illustrating their value.

Technical deep dive: Differentially private synthetic data generation 

Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe

Generative LLMs offer the opportunity to produce synthetic text by sampling from LLM outputs. One avenue to generating realistic synthetic text is to fine-tune an LLM using representative data. For example, we could consider fine-tuning a pre-trained LLM on a corpus of scientific papers, enabling the model to more readily produce text that captures the knowledge and writing style used in scientific writing. Suppose, however, that we want to produce synthetic text based on a private corpus of documents. What steps can we take to protect the document authors and any sensitive information in their documents? For example, we may want to produce synthetic medical notes, or personal emails. LLMs have a well-known capacity to memorize training examples, and a model with the potential for reproducing samples from the training set might pose significant privacy risks.

In the paper Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe, researchers from Microsoft presented an approach to leveraging a private data corpus for synthetic generation, without compromising the privacy of the data subjects. This approach uses differentially private stochastic gradient descent (DP-SGD) to fine-tune an LLM on the private documents with a strong privacy guarantee. Differentially private model training provides a mathematical guarantee that the trained model parameters, and any subsequent model outputs, are relatively unaffected by the addition or removal of any single user’s training examples.
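
A minimal sketch of the core DP-SGD update is shown below; it is not the exact training recipe from the paper, and the model, loss function, and batch are placeholders. Each example's gradient is clipped to a fixed norm so that no single example can dominate the update, and Gaussian noise calibrated to that clipping bound is added before the parameters are changed.

import torch

def dp_sgd_step(model, loss_fn, examples, labels,
                lr=1e-3, clip_norm=1.0, noise_multiplier=1.0):
    """One illustrative DP-SGD step: clip per-example gradients, add noise, update."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed_grads = [torch.zeros_like(p) for p in params]

    # Compute and clip the gradient of each example separately, bounding
    # the influence any single training example can have on the update.
    for x, y in zip(examples, labels):
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        grad_norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in params)).item()
        scale = min(1.0, clip_norm / (grad_norm + 1e-6))
        for acc, p in zip(summed_grads, params):
            acc.add_(p.grad, alpha=scale)

    # Add Gaussian noise calibrated to the clipping bound, then apply the update.
    with torch.no_grad():
        for p, acc in zip(params, summed_grads):
            noise = torch.normal(0.0, noise_multiplier * clip_norm, size=p.shape)
            p.add_(-(lr / len(examples)) * (acc + noise))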

The synthetic generation approach described in this work was validated by training on restaurant reviews with varying levels of privacy protection, then prompting the model to generate novel reviews. These reviews were then used for downstream classification tasks, such as sentiment prediction and restaurant genre classification, and the results, which are shown in Table 1, demonstrated only small accuracy penalties compared to training on the raw private data. This approach unlocks a powerful way for realistic synthetic data to be generated from private data without compromising privacy or confidentiality.

A flow chart with four successive blocks. Starting with a data owner, private data is provisioned to train a language model with differential privacy. The language model is subsequently prompted to generate novel synthetic data resembling the private data. This data can be used for down-stream applications such as machine learning, feedback analysis or statistical analysis.
Figure 1: By fine-tuning an LLM with differential privacy, the model can be used to generate synthetic examples that resemble the private corpus 
A table of results with four columns and four rows. The columns indicate data type, data generator, epsilon, rating and category.  The first row indicates “original” data type and no entry for data generator or epsilon. The rating is 0.733 and category is 0.775.  The following three rows all indicate Synthetic for data type and GPT2, GPT2-Medium, and GPT2-Large for the data generator. Each row is further divided into two rows corresponding to epsilon = 4 and epsilon = infinity respectively. In all cases the rating and category scores are lower than the row marked original by a few percentage points. The rows corresponding to epsilon = 4 are lower than corresponding rows marked epsilon=infinity by 1-2 percentage points. In general the epsilon = 4 rows have increased scores for larger GPT2 models, while the epsilon=infinity rows are relatively flat.
Table 1: Various versions of GPT-2 were trained on restaurant reviews both with (ε=4) and without (ε =∞) a privacy guarantee. These models were used to produce synthetic training sets, which were used to train classification models for review rating and restaurant category, and subsequently evaluated for accuracy on a private hold-out set. The results show that models trained on the synthetic data can achieve accuracy competitive with models trained without a privacy guarantee. 
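
As a simple illustration of the generation step described above (the checkpoint path and control prompt below are placeholders, not artifacts from the paper), a causal language model fine-tuned with DP-SGD can be sampled with the standard Hugging Face generation API:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path to a model fine-tuned with DP-SGD on the private reviews.
model_path = "path/to/dp-finetuned-gpt2"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# Condition generation on a control prompt (e.g., rating and category),
# then sample a novel synthetic review from the DP-trained model.
prompt = "Rating: 5 | Category: Italian | Review:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))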

Differentially Private Synthetic Data via Foundation Model APIs

While the ACL paper above demonstrated a robust approach to synthetic data generation, fine-tuning a large model can be impractical. Model training requires significant computing capacity, and some of the most powerful models available are proprietary and not accessible for DP training. Recognizing this challenge, researchers at Microsoft explored whether synthetic data can be generated directly using only inference API access to a model, even while utilizing an untrusted model controlled by a third party. Crucially, the synthetic data should resemble a targeted private corpus and provide a DP guarantee similar to the one achieved in the previous work based on model training. In two separate papers, the authors demonstrate an approach to this problem using a differentially private sampling approach called Private Evolution (PE).

Two independent flow charts. In the first, private data is applied to a pre-trained model using DP-SGD. The fine-tuned model is used to produce differentially private synthetic data.  In the second chart, a pre-trained model is prompted via its API to produce generic data. Private data is used to inform selection of the generated data, with a strong privacy guarantee, yielding differentially private synthetic data.
Figure 2: Instead of fine-tuning pre-trained models with DP-SGD (top figure), Private Evolution (PE) only requires accessing the inference APIs of a model (bottom figure). Thus, PE is easily compatible with foundation models that are difficult to DP-fine-tune (e.g., because they are too large) or infeasible to fine-tune (e.g., they are only accessible through inference APIs).

Synthetic image generation using foundation model APIs: In Differentially Private Synthetic Data via Foundation Model APIs 1: Images, the authors introduced Private Evolution (PE), an approach that enables DP image synthesis merely through inference APIs of a generative model. PE operates by sampling from a pre-trained diffusion model such as Stable Diffusion, which has no knowledge of the private corpus. PE then iteratively compares these samples to the private corpus, keeps the ones that are most similar to the private corpus, and uses the pre-trained model to generate more such samples. Crucially, the comparison to the private corpus is done with a DP guarantee, so that any information revealed about the private corpus is strictly bounded. Also, all the queries to the foundation model APIs satisfy the same DP guarantee, so that we can safely use APIs provided by (untrusted) third parties. 

Figure 3: Overview of PE. We use two private and synthetic images for illustration. Step 1 (RANDOM_API): we use the model API to generate random images. Step 2: We iteratively go through steps 2.1-2.3 to refine the synthetic images towards the private images. Step 2.1: Each private image votes for its closest synthetic image in the embedding space. In this example, we assume that the bird image gets two votes, and the car image gets zero votes. We then add Gaussian noise to the votes to ensure DP. This gives us the DP Nearest Neighbor Histogram (DP_NN_HISTOGRAM). Step 2.2: We resample the generated images proportional to the histogram. We assume that only the bird image remains. Step 2.3 (VARIATION_API): We use the model API to generate new similar images to the bird image, which are the initial synthetic images in the next iteration. 
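
A schematic sketch of one PE iteration under these steps might look like the following; embed and variation_api stand in for an embedding model and the foundation model's variation API, and are not actual library calls.

import numpy as np

def private_evolution_step(synthetic, private_embs, embed, variation_api,
                           sigma=1.0, num_new=None):
    """One illustrative PE iteration: DP nearest-neighbor voting, resampling, variation."""
    num_new = num_new or len(synthetic)

    # Step 2.1: each private example votes for its closest synthetic sample in
    # embedding space; Gaussian noise makes the vote histogram differentially private.
    synth_embs = np.stack([embed(s) for s in synthetic])
    votes = np.zeros(len(synthetic))
    for e in private_embs:
        nearest = np.argmin(np.linalg.norm(synth_embs - e, axis=1))
        votes[nearest] += 1
    dp_histogram = votes + np.random.normal(0.0, sigma, size=votes.shape)

    # Step 2.2: resample synthetic candidates in proportion to the noisy histogram.
    probs = np.clip(dp_histogram, 0, None)
    total = probs.sum()
    probs = probs / total if total > 0 else np.full(len(synthetic), 1.0 / len(synthetic))
    chosen = np.random.choice(len(synthetic), size=num_new, p=probs)

    # Step 2.3: ask the foundation model API for variations of the survivors,
    # which become the synthetic candidates for the next iteration.
    return [variation_api(synthetic[i]) for i in chosen]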

Even without doing any model training, PE significantly advances state-of-the-art results on some of the datasets. For example, on the CIFAR-10 dataset, we achieve an FID score (an image quality measure, where smaller is better) of ≤ 7.9 with a DP privacy cost of ϵ = 0.67, significantly improving on the previous state of the art of ϵ = 32. In the paper, we also show that PE requires fewer computational resources (GPU hours) than DP fine-tuning to achieve such results.

A 2D line chart with six line series, comprising conditional and unconditional variations on the private evolution and DP-MEPF methods, as well as DP-GAN and DP-Diffusion. The x axis presents values of epsilon from 0 to 32. The y axis presents values of the image quality measure FID from 0 to 80, where lower values are better. All six series show decreasing values of FID for increasing values of epsilon. Both of the series corresponding to private evolution show significantly lower FID values, ranging from about epsilon = 0.1 to epsilon = 2.
Figure 4: FID (image quality measure, lower is better) vs. DP privacy cost ϵ on CIFAR10 (δ = 10−5 ). (Un)cond means (un)conditional generation. Ours achieves the best privacy-quality trade-off compared to prior training-based approaches.
An array of ten rows of thumbnails, each row depicting ten instances of generated synthetic images. The rows include birds, cars, cats, dogs, and other animals, planes, boats and trucks.  Most of the images appear to be realistic with some exhibiting unusual artifacts.
Figure 5: Private Evolution-generated samples using CIFAR-10 as the private corpus (ε =0.67, δ =10-5). Each row corresponds to one object class.

Synthetic text generation using foundation model APIs: The PE approach described above works well for images because it is easy to produce nearby perturbations of promising images. In Differentially Private Synthetic Data via Foundation Model APIs 2: Text, Microsoft researchers explored whether a similar approach could be applied to text. Their method, called Augmented Private Evolution (Aug-PE), operates similarly to the basic PE approach but leverages the power of a pre-trained LLM to produce variations and re-wordings of input text. Aug-PE also proposes some fundamental algorithmic improvements that may benefit future development of PE.

An overview of the Augmented Private Evolution algorithm for synthetic text generation. Step 1 invokes a language model to produce random text. Step 2.1 uses private data and differential privacy to vote on the best candidates from step 1. Step 2.2 samples from this differentially private histogram to produce a selected set of generations. Step 2.3 prompts a language model to produce variants of the selected generations, and steps 2.1 to 2.3 are repeated.
Figure 6: Augmented Private Evolution (Aug-PE) leverages a foundational LLM to synthesize text and compare in a privacy-preserving way with a private corpus. Similar to PE for images, in Aug-PE, samples that more closely resemble the private data are retained and refined to produce new synthetic text with a strong privacy guarantee. The illustration shows how we generate DP synthetic reviews for restaurants given two private samples.
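
For text, the variation step can be as simple as prompting an LLM to rephrase a surviving candidate. The sketch below is only an illustration of that idea; llm_complete is a placeholder for whatever inference API is available, not a specific library call.

def text_variation(text, llm_complete):
    """Placeholder text-variation step for Aug-PE: ask an LLM to rewrite a candidate."""
    prompt = (
        "Rewrite the following restaurant review so that it keeps a similar "
        "topic, tone, and length, but uses different wording:\n\n"
        f"{text}\n\nRewritten review:"
    )
    return llm_complete(prompt)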

Results show that Aug-PE is a promising alternative to DP fine-tuning for DP text synthesis. With the same foundation model, Aug-PE can match or even beat DP fine-tuning in terms of the trade-off between text quality and privacy. Moreover, because Aug-PE only requires inference APIs, it can easily work with the most advanced LLMs, such as GPT-3.5, LLaMA, and Mixtral, to further improve text quality. In terms of computational cost (GPU hours), Aug-PE can achieve up to a 65.7x speedup compared to the DP fine-tuning approach.

A table of results for area and rating classification accuracy for a variety of models and comparing PE with DP synthesis. The table contains the remark that with the same model PE matches or beats DP fine-tuning on text quality vs privacy, and PE works well with advanced LLMs which may be challenging or impossible to fine-tune. The models compared include three sizes of GPT-2, several major open source models, and GPT-3.5. PE on the Mixtral model shows the strongest Area classification accuracy at 43.6 while PE on GPT-3.5 shows the strongest Rating classification accuracy at 43.1.
Table 2: Results on ICLR 2023 paper reviews (ϵ = 1). We use each method to generate DP synthetic paper reviews and test the utility of the data by training downstream paper-area and rating classifiers and evaluating their accuracies on the real hold-out data (higher is better). Under the same base model (GPT-2 family), PE achieves competitive results with DP fine-tuning. PE also supports advanced LLMs that may be challenging to use with DP fine-tuning due to large model sizes or black-box access.

Privacy-Preserving In-Context Learning with Differentially Private Few-Shot Generation

In-context learning is a technique for performing tasks with an LLM by providing a sample of demonstration examples in the prompt of the LLM before presenting it with a specific task. For example, we might show a few movie plots and their genre and ask the LLM to suggest the genre for a particular plot of interest. In-context learning harnesses the strong generalization capabilities of LLMs, but it requires a sample of labeled demonstration examples at inference time. How can we perform in-context learning when the only available labeled examples are private? A naïve solution might be to use the private examples but hide the demonstration prompt from the user. However, the threat posed by jailbreak attacks puts these examples at risk for exposure to a malicious user.

In Privacy-Preserving In-Context Learning with Differentially Private Few-Shot Generation, Microsoft researchers explored how demonstration examples can be synthesized from a private corpus with a privacy guarantee. The method operates by incrementally drawing samples from a token distribution defined by the private examples but with noise added to the distribution. The noise is calibrated to ensure a bound on the privacy lost with each sample. The research demonstrated that in-context learning can out-perform zero-shot learning (querying a model without any demonstration examples) and comes close to performing at the same level as the case with no privacy mitigations, as shown in Table 3. 

An overview of differentially private few-shot generation.  A round of token generation is depicted with four steps. Given the tokens generated so far, step 1 selects the relevant private data. Step 2 takes an M by N sample of the private data, producing M batches of N examples. Step 3 assembles M LLM prompts with task instructions and the N examples appended. Step 4 feeds the M prompts to the LLM and performs noisy aggregation over the LLM’s output probabilities to select the next generated token.
Figure 7: Illustration of DP few-shot generation. The example shows a synthetic demonstration generated token by token for the topic school with a differentially private guarantee. As new tokens are sampled, the private examples inform the sampling probability of each subsequent token, with noise injected to preserve privacy. 
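
A rough sketch of the noisy next-token aggregation depicted in Figure 7 follows; get_next_token_probs is a placeholder for querying the LLM's output distribution for one prompt, and the noise scale sigma stands in for the calibration described in the paper.

import numpy as np

def dp_next_token(prompts, generated_so_far, get_next_token_probs, sigma=0.5):
    """Pick the next token by noisily aggregating distributions from M private prompts."""
    # Query the LLM once per prompt (each prompt carries a disjoint batch of
    # private examples) and average the next-token probability vectors.
    prob_vectors = np.stack([
        get_next_token_probs(p + generated_so_far) for p in prompts
    ])
    mean_probs = prob_vectors.mean(axis=0)

    # Add Gaussian noise so the selected token reveals only a bounded amount
    # of information about any single private example.
    noisy_scores = mean_probs + np.random.normal(0.0, sigma, size=mean_probs.shape)
    return int(np.argmax(noisy_scores))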
A table of results for private in-context learning tasks, including text classification on three datasets (AGNews, DBPedia, and TREC) and information extraction on two datasets (MIT-G and MIT-D).  Accuracy is compared across two cases where epsilon = 0 (zero-shot and four-shot) and values of epsilon at 1, 2, 4, 8 and infinity. Generally, accuracy improves as epsilon increases but epsilon = 8 often outperforms epsilon = infinity.
Table 3: For classification and information extraction tasks, DP in-context learning achieves accuracy similar to non-private ICL (ϵ =∞) 

Conclusion

Synthetic data generation presents enormous opportunities to develop AI systems without compromising end-user privacy. In this blog post, we have explored recent innovations in synthetic data generation with strong privacy guarantees. These approaches can enable practitioners to produce synthetic data from private data sources, while mitigating the risk that private information might be revealed. While these approaches are highly promising, they do have limitations. For example, we are currently limited to producing relatively short text passages. Future work will continue to explore the opportunities presented by these approaches, with an aim to produce increasingly realistic data with strong privacy guarantees.

Acknowledgments: The authors are grateful for the contributions of the co-authors of the papers reviewed in this blog post: Xiang Yue, Xuechen Li, Girish Kumar, Julia McAnallen, Hoda Shajari, Huan Sun, David Levitan, Chulin Xie, Arturs Backurs, Sivakanth Gopi, Da Yu, Harsha Nori, Haotian Jiang, Huishuai Zhang, Yin Tat Lee, Bo Li, Janardhan Kulkarni, Xinyu Tang, Richard Shin, Andre Manoel, and Niloofar Mireshghallah.

The post The Crossroads of Innovation and Privacy: Private Synthetic Data for Generative AI appeared first on Microsoft Research.

Read More