Just Tech: Centering Community-Driven Innovation at the Margins episode 2 with Dr. Tawanna Dillahunt, Zachary Rowe, and Joanna Velazquez

Headshots of podcast guests Tawanna Dillahunt, Zachary Rowe, Joanna Velazquez, and host Mary Gray in a two-by-two grid and set against a dark purple background. Each headshot is contained within a hexagon shape.

Episode 134 | March 31, 2022

In “Just Tech: Centering Community-Driven Innovation at the Margins,” Senior Principal Researcher Mary L. Gray explores how technology and community intertwine and the role technology can play in supporting community-driven innovation and community-based organizations. Dr. Gray and her team are working to bring computer science, engineering, social science, and communities together to boost societal resilience in ongoing work with Project Resolve. She’ll talk with organizers, academics, technology leaders, and activists to understand how to develop tools and frameworks of support alongside members of these communities.

In this episode of the series, Dr. Gray talks with Dr. Tawanna Dillahunt, Associate Professor at the University of Michigan’s School of Information; Zachary Rowe, Executive Director of Friends of Parkside; and Joanna Velazquez, Campaign Manager at Detroit Action. The guests share personal experiences where community and research collaborations have been most impactful in solving problems, talk about ways that participatory research can foster equal partnerships and fuel innovation, and offer perspectives on how researchers can best partner with communities to work through problems at a local level. They also discuss the role that technology plays—and doesn’t play—in their work.



Transcript

[MUSIC PLAYS OVER INTRODUCTION]

Mary Gray: Welcome to the Microsoft Research Podcast series “Just Tech: Centering Community-Driven Innovation at the Margins.” I’m Mary Gray, a Senior Principal Researcher at our New England lab in Cambridge, Massachusetts. I use my training as an anthropologist and communication media scholar to study people’s everyday uses of technology. In March 2020, I took all that I’d learned about app-driven services that deliver everything from groceries to telehealth to study how a coalition of community-based organizations in North Carolina might develop better tech to deliver the basic needs and health support to those hit hardest by the pandemic. Our research together, called Project Resolve, aims to create a new approach to community-driven innovation—one that brings computer science, engineering, the social sciences, and community expertise together to accelerate the roles that communities and technologies could play in boosting societal resilience. For this podcast, I’ll be talking with researchers, activists, and nonprofit leaders about the promises and challenges of what it means to build technology with rather than for society.

[MUSIC ENDS]

My guests today are Zachary Rowe, Joanna Velazquez, and Dr. Tawanna Dillahunt. Tawanna Dillahunt is an associate professor at the University of Michigan’s School of Information, working at the intersection of human-computer interaction; environmental, economic, and social sustainability; and equity. Joanna Velazquez is a campaign manager at Detroit Action, a union of Black and brown, low- and no-income, homeless and housing-insecure Detroiters fighting for housing and economic justice. And Zachary Rowe is the executive director of Friends of Parkside, a not-for-profit, community-based organization dedicated to working with residents and other stakeholders to better the community surrounding the Detroit public housing complex in which it’s located. Tawanna, Joanna, Zachary, welcome to the podcast.

Zachary Rowe: Why, thank you.

Tawanna Dillahunt: Thanks for having us, Mary. So glad to be here.

Joanna Velazquez: Yes. Thank you. Thank you.

Mary Gray: I’m glad you’re here. I’m glad you’re here. So, I want to start us off thinking about what you believe you’re involved in when you say you’re involved in community-based work. So I want us to start by really defining some terms and seeing the range of how we think about this work we call community-driven innovation, community engagement. I’d like to ask each of you to tell us a little bit about how you got involved in community-based work—broadly defined—not just the tech piece of it—but what brought you into community-based work? Let me start with Dr. Dillahunt.

Tawanna Dillahunt: Sure, Mary. Thanks so much for that question. Um, you know, when I think about this, I think about my upbringing in North Carolina, a very small town in North Carolina, so it was very community-focused, community-oriented, and my grandfather was a farmer. Him and his wife owned a country store in which, you know, I worked, so they were really serving the community, creating jobs in the community. My dad was a contractor, so built a lot of the homes in our neighborhood. My mom is a retired schoolteacher, and my sister, um, wrote grants in the community as a part of a public-housing community, and she kind of brought me into that work as well. So I feel like I was born and raised in two communities. It’s a part of my DNA.

Mary Gray: Mm, I love the “part of your DNA.” So, let me turn the question to you, Zachary Rowe. What got you involved in community-based work?

Zachary Rowe: You know, that’s a great question, and I was just sort of listening to Tawanna and how my upbringing also positioned me to be involved in community-based work. For me, growing up in public housing, one of the things that I realized early on is the perception of young folks who lived in public housing. Which, you know, a lot of times, 99.9 percent of the time, you know, folks had a negative perception of kids who lived in public housing. So, I remember, my friends and I, we were not like that perception. I’m not sure why I came up with this idea to change their perception, uh, but we started to do a lot of volunteer work in the community. One of the things that was happening in the community is that we had a lot of boarded-up units, in the neighborhood, and so, you know, we connected with an adult, and he bought the paint, and we painted all the boarded-up units a single color. And when you think about it, it doesn’t really make sense, but it made a major difference in the community. It was still boarded up, it was still paint, but it made a difference, you know, and it also sent a message that people cared. Um, then we started to do other things in the neighborhood, you know, started to have parties for kids and whatnot, and we even received an award from the city council. And so, for me, just, how I got started in community work had to do with changing the perception of young folks.

Mary Gray: Mmm. So, how about you, Joanna Velazquez?

Joanna Velazquez: Yes, this is a lovely question to kick us off. You know, similar to Tawanna, I feel like this is what we bonded on a little bit as we, like, got to know each other is, like, being born into community and just knowing how valuable, like, relationships are. My mom and my sisters and I moved to Detroit when I was 5 in October of 2000, and that was a really important moment in our life because, as a single mom, it was community that got us by. It was our pretend aunts and cousins that, you know, to outsiders it’s pretend, but to us it’s real. You build these beautiful spaces that are just full of love and joy, and it’s community that did that. You know, my grandma was in the Southwest area, and, like, everyone knew her. She was the neighborhood babysitter. So just, like, having these examples that community was super important is what followed me in life, and started volunteering at a very young age, kept it going, got me through college, and now I’m here, so—

Mary Gray: So, okay. I would love to turn to each of you and just hear, what’s a project you’re working on right now or a campaign that’s important to you that you’re most excited about sharing with listeners who are tuning into this program? So let me start with you, Zachary. Can you tell us a bit about what you’re working on that you want to bring to our listeners?

Zachary Rowe: One of the projects that I’m working on is, uh, what we’re calling our Community Tech Worker project. It’s loosely modeled after the Community Health Worker project, and I’m excited because one of the things that it does for me is that it gives me the opportunity to match my love of technology with my 9-to-5 job. In my other life I have a small computer consulting business, and so I always wanted to be able to connect the two, and so the Community Tech Worker project is allowing me to be able to share, you know, my passion for technology with residents. And also, it’s doing it on a level that makes sense, so we’re meeting them where they are. I’m excited.

Mary Gray: Can you say a little bit more about who you’re meeting and where they’re at when you’re meeting them?

Zachary Rowe: So, basically, I think in order to understand, you know, the Community Tech Worker project the way I envision it, it’s probably helpful just to maybe talk about the “who” we’re talking about. So, Friends of Parkside is a small, community-based organization located in Detroit in one of the public-housing sites, uh, in Detroit called the Villages At Parkside, and it was started by residents of the housing complex, and so the who that we’re talking about is public-housing residents. And when you talk about the digital divide or the lack of sort of digital skills, I mean, you’re talking about, you know, my community, and you’re probably talking about other communities across the country. And so, what the Community Tech Worker project will allow us to do is to be able to help residents develop basic computer skills so they can turn around and help other residents. Some people call it the “train-the-trainers model” or whatnot, but for us, it’s “reach-one-to-teach-one” kind of thing.

Mary Gray: And, Tawanna, can you just share a bit about what you’re working on and, uh, what the connections are to Zachary and Joanna?

Tawanna Dillahunt: I’m very much, uh, excited about the Community Tech Workers project for the same reason that Zachary mentioned, um, except I’m kind of a full-time professor, and I’m able to combine my passion of the community with, you know, my kind of full-time job, so I see, uh, the Community Tech Workers project as an opportunity to create a new position within a community that hopefully we can sustain over the long-term. Our team imagines that perhaps, you know, those Community Tech Workers who want to pursue a longer-term career in, let’s say, technology can train as a Community Tech Worker and then, you know, move onto maybe even, uh, jobs in IT, and then again, with the train-the-trainer model, have more tech workers who are embedded in the community, and so we’ve, you know, extended this project to, uh, support entrepreneurs, uh, so Professor Julie Hui and I are partnering with Detroit Neighborhood Entrepreneurs Project at, uh, Michigan and creating, again, capacity in the community—more tech workers to support small business owners who might need support with their businesses. I’ll add, uh, the work that, you know, we’ve done with Joanna and Detroit Action is really thinking about models and mechanisms to create opportunities for the community to imagine ways in which technology can support them. So imagining a future. It could be utopian future. It could be, you know, in our activity, we also did dystopian futures. And thinking about what are the community values, and what are the community strengths, and what are opportunities for technology to leverage both strengths and values to move toward the futures that the community imagines? So this is a way to bring the community’s perceptions into what technology can do, instead of kind of enforcing our technologist lens, you know, what we think might be nice. But it’s a way to bring the voices of the community in, uh, to our process.

Mary Gray: Joanna, would you tell us a little bit about work that you’re doing?

Joanna Velazquez: Yes. Yes, yes, yes. Actually, I want to pick it up a little bit, uh, from work between Tawanna and I, covering the Alternative Economy series. That five-week series was so incredible, and like Tawanna had said, it allowed folks to vision, and it allowed folks to imagine: what would they want if they could get their most perfect world where their needs were met and, you know, folks around them had what they needed as well? We created a space to meet folks where they’re at but also, like, “Let’s think together. Let’s imagine together.” And why that’s so important and how we did that was because it activated our members to tap into our Agenda for a New Economy, and so that’s the current work that we’ve got going on right now. I’m very excited about this campaign because it’s an entire platform that is aiming to address the root causes of poverty, the root causes of injustice, and really from a community-driven and community-organizing and civic-engagement point of view of how to get this agenda for a new economy forward. And it was because we had that visioning that we were able to continue to build with our members afterwards to allow them to guide this work, to develop this campaign, and then we launched it in December. And then, come this year, what’s really exciting is that this past Saturday, we actually just had a people’s forum, and part of the Agenda for a New Economy is getting reparations for folks who have been illegally foreclosed on due to overassessments here in the city on property taxes. And even those who are currently homeowners but still dealing with the overassessments in property taxes. We had over 700 community members call into a Zoom session this past Saturday to meet with the entire city council. These city council members were able to listen and hear directly from these impacted folks on their stories, on what they think is right, how they want compensation to look. Is it home repairs? Is it property tax credits? Is it creating systems to support families who have dealt with this crisis? You know, there’s emotional and mental trauma that is carried with this moving forward, and so it was so beautiful to see the community coming together. And so that is a part of the Agenda for a New Economy, these pieces that address the root causes, and so I’m excited to see how much more people power we can grow around this campaign to get wins that actually create change.

Mary Gray: Wow. So, I want to back up a second. Tell us a bit more about a recent collaboration where you felt technology was an important tool, but it was really the community organizing and the community engagement that was the magic of what you were doing. Let me start with you, Joanna.

Joanna Velazquez: Yeah, so, I will say, this entire pandemic experience, um, having to completely transition online, limited to only a few different times in which we were able to be in person. Like, technology has definitely shown up for us in a way that it’s allowed us to re-create our organizing infrastructure online, and still create places for folks to tap into, to help guide the work, to be directly involved with these campaigns—whether it’s to vision with us and spend time in our committee meetings. It’s allowed us to maintain our infrastructure, and I will say, like, that’s the biggest plus to it. And it’s even allowed us to tap into folks that maybe were only living online. Definitely a big learning lesson is, like, how do we continue to create online spaces? Digital organizing was a part of our work before, but it’s definitely become much more center to the way that we’re reaching folks and how we’re thinking about reaching folks and the intentionality that comes behind it. But I will say, the magic comes from the fact that when in those spaces, our folks are able to tap in, and so I will just say, like, technology’s biggest support has been about maintaining our infrastructure to keep meeting with folks, but it’s definitely within the meeting that the magic happens.

Mary Gray: Yeah. It’s almost, it feels like, you’re mainstreaming a way of using these tools for community action that maybe we didn’t see so deeply before. Um, Zachary, can I ask you a bit about, like, what’s a collaboration you’re involved in now that you really feel shows you the important role technology can play, but, really, its supporting role for the community organizing that you’re doing?

Zachary Rowe: Prior to sort of COVID or the COVID experience, we had limited use for technology only because, you know, our residents had limited technology. So technology really wasn’t a big component of what we do and how we do it kind of thing. We were sort of old-school, sort of the face-to-face meetings, phone calls, flyers, those kinds of things. But, when COVID hit, I mean that caused most all non-profits to have to sort of pivot and rethink the way that they sort of engaged community, and we were one of those. But I think for us it was harder because our infrastructure was not in place to actually do that, and probably even more importantly, our residents was not, you know, in a place where they sort of do that. So for us, you know, there was a lot of trying to take care of the basics. You know, do you have the Internet? Do you have a device? Do you know how to use a device? So, for us, it was a big learning curve in terms of the work, and don’t get me wrong. We’re not there yet. We’re not there yet. But we’re on the way, and you know, one of the things that Tawanna and I both talked about was the Community Tech Worker project, which came out of that, so I tell folks, “Never let a good crisis go to waste,” right? [LAUGHTER] And so within that COVID environment experience, I mean, we were able to sort of re-envision or re-imagine what this community can sort of be. Back in 2000, we actually envisioned a community where everyone had technology, everyone sort of was connected and using technology for work and for entertainment. We envisioned this—it just wasn’t possible. [LAUGHTER] The technology wasn’t there yet. And, also, I remember, you know, um, a year, year and a half ago, I actually emailed Tawanna sort of saying, “Hey, don’t you want to change the world?” And so, fortunately, she responded, and we’ve been working to at least change the world in Parkside. The magic for me is just working with residents to sort of see how they begin to realize that, yes, they can learn how to do this. Right? And sometimes it’s as simple as connecting to a Zoom meeting on their own without any help.

Mary Gray: Yeah. So, Tawanna, please share with us just what are some of these collaborations, and I can see, um, perhaps two of the co-conspirators that you work with, but maybe you want to share a bit more about what you’re working on these days that’s exciting to you.

Tawanna Dillahunt: Yeah, so definitely the most exciting projects, uh, you’ve heard about, um, from Zachary and Joanna. Um, other projects—there’s a collaboration with my, um, colleague, Professor Tiffany Veinot and a collaborator, uh, Patrick Shih at Indiana University Bloomington. I mentioned earlier that a lot of my work is around employment, and one barrier to employment is transportation. At least in Detroit, before COVID, transportation was a significant barrier. And, um, we began asking the question, you know, how are people overcoming the transportation barriers now, and how can technology amplify what it is that they’re doing already? And we thought of new models for transportation because I had done work where we onboarded people to Uber, and, um, technology was a barrier, right? They needed intermediaries to help them install the application and create a log-in account. Then some people didn’t have credit cards, right? And so, what are ways in which we can overcome those technological barriers? Again, we’re seeing this need for intermediaries. And Patrick Shih has done a lot of work with time banking, and we’ve seen how people are using time banks to share cars, share pickup trucks for moving, to, you know, get rides to the airport or to the grocery store or to healthcare appointments, or to work. So right now, we’re looking at how do we think about trust and reciprocity, and safety within a time-bank context to overcome transportation barriers? And looking at ways to update or build, you know, that, and, again, thinking about who the intermediaries might be in providing this type of support. So that’s another exciting project that I have going on.

Mary Gray: So definitely all of you innovate, you activate, you organize communities, and I’m just wondering if you could share with us what community innovation means to you. What does it look like on the ground to you? And let me maybe start with Tawanna.

Tawanna Dillahunt: Yeah, I think that’s a great question. And I think I can, start from Zachary’s, you know, introduction where he talked about being a kid and thinking about the perceptions of the kids who live in, uh, public housing, and they said, “Hey, we want to change this perception.” Innovation is painting the buildings. To me that’s innovation. Innovation is Zachary saying, “Hey, you want to change the world?” Right? Like, how do we go about building capacity in a community, right? How do we think through this Community Tech Workers, you know, concept? What does that look like, right? This is the community coming together with a challenge that they’re facing, bringing people together to work towards addressing that. No hierarchy, nothing, just sheer innovation, sheer problem-solving.

Mary Gray: I love that. I love that because I feel like you’re setting up for us that, you know, technology is really about creation, so what does it look like when people create together? So, Zachary, for you, could you just say a bit about, how do you define community innovation, especially when you’re explaining it to folks who maybe don’t see how technology would fit into that?

Zachary Rowe: So, I think, for me, just in terms of innovation, uh, one of the things that we’re always trying to do is solve problems, for the most part. Usually when you’re innovating, it’s because of something. You’re doing it for a reason. It’s not like you’re sitting there sort of saying, “Oh, well, I’m going to innovate today.” Okay, let me tell a story. Um, so, we had young—we had kids that was working with us for the summer. Every other day, they had to pass out flyers. And so they got tired of passing out flyers, and I said, “Well, if you guys can come up with a better way of getting the word out, I’m listening,” right? They came up with the idea of sending out text messages. I’m talking about 10 years ago, right? Now, the challenge with sending out text messages is that, you know, I really didn’t know a lot about sending out text messages, and also I was concerned about the cost, right? But they realized that they can use Gmail to send out text messages, because with Gmail, you use the phone number and the carrier, and it comes on your phone as a text message. For me, that was really innovative. They had a problem that they wanted to solve, which meant that they didn’t want to pass out the flyers, but they wanted to get the word out, and also there was this cost factor that they had to sort of think through, but that was really really creative, you know?

Mary Gray: I love that. And, Joanna, I wonder if you have some examples of just where you’ve seen folks innovate by really re-purposing the tools that are there, and where you see room for communities being able to set an agenda for what to do with technologies, how to re-purpose them to meet their needs.

Joanna Velazquez: It’s about, um, yeah, addressing a problem, right? Like, that’s where people get creative, is, like, something needs to happen. Every action has a reaction, right? [LAUGHTER] You know, this kind of happens a lot, but, like, really organically, right? Really organically, because, for me, it happens in a one-to-one where, like, I’m having a conversation with a member, and they’re talking to me about, you know, what’s their issue, what’s going on, you know, what’s—what’s really getting at you that you need it to change, and so our folks will share these stories, and then we’ll get to a point where it’s like, “Well, what do you want to do about it? How do we change it?” That is when we start talking about strategy. And so, I don’t know if that exactly is, like, re-purposing anything other than just, like, very critical thinking and, like, open conversation and dialogue with folks. So that to me is, like, how our folks really show and are active in, like, community innovation with the work, because it’s in a one-to-one where you are finding the real solutions to the problems—

Zachary Rowe: Mm-hmm.

Joanna Velazquez: —to the real problems that they’re actually facing.

Mary Gray: You’re bringing up for me how often, in computer science and engineering, the Holy Grail, the mantra is scale. Scale up, scale up. And what I hear you saying is, like, part of something being powerful and useful is also getting down to that nitty-gritty. It’s getting down to understanding, like, from, you know, one person at a time, the power of that change, and then you’ve got 700 people, like you were saying, showing up on a call.

Joanna Velazquez: Yeah.

Mary Gray: I mean that’s—I think that’s really powerful and an important intervention, maybe a course correction, for how we think about what success looks like when we’re engaging communities. I want to ask you all, and I wanted to direct this to Zachary and Tawanna, to maybe talk about the Community Tech Worker projects that you’re doing and the challenges—and also the opportunities—that you’re seeing coming out of that work. It strikes me as a good example of just that grappling with both how you scale but how you keep it real, where it’s meaningful scaling. So, if I could ask Zachary—would you tell us a bit about the Community Tech Worker project, and just set up for us what is it you’re trying to do? What are you aiming for? Where are there places where you’re hitting some hurdles and working through them?

Zachary Rowe: The Community Tech Worker project, for me, was an attempt to solve a problem. Um, earlier, I talked about the fact that during sort of the COVID pandemic, we realized that, you know, our residents didn’t have access to technology, and those who did have access to technology didn’t have the Internet. Uh, and if they did have the Internet, they didn’t have the skill. So the Community Tech Worker project was a way for us to begin to address those kinds of issues. One of the things that we realized is that the kind of skills that most people take for granted in terms of being able to use Zoom, being able to use email, uh, being able to upload documents. I mean, for the most part, some of us take those things for granted, but there was a whole community of folks that did not have those skills, right? There was even a subpopulation that really didn’t have those skills. I’m talking about our seniors. And so what the Community Tech Worker project allowed us to do is begin to identify folks from the neighborhood who were interested in learning how to be Community Tech Workers. Now, I’m sort of saying interested in being a Community Tech Worker because we were—we did not identify the tech-y folks or the geeky folks, whatever. We sort of said, “Hey, come as you are,” and, well, we learned some—we got some lessons behind that, too, but—

Mary Gray: Okay. [LAUGHTER] You need to say a few of those lessons.

Zachary Rowe: Well, you know. Well, “Come as you are,” meaning you may not know how to turn the computer on, right? So—

Mary Gray: Yep, yep. That’s real.

Zachary Rowe: Exactly. Part of our understanding is that, “Hey, do we want to have a minimum skill level?” Like, “Hey, you got to at least know how to turn it on.” Or are we still going to look at folks—even if you don’t know how to cut it on—we still welcome you. So, we still have to figure that one out, right? But I think for me, it was important that we didn’t, like I said, identify the geeky folks who already knew how to do it because, you know, sometimes just because you know how to do it, they may not know how to teach it. Folks who are learning how to use technology for the first time is more sympathetic and more patient and more understanding of others, right? So, basically, like I said, my thing is to make sure that, uh, we work with residents to develop those basic skills, and I love how Tawanna talked about the project because she talked about this larger vision in terms of, you know, building those advanced skills. Right now, I’m just focusing on the basic skills, you know? So it’s nice to have her there sort of saying, “Hey, you know, they can do more, they can do more, they can do more.”

Tawanna Dillahunt: Yeah, I think one thing we still need to work through is, do we want to call it Community Tech Workers? Because for some, “tech” might be exclusive, right? They might not identify with “tech.” And so, you know, there’s a question of, who do we miss? You know, in the beginning, who felt excluded just by the way we framed, you know, this opportunity? The team definitely talked about this, um, do you need to come in with a basic skillset? And just building on what Zachary said, you know, those who might not know how to turn on a computer, I mean, their strengths are—it’s the empathy, right? Because if you’re a “geek,” you might not be the best person to talk to people with patience. These are things that came out of our training, right? We need to know how to work with or speak with, you know, other community members and understand the questions that they have, and how do you identify what the problem might be. So, I mean, Zachary mentioned, you know, larger challenges. You know, I think good community work and collaborations—I mean, also as researchers, you know—when I think about collaborating with community partners, I think about sustainability, right? What happens if I’m no longer here? And even, you know, if the funding goes dry, what capacity did we build together, and how do we continue? You know, how do we continue on? So I’m thinking about, how do we sustain a role in the community? You know, maybe we call it Community Tech Workers. Maybe we call it, you know, um, Neighborhood Intermediaries. I’m not sure what we’ll call it. How does that role sustain itself? And, you know, think about funding long-term, thinking about opportunities. We’re collaborating with Community Health Workers, who, you know, need digital skills, too. I mean, arguably, we could, you know, maybe reach out to Ford Medical Center because telehealth is big. Some people are not sure how to log into tele-healthcare appointments. Or maybe online grocery delivery services would say, you know, “Maybe there’s a benefit if we had people who could support others in ordering.” If we had that, then maybe, you know, big business is always looking at revenue at the end of the day, so, like, how does this factor into there? What does building community digital capacity mean in the long-term, and how do we sustain these roles?

Mary Gray: I want to pick up that phrase you just put out there—community digital capacity. I actually want to really hold that up. I want to lift that up because community digital capacity, where I hear all of you talking about, that means boosting, lifting communities to do the work they’re doing. Like, I really hear that capacity building as this critical role that technologies could be playing that they haven’t really played yet. Like, we haven’t really given technologies a chance to, at least from the builder side, to fully be focused on how do we build communities’ capacity? So I’m saying this because one of the goals of the Project Resolve research that I’m doing right now that resonates with what I hear you all saying is: the goal is to think about how would you co-develop and support a coalition of community-based organizations, community healthcare workers, who have an idea of what their needs are, absolutely have an agenda, and they’re rarely ever given the chance to set that agenda when it comes to what tools are built for them to do their work and to own those tools and to fully use the data they collect as power—and that they can share with their communities. So, a big part of what we’re working on is thinking about the role of participatory action research, you know, community-based participatory design, all of these phrases we have that we throw around. I want to talk about what that looks like, because it’s—it’s really hard when you’re doing it right so—or trying to do it right. [LAUGHTER] So I would just love to hear you talk a bit about: what does that mean to you? What does that look like? Let me start with Joanna.

Joanna Velazquez: The project that Tawanna and I did together really speaks to the way that I think about participatory research. First things first, the thing that I feel folks get wrong in spaces that I’m in—with campaign strategy and all this stuff—is that people automatically want to go to, like, numbers and data-driven stuff and, ugh. But I just don’t understand how a conversation doesn’t bring much more—and I respect data, okay? Here’s the thing. I absolutely respect data. I don’t want to say that I don’t.

Mary Gray: [LAUGHTER] Respect the data.

Joanna Velazquez: I really do. But it’s within the lived experiences where the actual information is at. So when I think about participatory research and what that looks like in our work, it’s absolutely by creating visioning spaces. Like, that gives us so much data by, like, what do people even care about? Like, are we even kicking up a campaign that matters? But, you know, even outside of visioning is just simply asking, like, you know, “On this question of housing, like, does that actually feel like it would meet your needs?” You know, “What are your needs?” The conversation that develops that, you know, creates that qualitative data, I think, is, like, where the magic is at. And then take that to figure out what metrics can, you know, support that or show where the cracks are. You know, that paints this bigger picture when we go into advocacy mode. Participatory research really starts in the conversations, in the meeting spaces, in the lived experiences that people are sharing.

Mary Gray: Ooh. Ooh, I love that. I love that in so many ways. Let me ask the same question to you, um, let me start with Tawanna. Especially knowing how computer science and engineering and particularly, um, human-centered design, human-computer interaction strives to think about participation, participatory design as what we should aim for, what does it mean to you, and how does it get rough when you’re in the thick of it?

Tawanna Dillahunt: Yeah. Um, you know, in our field, when we talk about participatory design, I think there’s an inherent outcome or expectation that we’re going to have a product or tangible output, like a user interface or some application. When I think about community-based participatory research, which comes out of the public health field, we’re thinking about the community, we’re equitable partners in the research, and we’re not really engaging unless there is a common goal, right? When I engage, you know, with the community, you know, I’m interested in creating jobs, interested in employment. Are there other organizations that are interested in, you know, new economies, new digital economies, or anyone else, who cares about, you know, access to healthy food or transportation? And you’re partnering because you have the same North Star, right?

Mary Gray: Mm-hmm.

Tawanna Dillahunt: And in this partnership, you know, you figure out, “Okay, here’s the general direction.” You might not have the exact—like, researchers come in with research questions, you know, and—

Mary Gray: [LAUGHTER] Yes.

Tawanna Dillahunt: —then you can say, “Well, yeah, if you address this research question, that’ll definitely be beneficial. It’ll help us, you know, understand these—these other things that we’re trying to get to,” but that’s not necessarily our core. We like it, but it might not be our core. And then when you’re engaging in community-based participatory research, it is a long-term process, right? You’re planning ahead. As a researcher, you have to address the research questions. We need to think about how this—how we can leverage these insights maybe to inform, you know, technology, but it’s not necessarily the outcome. Maybe we’re exploring existing technologies and exploring it in the context of a time bank, right? What changes need to be made to a time bank in order to address the transportation needs of transportation-insecure communities, rural communities, that kind of thing? And so, that’s what, you know, community-based participatory research means to me, which is a little bit different from user-centered design and participatory design because you’re really going in with a technology-first approach.

Mary Gray: Yeah. No, and I feel like we’ve been discovering in our work, really, the first grant is about building trust, because there’s no reason anybody should trust anybody from coming outside of their communities, especially if they’re at all at the margins. And if we’re coming from a university and we don’t lead with, “How can I help you?” first, it understandably can create even more barriers. So yeah, I don’t think we give ourselves enough room to say, “The first stretch of time is let’s get to know each other and give you a reason to participate in anything I’m bringing.” So I want to ask Zachary—could you just tell us about the Detroit Urban Research Center and your definition of community-based participatory action research?

Zachary Rowe: Yeah. So, the Detroit UR—well, we call it the Detroit URC—um, for short. Um, so, basically, the Detroit URC, uh, started back in 1995, and in a nutshell, the URC focuses on fostering health equity through community-based participatory research. Years ago, I didn’t really see the point of research or data, really. It is not that it wasn’t important. It was just how it was introduced to the community. Uh, and so, we were introduced to research by sort of the traditional research approach where you had researchers coming to the community, pretty much have their way or do whatever they wanted, and leave, right? They rarely shared the data. They rarely, you know, asked us any questions. They rarely involved the community. So, basically, they would come in with their survey, with their questions, get their answers, and leave. We wouldn’t hear from them again until the next project, right? And so, to be honest, we were pretty much soured on the whole idea of research for years until, you know, folks from the University of Michigan School of Public Health, you know, came to Detroit talking to community groups about this thing called CBPR. Uh, we’d never heard of it before, but we was intrigued by the fact that the whole idea behind CBPR is that the community partners are equal partners in the research, from developing the initial research question to disseminating the results, and everything in between. And so this was a different way of doing research that really appealed to community partners. You know, it definitely appealed to us, uh, because we were at the table—sometimes agreeing and sometimes disagreeing with some of the research stuff, but that was okay, though, because we were all equal partners. Um, you know, I value research now, but I value CBPR research more than others, though, just because we’re at the table, right?

Mary Gray: Hmm. Tawanna, did you want to jump in?

Tawanna Dillahunt: No, I totally agree. I remember, um, sharing with my class, uh, last year Zachary’s video [LAUGHTER] on, uh, community-based participatory research where he—I think we’re at a potluck together, and—you know, you bring your own dish, and—and everybody else brings their dishes, and we can enjoy a meal together. If you don’t like the greens, you know, you can stay away from the greens, you know, but we’re all eating here. Like, I thought Zachary was going to go there. I love that analogy.

Zachary Rowe: Okay. Well, thank you.

Mary Gray: We’ll just have to make sure that link’s available. [LAUGHTER] I think that would be a great thing to put on the podcast. And I want to bring up what I feel like we have to talk about, and I was going to ask Joanna if you would maybe lead us off in thinking about how power differentials factor into this work. For example, I’m a white woman working with a group of predominantly Black and brown community members, many people undocumented, all of them doing community health. My biggest connection is my mom was a nurse, so I understand some of that world, but I would love to talk about how we strive for that sense of equality. We’re also navigating power differentials that come from our institutions. So, maybe if, yeah, you’d want to speak to that.

Joanna Velazquez: One of the values that we hold that I’ve been trained on is just, like, the people closest to the issue know their solution. The people furthest from it, you know, can theorize and get all philosophical, but it’s not coming from a lived experience. So that shows up a lot in conversations where, uh, you know, we’re trying to all get alignment, build coalition, build power, and, like, people operate differently, and people haven’t done the same type of, you know, conscious thinking or unpacking of their own internalized white supremacy or capitalism or patriarchy. Um, Detroit Action is an anti-capitalist organization, and so that comes up a lot in our work, in our strategy, in the way that we’re building with folks, because we’re all at different levels from our own perspectives. But it’s really important to hold onto the value, right, of, like, those closest to the issue know the solution, because if we stay there then it makes ego getting checked at the door just a little bit easier because we’re grounded on that same value. And so I would say, like, this comes up a lot in so many different ways, but for me, as I do my work, like I said, it has to go back to that one-to-one for me because my members are working class. My members don’t have the technological access to these meetings. They can’t always tap in really quick. And so in these one-on-ones, it’s where I can utilize their time to our best agreement, really, on, like, how to move this work forward, and it’s where their stories can guide the work, and that’s where I can build trust with them, because I work in the largest Black city of America. Like, I’m not a Black person. I cannot speak for the Black community, but what I can do is utilize my time to talk with all my members to know that their stories are guiding this work. And so that’s what I do, and that’s what I have to do, and create the meeting spaces where they can continue to guide the work, whether it’s visioning, whether it’s the committee space to make the decision, whether it’s the one-to-one because we just need to talk and I need to get your input on how this supposed to go. You know, and it comes down to that. For me, it comes down to that. That’s how we address, like, this power stuff, but it comes up in so many different ways. Um, the amount of racial scapegoating that we have to experience as a Black and brown city from our elected officials or the media for painting narratives that it’s on us to turn out for the results of some type of election, X, Y, Z. It comes up in so many different ways. We’re constantly battling it, but it’s our—I think it’s our values that keep us at least principled in our struggle, because we are going to struggle. We are going to mess up. We do need the feedback. We do need to be able to manage up, horizontally, whatever the case is. Membership is included in that. Like, it’s not just staff. So, you know, being able to at least create the safe spaces to be uncomfortable is the thing in which we are able to, like, address power dynamics in these relationships and systems.

Mary Gray: Okay, just a quick follow-up. And I’ll direct it at Tawanna and Zachary, just to be able to build on what Joanna is saying here. Where have you seen in your work this effort of putting folks closest to the problem, who have their solutions, in the driver’s seat for taking on the technology piece of that, for being able to build something that supports the solutions they already have?

Tawanna Dillahunt: Yeah. I mean, I think it goes back to co-designing. And this is kind of like once you figure out what the technology is, you’ve come up with this “solution” together, then I think that’s when the developer can step in, and it’s a matter of co-designing. It’s that agile approach where, “Okay, here’s how I understand it. Let me create this,” or “Let me conceptualize it in a prototype way, and, you know, this back-and-forth communication. Is this what we’re seeing?” This is, you know, some example of our past work when creating DreamGigs with job seekers and having the job seekers see, “Oh, yeah, this is exactly what I need. And, oh, by the way, if there’s a way to connect this, can you tell us how we can access, you know, volunteer work so that we can build our skills? That would be amazing.” Right? And so, we’re building it together, you know, the co-designing and co-development, and they might not be programming, but they’re looking at the output and talking to the developer, or at least seeing the output, the outcome of the developer, and say, “Yes, this is what I was asking for,” or, “No, no, no, no, this is not what I was asking for.” But it takes a lot of work up front to get to that point, I think.

Mary Gray: So, how should researchers compensate—like, really recognize the value that community members are putting in? Like, what is a way to really, genuinely honor and compensate the contributions community members are making to development? Let me ask that of you, Zachary. Like, what’s the best way to show up?

Zachary Rowe: Well, uh, for me, I would say the best way is to ask. You know, I mean, for some, it may just be monetary. They may just want cash, or they just may want credit. I would just ask the community, “How you want your contribution to be recognized,” and be willing to do it, you know? I just want to go back to a question you asked earlier, um, power. And one of the things that I’ve learned over time is to understand the power you do have and use it, right? One thing that all research projects have in mind is the need for data, and if they’re collecting the data from the community, then that’s your power, because community folks can say, “No, we don’t want to participate.” Right? So, you know, I know that sounds kind of simplistic, but it works. [LAUGHTER]

Mary Gray: Yeah.

Zachary Rowe: And so, once you understand where your power is and you use it, then it begins to have an impact. Then also, one of the other things that I realized is, our researchers that we work with are wonderful. Tawanna is wonderful, right? But it’s not Tawanna that’s the problem. Sometimes it’s the university infrastructure, right? It’s the county department. Maybe it’s, you know, maybe it’s the IRB. I mean, there are others that really don’t get and don’t understand why community partners are a part of the research team or why they’re on the project.

Mary Gray: So, I want to ask you, what are some future projects you’re most excited about heading into 2022? What is keeping you excited about pushing forward? Let me start with Tawanna.

Tawanna Dillahunt: Yeah. Definitely the Community Tech Workers work, and I have a student, uh, Alex Lu, who’s working on understanding residents’ perceptions of safety alongside Project Greenlight in Detroit, and so he’s going to take a photo-voice approach as a way to capture community narratives of safety and kind of exhibit these photos once we’re there, and he’s also, um, extending this to video voice, which might be a little bit more complex, but there’s a methodological understanding of how video voice might work in a community context, given that we can take videos over our phone.

Mary Gray: Wow. And how about you, Zachary? What are you excited about for 2022?

Zachary Rowe: Uh, definitely we’re excited about making sure that residents of Parkside develop those basic skills to be able to navigate the online world, right? Also, I’m excited about another project I’m working on called Deciders, whereby we’re developing an online tool that allows communities to set their own priorities.

Mary Gray: Joanna, what’s coming up in 2022?

Joanna Velazquez: 2022 is a big year. It’s a big, big year. It’s a midterm year, midterm election, so, um, maybe not necessarily excited about election season, but I’m excited to see how our members tap in and weigh in and, like Zachary said, power is simply just acting, and so how are we going to use this moment to seize our power? What are the actions we’re going to take to drive our Agenda for a New Economy forward, but also to defend Black voters? We’re a part of a coalition to defend the Black vote in Michigan. It is definitely under attack, and it’s unfortunate, but corporate actors are involved, and so we’re asking them to no longer fund these folks, um, that are putting these 39 voter suppression bills forward in the state of Michigan, which is so unfortunate, and now trying to sidestep the governor with a voter-suppression ballot initiative called Secure MI Vote. Um, “Suppress MI Vote” is what we rename it, but, yeah, there’s a lot of things that we’re tapped into this year, but definitely excited for how our members show up in this election.

Zachary Rowe: Show up and show out, right?

Joanna Velazquez: Show up and show out, you’ve got it.

Mary Gray: Thank you. Thank you, thank you.

[MUSIC STARTS OVER DIALOGUE]

Okay, I’m going to just take a second to thank the three of you for joining me today, and I want more. I hope we get to have another conversation, but thanks for sharing your work with us.

Zachary Rowe: Thank you.

Tawanna Dillahunt: Thank you.

Joanna Velazquez: Thank you for having us. Thank you.

Mary Gray: And thanks to our listeners for tuning in. If you’d like to learn more about community-driven innovation, check out the other episodes in our “Just Tech” series. Also, be sure to subscribe for new episodes of the Microsoft Research Podcast wherever you listen to your favorite shows.


Powering the next generation of trustworthy AI in a confidential cloud using NVIDIA GPUs

Animation showing the process of how encrypted data is transferred between the GPU driver and the GPU through a secure channel. The GPU driver on the host CPU and the SEC2 microcontroller on the NVIDIA A100 Tensor Core GPU work together to achieve end-to-end encryption of data transfers

Cloud computing is powering a new age of data and AI by democratizing access to scalable compute, storage, and networking infrastructure and services. Thanks to the cloud, organizations can now collect data at an unprecedented scale and use it to train complex models and generate insights.  

While this increasing demand for data has unlocked new possibilities, it also raises concerns about privacy and security, especially in regulated industries such as government, finance, and healthcare. One area where data privacy is crucial is patient records, which are used to train models to aid clinicians in diagnosis. Another example is in banking, where models that evaluate borrower creditworthiness are built from increasingly rich datasets, such as bank statements, tax returns, and even social media profiles. This data contains very personal information, and to ensure that it’s kept private, governments and regulatory bodies are implementing strong privacy laws and regulations to govern the use and sharing of data for AI, such as the General Data Protection Regulation (GDPR) and the proposed EU AI Act. You can learn more about some of the industries where it’s imperative to protect sensitive data in this Microsoft Azure Blog post.

Commitment to a confidential cloud

Microsoft recognizes that trustworthy AI requires a trustworthy cloud—one in which security, privacy, and transparency are built into its core. A key component of this vision is confidential computing—a set of hardware and software capabilities that give data owners technical and verifiable control over how their data is shared and used. Confidential computing relies on a new hardware abstraction called trusted execution environments (TEEs). In TEEs, data remains encrypted not just at rest or during transit, but also during use. TEEs also support remote attestation, which enables data owners to remotely verify the configuration of the hardware and firmware supporting a TEE and grant specific algorithms access to their data.  
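
To make the remote-attestation idea more concrete, here is a minimal sketch of how a data owner might gate the release of a data-wrapping key on a verified TEE report. It is illustrative only: the `AttestationReport` shape, the `EXPECTED_MEASUREMENTS` allow-list, and the `release_wrapping_key` helper are hypothetical stand-ins, not the Azure Attestation API.

```python
# Illustrative sketch: a data owner releases a key to a workload in a TEE only if
# the TEE's attestation checks out. Real deployments rely on a hardware-specific
# quote format and an attestation service rather than this simplified structure.
from dataclasses import dataclass

# Hypothetical allow-list of known-good measurements (digests of code/VM images
# the data owner has reviewed and approved).
EXPECTED_MEASUREMENTS = {
    "approved-training-image-v1-digest",
    "approved-training-image-v2-digest",
}

@dataclass
class AttestationReport:
    measurement: str       # digest of the code and firmware loaded in the TEE
    signature_valid: bool  # result of verifying the hardware vendor's signature chain

def release_wrapping_key(report: AttestationReport, wrapping_key: bytes) -> bytes | None:
    """Return the data-wrapping key only when the report is trustworthy."""
    if not report.signature_valid:
        return None  # report is not signed by trusted hardware
    if report.measurement not in EXPECTED_MEASUREMENTS:
        return None  # unknown or unapproved algorithm/workload
    return wrapping_key
```

In practice, the signature-chain verification and richer policy checks (firmware versions, debug flags, and so on) are what an attestation service evaluates before any key is released.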

At Microsoft, we are committed to providing a confidential cloud, where confidential computing is the default for all cloud services. Today, Azure offers a rich confidential computing platform comprising different kinds of confidential computing hardware (Intel SGX, AMD SEV-SNP), core confidential computing services like Azure Attestation and Azure Key Vault managed HSM, and application-level services such as Azure SQL Always Encrypted, Azure confidential ledger, and confidential containers on Azure. However, these offerings are limited to using CPUs. This poses a challenge for AI workloads, which rely heavily on AI accelerators like GPUs to provide the performance needed to process large amounts of data and train complex models.  

The Confidential Computing group at Microsoft Research identified this problem and defined a vision for confidential AI powered by confidential GPUs, proposed in two papers, “Oblivious Multi-Party Machine Learning on Trusted Processors” and “Graviton: Trusted Execution Environments on GPUs.” In this post, we share this vision. We also take a deep dive into the NVIDIA GPU technology that’s helping us realize this vision, and we discuss the collaboration among NVIDIA, Microsoft Research, and Azure that enabled NVIDIA GPUs to become a part of the Azure confidential computing ecosystem.

Vision for confidential GPUs

Today, CPUs from companies like Intel and AMD allow the creation of TEEs, which can isolate a process or an entire guest virtual machine (VM), effectively removing the host operating system and the hypervisor from the trust boundary. Our vision is to extend this trust boundary to GPUs, allowing code running in the CPU TEE to securely offload computation and data to GPUs.  

Diagram showing the trust boundary extended from the host trusted execution environment of the CPU to the trusted execution environment of the GPU through a secure channel.
Figure 1: Vision for confidential computing with NVIDIA GPUs.

Unfortunately, extending the trust boundary is not straightforward. On the one hand, we must protect against a variety of attacks, such as man-in-the-middle attacks, where the attacker can observe or tamper with traffic on the PCIe bus or on an NVIDIA NVLink connecting multiple GPUs, as well as impersonation attacks, where the host assigns the guest VM an incorrectly configured GPU, a GPU running outdated or malicious firmware, or one without confidential computing support. At the same time, we must ensure that the Azure host operating system has enough control over the GPU to perform administrative tasks. Furthermore, the added protection must not introduce large performance overheads, increase thermal design power, or require significant changes to the GPU microarchitecture.  

Our research shows that this vision can be realized by extending the GPU with the following capabilities:

  • A new mode where all sensitive state on the GPU, including GPU memory, is isolated from the host
  • A hardware root-of-trust on the GPU chip that can generate verifiable attestations capturing all security-sensitive state of the GPU, including all firmware and microcode 
  • Extensions to the GPU driver to verify GPU attestations, set up a secure communication channel with the GPU, and transparently encrypt all communications between the CPU and GPU 
  • Hardware support to transparently encrypt all GPU-GPU communications over NVLink  
  • Support in the guest operating system and hypervisor to securely attach GPUs to a CPU TEE, even if the contents of the CPU TEE are encrypted
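To make the flow concrete, here is a minimal, purely illustrative sketch of how a driver running inside a CPU TEE might gate offload on attestation. All type names, fields, and helper functions are hypothetical stand-ins, not NVIDIA's or Azure's actual interfaces.

```python
# Illustrative sketch only: a simplified model of how a CPU-TEE-resident driver
# might gate GPU offload on attestation. All names here are hypothetical.

from dataclasses import dataclass

@dataclass
class GpuAttestation:
    confidential_mode: bool          # GPU booted in its isolated, confidential mode
    firmware_measurements: dict      # hashes of firmware/microcode captured by the HRoT
    signature_valid: bool            # report signed by a key endorsed by the device identity

# Placeholder reference values a verifier would pin to known-good firmware.
KNOWN_GOOD_FIRMWARE = {"sec2": "sha384:<digest>", "microcode": "sha384:<digest>"}

def verify_attestation(report: GpuAttestation) -> bool:
    """Accept the GPU only if it is in confidential mode and runs known-good firmware."""
    return (
        report.signature_valid
        and report.confidential_mode
        and report.firmware_measurements == KNOWN_GOOD_FIRMWARE
    )

def offload_to_gpu(report: GpuAttestation, establish_secure_channel, submit_encrypted_work):
    # Refuse to share data with a GPU that fails verification (impersonation, stale or
    # malicious firmware, or missing confidential computing support).
    if not verify_attestation(report):
        raise RuntimeError("GPU attestation failed; refusing to offload confidential data")
    session_key = establish_secure_channel()    # e.g., an SPDM-backed key exchange
    return submit_encrypted_work(session_key)   # all CPU-GPU traffic encrypted under the key
```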

Confidential computing with NVIDIA A100 Tensor Core GPUs

NVIDIA and Azure have taken a significant step toward realizing this vision with a new feature called Ampere Protected Memory (APM) in the NVIDIA A100 Tensor Core GPUs. In this section, we describe how APM supports confidential computing within the A100 GPU to achieve end-to-end data confidentiality.  

APM introduces a new confidential mode of execution in the A100 GPU. When the GPU is initialized in this mode, the GPU designates a region in high-bandwidth memory (HBM) as protected and helps prevent leaks through memory-mapped I/O (MMIO) access into this region from the host and peer GPUs. Only authenticated and encrypted traffic is permitted to and from the region.  

In confidential mode, the GPU can be paired with any external entity, such as a TEE on the host CPU. To enable this pairing, the GPU includes a hardware root-of-trust (HRoT). NVIDIA provisions the HRoT with a unique identity and a corresponding certificate created during manufacturing. The HRoT also implements authenticated and measured boot by measuring the firmware of the GPU as well as that of other microcontrollers on the GPU, including a security microcontroller called SEC2. SEC2, in turn, can generate attestation reports that include these measurements and that are signed by a fresh attestation key, which is endorsed by the unique device key. These reports can be used by any external entity to verify that the GPU is in confidential mode and running the last known good firmware.

When the NVIDIA GPU driver in the CPU TEE loads, it checks whether the GPU is in confidential mode. If so, the driver requests an attestation report and checks that the GPU is a genuine NVIDIA GPU running known good firmware. Once confirmed, the driver establishes a secure channel with the SEC2 microcontroller on the GPU, using a Diffie-Hellman key exchange backed by the Security Protocol and Data Model (SPDM) to derive a fresh session key. When that exchange completes, both the GPU driver and SEC2 hold the same symmetric session key.

The GPU driver uses the shared session key to encrypt all subsequent data transfers to and from the GPU. Because pages allocated to the CPU TEE are encrypted in memory and not readable by the GPU DMA engines, the GPU driver allocates pages outside the CPU TEE and writes encrypted data to those pages. On the GPU side, the SEC2 microcontroller is responsible for decrypting the encrypted data transferred from the CPU and copying it to the protected region. Once the decrypted data is in HBM, the GPU kernels can freely use it for computation.
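The following sketch models this "encrypt, stage outside the TEE, decrypt on the GPU" pattern using AES-GCM from the Python cryptography package. It is an analogy for the data path described above, not the actual driver or SEC2 code; the real cipher choice, nonce management, and key derivation details belong to NVIDIA's implementation.

```python
# Illustrative sketch: modeling the driver-side "stage ciphertext outside the TEE"
# pattern with AES-GCM from the `cryptography` package.

import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

session_key = AESGCM.generate_key(bit_length=256)   # stands in for the SPDM-derived session key
aead = AESGCM(session_key)

def stage_for_gpu(payload: bytes) -> tuple:
    """Encrypt data before writing it to pages outside the CPU TEE (readable by GPU DMA)."""
    nonce = os.urandom(12)
    return nonce, aead.encrypt(nonce, payload, None)

def sec2_decrypt(nonce: bytes, ciphertext: bytes) -> bytes:
    """Model of SEC2 decrypting into the protected HBM region, where kernels see cleartext."""
    return aead.decrypt(nonce, ciphertext, None)

nonce, staged = stage_for_gpu(b"model weights or an input batch")
assert sec2_decrypt(nonce, staged) == b"model weights or an input batch"
```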

Diagram showing how the GPU driver on the host CPU and the SEC2 microcontroller on the NVIDIA Ampere GPU work together to achieve end-to-end encryption of data transfers.
Figure 2: The GPU driver on the host CPU and the SEC2 microcontroller on the NVIDIA A100 Tensor Core GPU work together to achieve end-to-end encryption of data transfers.

Accelerating innovation with confidential AI

The implementation of APM is an important milestone toward achieving broader adoption of confidential AI in the cloud and beyond. APM is the foundational building block of Azure Confidential GPU VMs, now in private preview. These VMs, designed jointly by Azure, NVIDIA, and Microsoft Research, feature up to four A100 GPUs with 80 GB of HBM and APM technology, and they enable users to host AI workloads on Azure with a new level of security.

But this is just the beginning. We look forward to taking our collaboration with NVIDIA to the next level with NVIDIA’s Hopper architecture, which will enable customers to protect both the confidentiality and integrity of data and AI models in use. We believe that confidential GPUs can enable a confidential AI platform where multiple organizations can collaborate to train and deploy AI models by pooling together sensitive datasets while remaining in full control of their data and models. Such a platform can unlock the value of large amounts of data while preserving data privacy, giving organizations the opportunity to drive innovation.  

A real-world example involves Bosch Research, the research and advanced engineering division of Bosch, which is developing an AI pipeline to train models for autonomous driving. Much of the data it uses includes personally identifiable information (PII), such as license plate numbers and people’s faces. At the same time, it must comply with GDPR, which requires a legal basis for processing PII, namely, consent from data subjects or legitimate interest. The former is challenging because it is practically impossible to get consent from pedestrians and drivers recorded by test cars. Relying on legitimate interest is challenging too because, among other things, it requires showing that there is no less privacy-intrusive way of achieving the same result. This is where confidential AI shines: Using confidential computing can help reduce risks for data subjects and data controllers by limiting exposure of data (for example, to specific algorithms), while enabling organizations to train more accurate models.

At Microsoft Research, we are committed to working with the confidential computing ecosystem, including collaborators like NVIDIA and Bosch Research, to further strengthen security, enable seamless training and deployment of confidential AI models, and help power the next generation of technology.

About confidential computing at Microsoft Research  

The Confidential Computing team at Microsoft Research Cambridge conducts pioneering research in system design that aims to guarantee strong security and privacy properties to cloud users. We tackle problems around secure hardware design, cryptographic and security protocols, side channel resilience, and memory safety. We are also interested in new technologies and applications that security and privacy can uncover, such as blockchains and multiparty machine learning. Please visit our careers page to learn about opportunities for both researchers and engineers. We’re hiring.


Microsoft Translator enhanced with Z-code Mixture of Experts models

Z-code multilingual model representation diagram

Translator, a Microsoft Azure Cognitive Service, is adopting Z-code Mixture of Experts models, a breakthrough AI technology that significantly improves the quality of production translation models. As a component of Microsoft’s larger XYZ-code initiative to combine AI models for text, vision, audio, and language, Z-code supports the creation of AI systems that can speak, see, hear, and understand. This effort is a part of Azure AI and Project Turing, focusing on building multilingual, large-scale language models that support various production teams. Translator is using NVIDIA GPUs and Triton Inference Server to deploy and scale these models efficiently for high-performance inference. Translator is the first machine translation provider to introduce this technology live for customers.

Z-code MoE boosts efficiency and quality

Z-code models utilize a new architecture called Mixture of Experts (MoE), where different parts of the model can learn different tasks. The models learn to translate between multiple languages at the same time. The Z-code MoE model utilizes more parameters while dynamically selecting which parameters to use for a given input. This enables the model to specialize a subset of the parameters (experts) during training. At runtime, the model uses the relevant experts for the task, which is more computationally efficient than utilizing all of the model’s parameters.
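As a rough illustration of sparse expert routing, the toy PyTorch layer below gates each token to its top-k experts and mixes their outputs. It is only a sketch of the general MoE idea; the production Z-code architecture adds load balancing, expert parallelism, and many other details not shown here.

```python
# Toy top-k gated Mixture-of-Experts layer in PyTorch: each token is routed to a small
# subset of experts, so only a fraction of the parameters is used per input.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)   # learned router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (tokens, d_model)
        top_vals, top_idx = self.gate(x).topk(self.k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)                # mixing weights over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e                 # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
print(TopKMoE(d_model=512, d_ff=2048)(tokens).shape)         # torch.Size([16, 512])
```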

animated graphic showing Z-code MoE model translating from English to French
Figure 1: Z-code MoE model translating from English to French. The model dynamically selects subsets of its parameters to be utilized for each input.

Newly introduced Z-code MoE models leverage transfer learning, which enables efficient knowledge sharing across similar languages. Moreover, the models utilize both parallel and monolingual data during the training process. This opens the way to high-quality machine translation beyond the high-resource languages and improves translation quality for low-resource languages that lack significant training data. This approach can have a positive impact on AI fairness, since both high-resource and low-resource languages see improvements.

For research purposes, we have trained translation systems with 200 billion parameters supporting 100 language pairs. While such large systems significantly improved translation quality, they also introduced challenges for deploying them cost-effectively in a production environment. For our production deployment, we opted to train a set of 5-billion-parameter models, which are 80 times larger than our currently deployed models. We trained a multilingual model per set of languages, where each model can serve up to 20 language pairs and therefore replace up to 20 of the current systems. This enabled our models to maximize transfer learning among languages while remaining deployable at an effective runtime cost. We compared the quality improvements of the new MoE models to the current production system using human evaluation. The figure below shows the results of the models on various language pairs. The Z-code MoE systems outperformed individual bilingual systems, with average improvements of 4 percent. For instance, the models improved English to French translations by 3.2 percent, English to Turkish by 5.8 percent, Japanese to English by 7.6 percent, English to Arabic by 9.3 percent, and English to Slovenian by 15 percent.

graphic showing quality gains of Z-code MoE models over existing models. Languages are ordered by training data sizes.
Figure 2: Quality gains of Z-code MoE models over existing models. Languages are ordered by training data sizes.

Training large models with billions of parameters is challenging. The Translator team collaborated with Microsoft DeepSpeed to develop a high-performance system that helped train massive-scale Z-code MoE models, enabling us to efficiently scale and deploy Z-code models for translation.

We partnered with NVIDIA to build optimized inference engines for deploying the new Z-code MoE models on GPUs at runtime. NVIDIA developed custom CUDA kernels and leveraged the CUTLASS and FasterTransformer libraries to efficiently implement MoE layers on a single V100 GPU. This implementation achieved up to 27x throughput improvements over standard GPU (PyTorch) runtimes. We used NVIDIA’s open-source Triton Inference Server to serve Z-code MoE models, relying on Triton’s dynamic batching feature to pool several requests into a single large batch for higher throughput, which enabled us to ship large models with relatively low runtime costs.
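For readers curious what serving such a model through Triton looks like from the client side, here is a hedged sketch using the official tritonclient Python package. The model name, tensor names, and data types are hypothetical placeholders rather than the actual production deployment; dynamic batching itself is configured on the server.

```python
# Hypothetical client-side request to a Triton-served translation model; model and
# tensor names are placeholders. Dynamic batching is enabled in the model's server-side
# configuration, so many such requests can be pooled into one large batch.

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

text = np.array([["Hello, how are you?"]], dtype=object)          # one request among many
inp = httpclient.InferInput("SOURCE_TEXT", text.shape, "BYTES")
inp.set_data_from_numpy(text)

out = httpclient.InferRequestedOutput("TRANSLATED_TEXT")
result = client.infer(model_name="zcode_moe_en_fr", inputs=[inp], outputs=[out])
print(result.as_numpy("TRANSLATED_TEXT"))
```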

How can you use the new Z-code models?

Z-code models are available now by invitation to customers using Document Translation, a feature that translates entire documents, or volumes of documents, in a variety of file formats while preserving their original formatting. Z-code models will be made available to all customers and to other Translator products in phases. Please fill out this form to request access to Document Translation using Z-code models.
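As a reference point, a minimal request to the Document Translation service with the Azure Python SDK looks roughly like the following; the endpoint, key, and container SAS URLs are placeholders you supply, and access to Z-code models remains by invitation as noted above.

```python
# Sketch of a batch Document Translation request with the azure-ai-translation-document
# client library; endpoint, key, and container SAS URLs are placeholders.

from azure.core.credentials import AzureKeyCredential
from azure.ai.translation.document import DocumentTranslationClient

client = DocumentTranslationClient(
    "https://<your-resource>.cognitiveservices.azure.com/", AzureKeyCredential("<key>")
)

# Translate every document in the source container to French, preserving formatting.
poller = client.begin_translation(
    "<source-container-sas-url>", "<target-container-sas-url>", "fr"
)
for document in poller.result():
    print(document.status, document.translated_document_url)
```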


Acknowledgements

The following people contributed to this work: Abdelrahman Abouelenin, Ahmed Salah, Akiko Eriguchi, Alex Cheng, Alex Muzio, Amr Hendy, Arul Menezes, Brad Ballinger, Christophe Poulain, Evram Narouz, Fai Sigalov, Hany Hassan Awadalla, Hitokazu Matsushita, Mohamed Afify, Raffy Bekhit, Rohit Jain, Steven Nguyen, Vikas Raunak, Vishal Chowdhary, and Young Jin Kim.


Microsoft has demonstrated the underlying physics required to create a new kind of qubit

Photo of a quantum computer close-up

Quantum computing promises to help us solve some of humanity’s greatest challenges. Yet as an industry, we are still in the early days of discovering what’s possible. Today’s quantum computers are enabling researchers to do interesting work. However, these researchers often find themselves limited by the inadequate scale of these systems and are eager to do more. These machines are based on a variety of qubit types, but none so far have been able to scale to enough qubits to fully realize the promise of quantum computing.

Microsoft is taking a more challenging, but ultimately more promising approach to scaled quantum computing with topological qubits that are theorized to be inherently more stable than qubits produced with existing methods, without sacrificing size or speed. We have discovered that we can produce the topological superconducting phase and its concomitant Majorana zero modes, clearing a significant hurdle toward building a scaled quantum machine. The explanation of our work and methods below shows that the underlying physics behind a topological qubit is sound: the observation of a 30 μeV topological gap is a first, and it lays the groundwork for the potential future of topological quantum computing. While engineering challenges remain, this discovery proves out a fundamental building block for our approach to a scaled quantum computer and puts Microsoft on the path to deliver a quantum machine in Azure that will help solve some of the world’s toughest problems.

Dr. Chetan Nayak and Dr. Sankar Das Sarma recently sat down to discuss these results and why they matter in the video below. Learn more about our journey and visit Azure Quantum to get started with quantum computing today.

Dr. Sankar Das Sarma, a Distinguished University Professor of Physics at the University of Maryland, joins Dr. Chetan Nayak, Distinguished Engineer of Quantum at Microsoft, to discuss Microsoft’s unique approach to building a fully scalable quantum machine.

Microsoft Quantum team reports observation of a 30 μeV topological gap in indium arsenide-aluminum heterostructures

Topological quantum computation is a route to hardware-level fault tolerance, potentially enabling a quantum computing system with high fidelity qubits, fast gate operations, and a single module architecture. The fidelity, speed, and size of a topological qubit is controlled by a characteristic energy called the topological gap. This path is only open if one can reliably produce a topological phase of matter and experimentally verify that the sub-components of a qubit are in a topological phase (and ready for quantum information processing). Doing so is not trivial because topological phases are characterized by the long-ranged entanglement of their ground states, which is not readily accessible to conventional experimental probes.

This difficulty was addressed by the “topological gap protocol” (TGP), which our team set forth a year ago as a criterion for identifying the topological phase with quantum transport measurements. Topological superconducting wires have Majorana zero modes at their ends. There is a real fermionic operator localized at each end of the wire, analogous to the real fermionic wave equation constructed by Ettore Majorana in 1937.

Consequently, there are two quantum states of opposite fermion parity that can only be measured through a phase-coherent probe coupled to both ends. In electrical measurements, the Majorana zero modes (see Figure 1) cause zero-bias peaks (ZBPs) in the local conductance. However, local Andreev bound states and disorder can also cause zero-bias peaks. Thus, the TGP focuses on ZBPs that are highly stable and, crucially, uses the non-local conductance to detect a bulk phase transition. Such a transition must be present at the boundary between the trivial superconducting phase and the topological phase because these are two distinct phases of matter, as different as water and ice.

Figure 1: The local density of states of a topological superconducting nanowire as a function of energy and position.

We have simulated our devices using models that incorporate the details of the materials stack, geometry, and imperfections. Our simulations have demonstrated that the TGP is a very stringent criterion, rendering it a reliable method for detecting the topological phase in a device. Crucially, the conditions for passing the protocol—the presence of stable ZBPs at both ends of the device over a gapped region with gapless boundary, as established via the non-local conductance—were established before any devices had been measured. Given the subtleties involved in identifying a topological phase, which stem from the absence of a local order parameter, one of the design principles of the TGP was to avoid confirmation bias. In particular, the device is scanned over its entire operating range instead of ‘hunting’ for a specific desired feature, such as a ZBP.
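As a loose illustration of the kind of joint criterion the TGP encodes (and emphatically not the actual protocol or analysis pipeline), the toy NumPy function below flags regions of a measured parameter space where zero-bias peaks appear at both wire ends while the non-local conductance indicates a gapped bulk.

```python
# Toy NumPy sketch of a TGP-like joint criterion; the real protocol involves far more
# careful statistics, gap extraction, and validation than shown here.

import numpy as np

def candidate_topological_region(g_left, g_right, g_nonlocal, bias_axis,
                                 zbp_threshold, gap_threshold):
    """
    g_left, g_right: local conductance maps at the two wire ends, shape (n_params, n_bias)
    g_nonlocal:      non-local conductance map, same shape
    Returns a boolean mask over the parameter axis marking candidate topological regions.
    """
    zero_idx = np.argmin(np.abs(bias_axis))              # bias column closest to zero
    zbp_left = g_left[:, zero_idx] > zbp_threshold       # zero-bias peak at the left end
    zbp_right = g_right[:, zero_idx] > zbp_threshold     # ... and at the right end
    # Inside a gapped bulk, sub-gap non-local transport is suppressed away from the boundary.
    bulk_gapped = np.max(np.abs(g_nonlocal), axis=1) < gap_threshold
    return zbp_left & zbp_right & bulk_gapped
```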

Microsoft’s Station Q, in Santa Barbara, CA, is the birthplace of Microsoft’s quantum program. For the last 16 years, it has been the host of a biannual conference on topological phases and quantum computing. After a two-year hiatus of in-person meetings due to the pandemic, the Station Q meetings resumed in early March. At this meeting with leaders in quantum computing from across industry and academia, we reported that we have multiple devices that have passed the TGP.

Our team has measured topological gaps exceeding 30 μeV, more than triple the noise level in the experiment and larger than the temperature by a similar factor, showing that the gap is a robust feature. This is both a landmark scientific advance and a crucial step on the journey to topological quantum computation, which relies on the fusion and braiding of anyons (the two primitive operations on topological quasiparticles). The topological gap controls the fault-tolerance that the underlying state of matter affords to these operations. More complex devices enabling these operations require multiple topological wire segments and rely on TGP as part of their initialization procedure. Our success was predicated on very close collaboration between our simulation, growth, fabrication, measurement, and data analysis teams. Every device design was simulated in order to optimize it over 23 different parameters prior to fabrication. This enabled us to determine the device tuning procedure during design.

Our results are backed by exhaustive measurements and rigorous data validation procedures. We obtained the large-scale phase diagram of multiple devices, derived from a combination of local and non-local conductances. Our analysis procedure was validated on simulated data in which we attempted to fool the TGP. This enabled us to rule out various null hypotheses with high confidence. Moreover, data analysis was led by a different team than the one that took the data, as part of the checks and balances between different groups within the team. Additionally, an expert council of independent consultants is vetting our results, and the response to date is overwhelmingly positive.

With the underlying physics demonstrated, the next step is a topological qubit. We hypothesize that the topological qubit will have a favorable combination of speed, size, and stability compared to other qubits. We believe ultimately it will power a fully scalable quantum machine in the future, which will in turn enable us to realize the full promise of quantum to solve the most complex and pressing challenges our society faces.


PeopleLens: Using AI to support social interaction between children who are blind and their peers

A young boy wearing the PeopleLens sits on the floor of a playroom holding a blind tennis ball in his hands. His attention is directed toward a woman sitting on the floor in front of him holding her hands out. The PeopleLens looks like small goggles that sit on the forehead. The image is marked with visual annotations to indicate what the PeopleLens is seeing and what sounds are being heard.
The PeopleLens is a new research technology designed to help people who are blind or have low vision better understand their immediate social environments by locating and identifying people in the space. Coupled with a scheme of work based on research and practices from psychology and speech and language therapy, the system can help children and young people who are blind more easily forge social connections with their peers.

For children born blind, social interaction can be particularly challenging. A child may have difficulty aiming their voice at the person they’re talking to and may put their head on their desk instead. Linguistically advanced young people may struggle with maintaining a topic of conversation, talking only about something of interest to them. Most noticeably, many children and young people who are blind struggle with engaging and befriending those in their age group despite a strong desire to do so. This is often deeply frustrating for the child or young person and can be equally so for their support network of family members and teachers who want to help them forge these important connections.

  • Publication: PeopleLens. The PeopleLens is an open-ended AI system that offers people who are blind or who have low vision further resources to make sense of and engage with their immediate social surroundings.

The PeopleLens is a new research technology that we’ve created to help young people who are blind (referred to as learners in our work) and their peers interact more easily. A head-worn device, the PeopleLens reads aloud in spatialized audio the names of known individuals when the learner looks at them. That means the sound comes from the direction of the person, assisting the learner in understanding both the relative position and distance of their peers. The PeopleLens helps learners build a People Map, a mental map of those around them needed to effectively signal communicative intent. The technology, in turn, indicates to the learner’s peers when the peers have been “seen” and can interact—a replacement for the eye contact that usually initiates interaction between people.

For children and young people who are blind, the PeopleLens is a way to find their friends; however, for teachers and parents, it’s a way for these children and young people to develop competence and confidence in social interaction. An accompanying scheme of work aims to guide the development of spatial attention skills believed to underpin social interaction through a series of games that learners using the PeopleLens can play with peers. It also sets up situations in which learners can experience agency in social interaction. A child’s realization that they can choose to initiate a conversation because they spot someone first or that they can stop a talkative brother from speaking by looking away is a powerful moment, motivating them to delve deeper into directing their own and others’ attention.

The PeopleLens is an advanced research prototype that works on Nreal Light augmented reality glasses tethered to a phone. While it’s not available for purchase, we are recruiting learners in the United Kingdom aged 5 to 11 who have the support of a teacher to explore the technology as part of a multistage research study. For the study, led by the University of Bristol, learners will be asked to use the PeopleLens for a three-month period beginning in September 2022. For more information, visit the research study information page. 

Research foundation 

The scheme of work, coauthored by collaborators Professor Linda Pring and Dr. Vasiliki Kladouchou, draws on research and practice from psychology and speech and language therapy in providing activities to do with the technology. The PeopleLens builds on the hypothesis that many social interaction difficulties for children who are blind stem from differences in the ways children with and without vision acquire fundamental attentional processes as babies and young children. For example, growing up, children with vision learn to internalize a joint visual dialogue of attention. A young child points at something in the sky, and the parent says, “Bird.” Through these dialogues, young children learn how to direct the attention of others. However, there isn’t enough research to understand how joint attention manifests in children who are blind. A review of the literature suggests that most research doesn’t account for a missing sense and that research specific to visual impairment doesn’t provide a framework for joint attention beyond the age of 3. We’re carrying out research to better understand how the development of joint attention can be improved in early education and augmented with technology.

How does the PeopleLens work? 

The PeopleLens is a sophisticated AI prototype system that is intended to provide people who are blind or have low vision with a better understanding of their immediate social environment. It uses a head-mounted augmented reality device in combination with four state-of-the-art computer vision algorithms to continuously locate, identify, track, and capture the gaze directions of people in the vicinity. It then presents this information to the wearer through spatialized audio—sound that comes from the direction of the person. The real-time nature of the system gives a sense of immersion in the People Map.

A graphic overview of the PeopleLens system describes its functionality and experience features with accompanying icons.
The PeopleLens helps the child wearing it build a mental map of those in their immediate social environment. Because the PeopleLens reads aloud the names of identified people in spatialized audio, the child is able to get a sense of the respective positions and distances of their peers. The system receives images and processes them with computer vision algorithms, as shown by the overlays on the top images in this screenshot of the PeopleLens development environment. The system then stitches together a world map that’s used to drive the experiences, as shown at the bottom right.

The PeopleLens is a ground-breaking technology that has also been designed to protect privacy. Among the algorithms underpinning the system is facial recognition of people who’ve been registered in the system. A person registers by taking several photographs of themselves with the phone attached to the PeopleLens. Photographs aren’t stored; instead, they’re converted into a vector of numbers that represents a face. These differ from any vectors used in other systems, so recognition by the PeopleLens doesn’t lead to recognition by any other system. No video or identifying information is captured by the system, ensuring that the images can’t be maliciously used.
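The privacy design can be illustrated with a small sketch: only embedding vectors are retained, and recognition reduces to comparing a new embedding against the registered ones. The embedding function and threshold here are stand-ins; the actual PeopleLens recognition models and vector formats are internal to the system.

```python
# Sketch of embedding-only recognition: store vectors, never photographs, and match by
# cosine similarity. The embedding model and threshold here are stand-ins.

from typing import Optional
import numpy as np

registered = {}   # name -> unit-length embedding vector; no images are retained

def register(name: str, enrollment_embeddings: np.ndarray) -> None:
    """Average the embeddings from several enrollment photos and keep only the vector."""
    v = enrollment_embeddings.mean(axis=0)
    registered[name] = v / np.linalg.norm(v)

def identify(query_embedding: np.ndarray, threshold: float = 0.6) -> Optional[str]:
    """Return the registered name with the most similar embedding, if similar enough."""
    q = query_embedding / np.linalg.norm(query_embedding)
    best_name, best_sim = None, threshold
    for name, v in registered.items():
        sim = float(q @ v)               # cosine similarity of unit vectors
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name
```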

The system employs a series of sounds to assist the wearer in placing people in the surrounding space: A percussive bump indicates when their gaze has crossed a person up to 10 meters away. The bump is followed by the person’s name if the person is registered in the system, is within 4 meters of the wearer, and both the person’s ears can be detected. The sound of woodblocks guides the wearer in finding and centering the face of a person the system has seen for 1 second but hasn’t identified, changing in pitch to help the wearer adjust their gaze accordingly. (Those people who are unregistered are acknowledged with a click sound.) A gaze notification can alert the wearer when they’re being looked at.
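Written out as a simple decision function, the cue policy described above looks roughly like this; the distances and sound names follow the description in this post, while the function signature and the "both ears detected" flag are illustrative simplifications.

```python
# Sketch of the audio-cue policy described above, reduced to a single decision function.

from typing import List, Optional

def audio_cues(gaze_crossed_person: bool, distance_m: float, registered_name: Optional[str],
               both_ears_detected: bool, unidentified_person_seen_1s: bool) -> List[str]:
    cues = []
    if gaze_crossed_person and distance_m <= 10.0:
        cues.append("percussive bump")                      # gaze crossed a person within 10 m
        if registered_name and distance_m <= 4.0 and both_ears_detected:
            cues.append(f"speak name: {registered_name}")   # close enough and facing the wearer
        elif registered_name is None:
            cues.append("click")                            # unregistered people are acknowledged
    if unidentified_person_seen_1s:
        cues.append("woodblocks (pitch guides centering the face)")
    return cues

print(audio_cues(True, 3.0, "Sam", True, False))   # ['percussive bump', 'speak name: Sam']
```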

A graphic overview of the PeopleLens system describes its functionality and experience features with accompanying icons.
The functionality of the PeopleLens system includes experience features such as recognizing a person in front of the wearer; attention notifications from the direction of those who look at the wearer; the ability to follow someone; and an orientation guide to help wearers find people and faces.

Community collaboration

The success of the PeopleLens, as well as systems like it, is dependent on a prototyping process that includes close collaboration with the people it is intended to serve. Our work with children who are blind and their support systems has put us on a path toward building a tool that can have practical value and empower those using it. We encourage those interested in the PeopleLens to reach out about participating in our study and help us further evolve the technology. 

To learn more about the PeopleLens and its development, check out the Innovation Stories blog about the technology.


µTransfer: A technique for hyperparameter tuning of enormous neural networks

An animated line-plot showing the stability of optimal learning rate as we change the neural network’s parametrization. The parametrization is varied by interpolating between mup-Parametrization and PyTorch default in terms of the scaling for the learning rate and the initialization scale. The animation shows that mup-Parametrization is the only parametrization that preserves the optimality of learning rate across model widths; it also achieves the best absolute performance across all parametrizations.

Great scientific achievements cannot be made by trial and error alone. Every launch in the space program is underpinned by centuries of fundamental research in aerodynamics, propulsion, and celestial mechanics. In the same way, when it comes to building large-scale AI systems, fundamental research forms the theoretical insights that drastically reduce the amount of trial and error necessary and can prove very cost-effective.

In this post, we relay how our fundamental research enabled us, for the first time, to tune enormous neural networks that are too expensive to train more than once. We achieved this by showing that a particular parameterization preserves optimal hyperparameters across different model sizes. This is the µ-Parametrization (or µP, pronounced “myu-P”) that we introduced in a previous paper, where we showed that it uniquely enables maximal feature learning in the infinite-width limit. In collaboration with researchers at OpenAI, we verified its practical advantage on a range of realistic scenarios, which we describe in our new paper, “Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer.”

By greatly reducing the need to guess which training hyperparameters to use, this technique can accelerate research on enormous neural networks, such as GPT-3 and potentially larger successors in the future. We also released a PyTorch package that facilitates the integration of our technique in existing models, available on the project GitHub page or by simply running pip install mup.

“µP provides an impressive step toward removing some of the black magic from scaling up neural networks. It also provides a theoretically backed explanation of some tricks used by past work, like the T5 model. I believe both practitioners and researchers alike will find this work valuable.”

— Colin Raffel, Assistant Professor of Computer Science, University of North Carolina at Chapel Hill and co-creator of T5

Scaling the initialization is easy, but scaling training is hard

Large neural networks are hard to train partly because we don’t understand how their behavior changes as their size increases. Early work on deep learning, such as by Glorot & Bengio and He et al., generated useful heuristics that deep learning practitioners widely use today. In general, these heuristics try to keep the activation scales consistent at initialization. However, as training starts, this consistency breaks at different model widths, as illustrated on the left in Figure 1.

Unlike at random initialization, behavior during training is much harder to mathematically analyze. Our goal is to obtain a similar consistency so that as model width increases, the change in activation scales during training stays consistent and similar to initialization to avoid numerical overflow and underflow. Our solution, µP, achieves this goal, as seen on the right in Figure 1, which shows the stability of network activation scales for the first few steps of training across increasing model width.

Two line-plots showing the change in activation scale between PyTorch default and the µ-Parametrization. Under PyTorch default, the activation scale grows as the network width increases for a particular time step. Under µ-Parametrization, the activation scale is stable across widths for a particular time step.
Figure 1: In the default parameterization in PyTorch, the graph on the left, the activation scales diverge in width after one step of training. But in µP, the graph on the right, the activation scales change by a consistent amount regardless of width for any training step. The y-axis shows the change of network activation scales on a fixed input after t=0, 1, 2, 3, and 4 steps of training as the width of the model varies, which is shown along the x-axis. 
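A quick way to see the left-hand behavior for yourself is to measure how much a network's output activations move after a single training step as the width grows. The short script below does this under the PyTorch default parameterization with plain SGD; it is a toy diagnostic in the spirit of Figure 1, not the coordinate-check tooling that ships with the mup package.

```python
# Toy diagnostic: under the PyTorch default parameterization with plain SGD, the change
# in output activations after one training step grows with width (the left of Figure 1).

import torch
import torch.nn as nn

def output_change_after_one_sgd_step(width: int, lr: float = 0.1) -> float:
    torch.manual_seed(0)
    net = nn.Sequential(nn.Linear(64, width), nn.ReLU(), nn.Linear(width, 1))
    x, y = torch.randn(128, 64), torch.randn(128, 1)
    with torch.no_grad():
        before = net(x).abs().mean().item()
    loss = ((net(x) - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        for p in net.parameters():
            p -= lr * p.grad                    # one SGD step with the same lr at every width
        after = net(x).abs().mean().item()
    return abs(after - before)

for w in (256, 1024, 4096):
    print(w, round(output_change_after_one_sgd_step(w), 3))   # grows with width under the default
```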

Our parameterization, which maintains this consistency during training, follows two pieces of crucial insight. First, gradient updates behave differently from random weights when the width is large. This is because gradient updates are derived from data and contain correlations, whereas random initializations do not. Therefore, they need to be scaled differently. Second, parameters of different shapes also behave differently when the width is large. While we typically divide parameters into weights and biases, with the former being matrices and the latter vectors, some weights behave like vectors in the large-width setting. For example, the embedding matrix in a language model is of size vocabsize x width. While the width tends to infinity, vocabsize stays constant and finite. During matrix multiplication, summing along a finite dimension behaves very differently from summing along an infinite one.

These insights, which we discuss in detail in a previous blog post, motivated us to develop µP. In fact, beyond just keeping the activation scale consistent throughout training, µP ensures that neural networks of different and sufficiently large widths behave similarly during training such that they converge to a desirable limit, which we call the feature learning limit.

A theory-guided approach to scaling width

Our theory of scaling enables a procedure to transfer training hyperparameters across model sizes. If, as discussed above, µP networks of different widths share similar training dynamics, they likely also share similar optimal hyperparameters. Consequently, we can simply apply the optimal hyperparameters of a small model directly onto a scaled-up version. We call this practical procedure µTransfer. If our hypothesis is correct, the training loss-hyperparameter curves for µP models of different widths would share a similar minimum.

Conversely, our reasoning suggests that no scaling rule of initialization and learning rate other than µP can achieve the same result. This is supported by the animation below. Here, we vary the parameterization by interpolating the initialization scaling and the learning rate scaling between PyTorch default and µP. As shown, µP is the only parameterization that preserves the optimal learning rate across width, achieves the best performance for the model with width 2^13 = 8192, and ensures that wider models always do better for a given learning rate (that is, graphically, the curves don’t intersect).

An animated line-plot showing the stability of optimal learning rate as we change the neural network’s parametrization. The parametrization is varied by interpolating between µ-Parametrization and PyTorch default in terms of the scaling for the learning rate and the initialization scale. The animation shows that µ-Parametrization is the only parametrization that preserves the optimality of learning rate across model widths; it also achieves the best absolute performance across all parametrizations.
Figure 2: On the left, we train multilayer perceptrons (MLPs) of different widths (which correspond to the curves of different colors and patterns) with different learning rates (shown along the x-axis) on CIFAR10 and plot the training loss along the y-axis. On the right, the 2D plane of parameterizations is formed by interpolation of 1) the initialization scaling between PyTorch default and µP (x-axis), and 2) the learning rate scaling between PyTorch default and µP (y-axis). On this plane, PyTorch default is represented by (0, 0) and µP by (1, 1). The width-256 (log2(width) = 8) model is the same across all frames (except for random seed), but we widen models according to the parameterization represented by the dot on the right. 

Building on the theoretical foundation of Tensor Programs, µTransfer works automatically for advanced architectures, such as Transformer and ResNet. It can also simultaneously transfer a wide range of hyperparameters. Using Transformer as an example, we demonstrate in Figure 3 how the optima of key hyperparameters are stable across widths. 

Four line-plots showing the stability of optima of various hyperparameters across widths. From left-to-right and top-to-bottom, we see that the optima for learning rate, cross-entropy temperature, initialization standard deviation, and learning rate schedule are all roughly stable across widths, from 128 to 4,096.
Figure 3: Transformers of different widths parameterized in µP and trained on WikiText-2. As we increase model width, the optimal learning rate, cross-entropy temperature, initialization scale, and learning rate schedule remain stable. We can meaningfully predict the optimal hyperparameters of a wider network by looking at those of a narrow one. In the plot on the lower right, we tried the following learning rate schedules: (a) linear decay, (b) StepLR @ [5k, 8k] with a decay factor of 0.1, (c) StepLR @ [4k, 7k] with a decay factor of 0.3, (d) cosine annealing, (e) constant, and (f) inverse square-root decay.

“I am excited about µP advancing our understanding of large models. µP’s principled way of parameterizing the model and selecting the learning rate make it easier for anybody to scale the training of deep neural networks. Such an elegant combination of beautiful theory and practical impact.”

— Johannes Gehrke, Technical Fellow, Lab Director of Research at Redmond, and CTO and Head of Machine Learning for the Intelligent Communications and Conversations Cloud (IC3)

Beyond width: Empirical scaling of model depth and more

Modern neural network scaling involves many more dimensions than just width. In our work, we also explore how µP can be applied to realistic training scenarios by combining it with simple heuristics for nonwidth dimensions. In Figure 4, we use the same transformer setup to show how the optimal learning rate remains stable within reasonable ranges of nonwidth dimensions. For hyperparameters other than learning rate, see Figure 19 in our paper. 

Four line-plots showing the stability of the optimal learning rate across width, depth, batch size, and sequence length. The width is varied from 128 to 4,096, the depth from 2 to 32, the batch size from 20 to 512, and the sequence length from 32 to 512.
Figure 4: Transformers of different sizes parameterized in µP and trained on Wikitext-2. Not only does the optimal learning rate transfer across width, as shown in Figure 3, it also empirically transfers across other scale dimensions—such as depth, batch size, and sequence length—across the ranges we tested here. This means we can combine our theoretically motivated transfer across width with the empirically verified one across other scale dimensions to obtain the practical procedure, µTransfer, to tune hyperparameters indirectly on a small model and transfer to a large one. 

Testing µTransfer

Now that we have verified the transfer of individual hyperparameters, it is time to combine them in a more realistic scenario. In Figure 5, we compare µTransfer, which transfers tuned hyperparameters from a small proxy model, with directly tuning the large target model. In both cases, the tuning is done via random search. Figure 5 illustrates a Pareto frontier of the relative tuning compute budget compared with the tuned model quality (BLEU score) on IWSLT14 De-En, a machine translation dataset. Across all compute budget levels, µTransfer is about an order of magnitude (in base 10) more compute-efficient for tuning. We expect this efficiency gap to dramatically grow as we move to larger target model sizes. 

A line-plot showing the Pareto-front corresponding to model performance measured in BLEU score and the compute budget for hyperparameter tuning. The curve representing our method, µTransfer, dominates that of conventional tuning with a margin of roughly 10 times in compute budget. Our method also yields the best absolute performance, at almost 35.4 in BLEU score, whereas the conventional method tops out at 35.2.
Figure 5: Across different tuning budgets, µTransfer dominates the baseline method of directly tuning the target model. As we train larger target models with billions of parameters, we expect the performance gap to widen, since the proxy model can remain small while still meaningfully predicting the optimal hyperparameters, as shown in Figures 3 and 4. 

A glimpse of the future: µP + GPT-3

Before this work, the larger a model was, the less well-tuned we expected it to be due to the high cost of tuning. Therefore, we expected that the largest models could benefit the most from µTransfer, which is why we partnered with OpenAI to evaluate it on GPT-3. 

After parameterizing a version of GPT-3 with relative attention in µP, we tuned a small proxy model with 40 million parameters before copying the best hyperparameter combination to the 6.7-billion parameter variant of GPT-3, as prescribed by µTransfer. The total compute used during this tuning stage was only 7 percent of the compute used in the pretraining of the final 6.7-billion model. This µTransferred model outperformed the model of the same size (with absolute attention) in the original GPT-3 paper. In fact, it performs similarly to the model (with absolute attention) with double the parameter count from the same paper, as shown in Figure 6. 

Two bar-plots showing the relative performance of GPT-3 6.7B compared to GPT-3 6.7B tuned with µTransfer. On language modeling tasks, including PTB, Wikitext 103, and LM1B, the run with µTransfer achieves lower perplexities. On NLU tasks, including HellaSwag, LAMBADA, and SQuADv2, the run with µTransfer achieves higher accuracies, comparable to those achieved by GPT-3 6.7B or GPT-3 13B tuned without µTransfer.
Figure 6: We applied µTransfer to the GPT-3 6.7-billion-parameter model with relative attention and obtained better results than the baseline with absolute attention used in the original GPT-3 paper, all while only spending 7 percent of the pretraining compute budget on tuning. The performance of this µTransferred 6.7-billion-parameter model is comparable to that of the 13-billion-parameter model (with absolute attention) in the original GPT-3 paper.

Implications for deep learning theory

As shown previously, µP gives a scaling rule which uniquely preserves the optimal hyperparameter combination across models of different widths in terms of training loss. Conversely, other scaling rules, like the default in PyTorch or the NTK parameterization studied in the theoretical literature, are looking at regions in the hyperparameter space farther and farther from the optimum as the network gets wider. In that regard, we believe that the feature learning limit of µP, rather than the NTK limit, is the most natural limit to study if our goal is to derive insights that are applicable to feature learning neural networks used in practice. As a result, more advanced theories on overparameterized neural networks should reproduce the feature learning limit of µP in the large width setting. 

Theory of Tensor Programs

The advances described above are made possible by the theory of Tensor Programs (TPs) developed over the last several years. Just as autograd helps practitioners compute the gradient of any general computation graph, TP theory enables researchers to compute the limit of any general computation graph when its matrix dimensions become large. Applied to the underlying graphs for neural network initialization, training, and inference, the TP technique yields fundamental theoretical results, such as the architectural universality of the Neural Network-Gaussian Process correspondence and the Dynamical Dichotomy theorem, in addition to deriving µP and the feature learning limit that led to µTransfer. Looking ahead, we believe extensions of TP theory to depth, batch size, and other scale dimensions hold the key to the reliable scaling of large models beyond width. 

Applying µTransfer to your own models

Even though the math can be intuitive, we found that implementing µP (which enables µTransfer) from scratch can be error prone. This is similar to how autograd is tricky to implement from scratch even though the chain rule for taking derivatives is very straightforward. For this reason, we created the mup package to enable practitioners to easily implement µP in their own PyTorch models, just as frameworks like PyTorch, TensorFlow, and JAX have enabled us to take autograd for granted. Please note that µTransfer works for models of any size, not just those with billions of parameters. 
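A condensed sketch of the usage pattern, based on the package's documented workflow, is shown below; the widths, layer sizes, and optimizer choice are illustrative, and the repository covers additional steps (such as re-initializing parameters with mup's init helpers) that a real integration should follow.

```python
# Condensed sketch of integrating mup into a PyTorch model; see the repository for the
# complete recipe (including mup's init helpers and coordinate checks).

import torch.nn as nn
from mup import MuReadout, set_base_shapes, MuAdam

class MyMLP(nn.Module):
    def __init__(self, width: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(3072, width)
        self.fc2 = nn.Linear(width, width)
        self.readout = MuReadout(width, 10)      # replaces the final nn.Linear output layer

    def forward(self, x):
        return self.readout(self.fc2(self.fc1(x).relu()).relu())

base = MyMLP(width=64)      # small reference model that defines the base shapes
delta = MyMLP(width=128)    # used to infer which dimensions scale with width
model = MyMLP(width=4096)   # the model you actually train
set_base_shapes(model, base, delta=delta)

optimizer = MuAdam(model.parameters(), lr=1e-3)  # µP-aware optimizer from the package
```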

The journey has just begun

While our theory explains why models of different widths behave differently, more investigation is needed to build a theoretical understanding of the scaling of network depth and other scale dimensions. Many works have addressed the latter, such as the research on batch size by Shallue et al., Smith et al., and McCandlish et al., as well as research on neural language models in general by Rosenfeld et al. and Kaplan et al. We believe µP can remove a confounding variable for such investigations. Furthermore, recent large-scale architectures often involve scale dimensions beyond those we have talked about in our work, such as the number of experts in a mixture-of-experts system. Another high-impact domain to which µP and µTransfer have not been applied is fine-tuning a pretrained model. While feature learning is crucial in that domain, the need for regularization and the finite-width effect prove to be interesting challenges. 

We firmly believe in fundamental research as a cost-effective complement to trial and error and plan to continue our work to derive more principled approaches to large-scale machine learning. To learn about our other deep learning projects or opportunities to work with us and even help us expand µP, please go to our Deep Learning Group page.


COMPASS: COntrastive Multimodal Pretraining for AutonomouS Systems

Figure 1: COMPASS is a general-purpose pretraining pipeline, which is trained on multimodal data, including RGB images, segmentation, depth and optical flow. The pretrained COMPASS model can be deployed to various downstream tasks of autonomous systems. In this work, we transfer COMPASS to drone navigation, car racing and visual odometry, which are deployed in very different environments and application scenarios.
Figure 1: COMPASS is a general-purpose pretraining pipeline, which is trained on multimodal data, including RGB images, depth and optical flow. The pretrained COMPASS model can be deployed on various downstream autonomous systems tasks. In this work, we test COMPASS on simulated drone navigation, car racing and visual odometry. This highlights how the system can be deployed in very different environments and application scenarios.

Humans have the fundamental cognitive ability to perceive the environment through multimodal sensory signals and utilize this to accomplish a wide variety of tasks. It is crucial that an autonomous agent can similarly perceive the underlying state of an environment from different sensors and appropriately consider how to accomplish a task. For example, localization (or “where am I?”) is a fundamental question that needs to be answered by an autonomous agent prior to navigation, often addressed via visual odometry. Highly dynamic tasks, such as vehicle racing, necessitate collision avoidance and understanding of the temporal evolution of their state with respect to the environment. Agents must learn perceptual representations of geometric and semantic information from the environment so that their actions can influence the world.

Task-driven approaches are appealing, but learning representations that are suitable only for a specific task limits their ability to generalize to new scenarios, thus confining their utility. For example, as shown in Figure 1, to tackle drone navigation and vehicle racing, practitioners usually need to design separate models to encode representations from very different sensor modalities, environments, sensory signals, sampling rates, and so on. Such models must also cope with the different dynamics and controls of each application scenario. Therefore, we ask whether it is possible to build general-purpose pretrained models for autonomous systems that are agnostic to tasks and individual form factors.

In our recent work, COMPASS: Contrastive Multimodal Pretraining for Autonomous Systems, we introduce a general-purpose pretraining pipeline, built to overcome such limitations arising from task-specific models. The code can be viewed on GitHub.

COMPASS features three key aspects:

  • COMPASS is a general-purpose large-scale pretraining pipeline for perception-action loops in autonomous systems. Representations learned by COMPASS generalize to different environments and significantly improve performance on relevant downstream tasks.
  • COMPASS is designed to handle multimodal data. Given the prevalence of multitudes of sensors in autonomous systems, the framework is designed to utilize rich information from different sensor modalities.
  • COMPASS is trained in a self-supervised manner, which does not require manual labels, and hence can leverage large-scale data for pretraining.

We demonstrate how COMPASS can be used to solve various downstream tasks across three different scenarios: drone navigation, vehicle racing, and visual odometry.

Challenges in learning generic representations for autonomous systems

Although general-purpose pretrained models have made breakthroughs in natural language processing (NLP) and in computer vision, building such models for autonomous systems has its own challenges.

  • Autonomous systems deal with complex perception-action interplay. The target learning space is highly variable due to a wide range of environmental factors and application scenarios. This is in stark contrast to language models, which focus on underlying linguistic representations, or visual models, which focus on object-centric semantics. These aspects make existing pretraining approaches inadequate for autonomous systems.
  • The environments are usually perceived through multimodal sensors, so the model must be able to make sense of multimodal data. Existing multimodal learning approaches focus primarily on mapping multimodal data into joint latent spaces. Though they have shown promising results in applications of video, audio, and text, they are suboptimal for autonomous systems. Approaches that learn a single joint latent space fail to respect different properties of multimodal data, such as sampling rate and temporal dynamics. On the other hand, mapping into disjoint latent spaces loses the connection among the modalities and limits the usage in complex autonomous systems, because different autonomous systems can be equipped with a wide variety of sensor configurations.
  • Unlike NLP and computer vision, there is a scarcity of multimodal data that can be used to train large pretrained representations for autonomous systems.
a multimodal graph which maps modalities into factored spatial and temporal latent spaces.
Figure 2: Given multimodal signals of spatial and temporal modalities \( \mathcal{M}_s \) and \( \mathcal{M}_m \), respectively, COMPASS learns two factorized latent spaces, i.e., a motion pattern space \( \mathcal{O}_m \) and a current state space \( \mathcal{O}_s \), using multimodal correspondence as the self-supervisory signal.

Factorized spatiotemporal latent spaces for learning representations

COMPASS is a multimodal pretraining framework for perception and action in autonomous systems. COMPASS builds general-purpose multimodal representations that can generalize to different environments and tasks.

Two questions inform our design choices in COMPASS:

  • What essential pieces of information are common for all tasks of autonomous systems?
  • How can we effectively learn representations from complex multimodal data to capture the desired information?

The network architecture design must adhere to the spatiotemporal constraints of autonomous systems. The representation needs to account for the motion (ego-motion or environmental) and its temporal aspects as well as the spatial, geometric, and semantic cues perceived through the sensors. Therefore, we propose a multimodal graph that captures the spatiotemporal properties of the modalities (Fig. 2). The graph is designed to map each of the modalities into two factorized spatiotemporal latent subspaces: 1) the motion pattern space and 2) the current state space. The self-supervised training then uses multimodal correspondence to associate the modality to the different latent spaces. Such a factorized representation further allows systems equipped with different sensors to use the same pretrained model.

While plenty of sensor modalities are rich in spatial and semantic cues (such as RGB images and depth sensors), we note that certain modalities primarily contain information about the temporal aspect (such as IMU and optical flow). Given such a partition of modalities between spatially informative (\( \mathcal{M}_s \)) and temporally informative (\( \mathcal{M}_m \)) data, we jointly learn two latent spaces, a “motion pattern space” \( \mathcal{O}_m \) and a “current state space” \( \mathcal{O}_s \).

Pretraining pipeline and model design of COMPASS model.
Figure 3: Self-supervised pretraining pipeline based on contrastive learning for COMPASS.

Contrastive learning via multimodal graph connections

The key intuition behind the self-supervised objective for training COMPASS is that if the representation successfully captures spatiotemporal information across multiple modalities, then each modality should have some predictive capacity both for itself and for the others. We formulate this intuition into a contrastive learning objective. Figure 3 graphically depicts the idea where the modality-specific encoders \( E \) extract embeddings from each modality. These are then mapped to the common motion pattern space \( \mathcal{O}_m \) through the motion pattern projection head \( \mathcal{F}_m \). A prediction head \( \mathcal{P} \) is added on top to perform future prediction. The contrastive loss is computed between the predicted future representations and their corresponding encoded true representations. Similarly, the contrastive objective also associates the data between distinct spatial modalities \( \mathcal{M}_s \) projected onto the current state space \( \mathcal{O}_s \) at every time step.
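The loss itself builds on the standard InfoNCE form, sketched below: predicted future embeddings should match their true encoded counterparts more closely than any other sample in the batch. Encoders and projection heads are abstracted away here, and this is not the exact COMPASS objective, only the familiar contrastive core it builds on.

```python
# Standard InfoNCE-style contrastive loss: each predicted embedding should be most
# similar to its own encoded target within the batch.

import torch
import torch.nn.functional as F

def info_nce(predicted: torch.Tensor, target: torch.Tensor, temperature: float = 0.07):
    """predicted, target: (batch, dim) embeddings; row i of each is a positive pair."""
    predicted = F.normalize(predicted, dim=-1)
    target = F.normalize(target, dim=-1)
    logits = predicted @ target.t() / temperature    # (batch, batch) similarity matrix
    labels = torch.arange(predicted.size(0))         # the i-th prediction matches the i-th target
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
```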

Note that modalities that are primarily temporal are projected to the motion pattern space through \( \mathcal{F}_m \) only. Modalities that are only spatial are first projected onto the current state space by \( \mathcal{F}_s \). To better associate spatial modalities with the temporal ones, we introduce a spatiotemporal connection in which spatial modalities from multiple time steps are aggregated via an aggregator head \( \mathcal{G} \) and projected into the motion pattern space. Such a multimodal graph with spatial, temporal, and spatiotemporal connections serves as a framework for learning multimodal representations by encoding the underlying properties of the modalities (such as static or dynamic) as well as any common information shared between them (for example, geometry and motion).
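A minimal sketch of the spatiotemporal connection is shown below. The choice of a recurrent aggregator and the dimensions are assumptions made for illustration; the actual aggregator head \( \mathcal{G} \) in COMPASS may differ.

    # Sketch of the spatiotemporal connection (aggregator architecture is assumed).
    import torch.nn as nn

    class SpatioTemporalAggregator(nn.Module):
        def __init__(self, latent_dim=128):
            super().__init__()
            self.aggregate = nn.GRU(latent_dim, latent_dim, batch_first=True)  # plays the role of G
            self.to_motion = nn.Linear(latent_dim, latent_dim)                 # projection into O_m

        def forward(self, spatial_seq):
            # spatial_seq: (B, T, D) current-state features from T consecutive time steps
            _, h = self.aggregate(spatial_seq)
            return self.to_motion(h[-1])  # (B, D) motion-pattern-space representation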

Finally, we tackle the challenge of data scarcity by resorting to simulation. In particular, we build upon our previous work in high-fidelity simulation with AirSim and use the TartanAir dataset (TartanAir: A Dataset to Push the Limits of Visual SLAM – Microsoft Research) to train the model.

Deploying COMPASS to downstream tasks

After pretraining, the COMPASS model can be finetuned for several downstream tasks. Based on the sensor modalities available for the task of interest, we connect the appropriate pretrained COMPASS encoders to small neural network modules responsible for task-specific predictions, such as robot actions or camera poses. This combined model is then finetuned with data and objectives from the specific task.
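In practice, this amounts to wrapping a pretrained encoder with a small task head and training end to end, as in the hypothetical sketch below (class and parameter names are placeholders, not part of the COMPASS release).

    # Hypothetical finetuning wrapper; names and head architecture are assumptions.
    import torch
    import torch.nn as nn

    class DownstreamPolicy(nn.Module):
        def __init__(self, pretrained_encoder, feat_dim=512, out_dim=4):
            super().__init__()
            self.encoder = pretrained_encoder              # pretrained COMPASS encoder
            self.head = nn.Sequential(                     # small task-specific module
                nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))

        def forward(self, obs):
            return self.head(self.encoder(obs))            # e.g. velocity commands or camera pose

    # Finetune end to end on task data, optionally with a smaller learning rate for the encoder:
    # optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)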

We demonstrate the effectiveness of COMPASS as a general-purpose pretraining approach on three downstream tasks: simulated drone navigation, simulated vehicle racing, and visual odometry. Figure 4 and Table 1 show some details about both our pretraining as well as downstream task datasets.

Data samples from the pretraining dataset (TartanAir) and the downstream datasets for drone navigation, car racing, and visual odometry.
Figure 4: Samples from TartanAir and the downstream task datasets. TartanAir contains RGB, depth, segmentation, and optical flow modalities.
Dataset       Usage             Scale   Env.
TartanAir     Pretraining       1M      16
Soccer-gate   Drone navigation  3K      1
KITTI         Visual odometry   23K     11
AirSim-Car    Car racing        17K     9
Table 1: Datasets used in our experiments.

Drone Navigation

The goal of this task is to enable a quadrotor drone to navigate through a series of gates whose locations are unknown to it a priori. The simulated environment contains a diverse set of gates varying in shape, size, color, and texture. Given RGB images from the camera onboard the drone, the model is asked to predict velocity commands that make the drone successfully pass through the series of gates. Figure 5 highlights that finetuning COMPASS for this velocity prediction task results in better performance than training a model from scratch.

Line plots showing validation errors on the drone navigation task.
Figure 5(a-d): Performance of COMPASS on drone velocity predictions, compared with a model trained from scratch.

COMPASS can improve data efficiency. Finetuning pretrained COMPASS models is more data efficient than training models from scratch. Figure 6 compares finetuning performance with different amounts of data to training from scratch. COMPASS finetuning consistently produces lower errors than training from scratch, even when given less data.

Line plots comparing COMPASS finetuning with training from scratch under varying amounts of data.
Figure 6: Comparison of COMPASS finetuning vs. training from scratch with varying amounts of data.

Visual Odometry

Visual odometry (VO) aims to estimate camera motion from consecutive image frames. It is a fundamental component of visual SLAM, which is widely used for localization in robotics. We evaluate COMPASS on the VO task using a widely used real-world dataset (The KITTI Vision Benchmark Suite (cvlibs.net)). We first use an off-the-shelf optical flow model (PWC-Net) to generate optical flow from consecutive image frames; the flow is then fed to the optical flow encoder of COMPASS, eventually resulting in predicted camera motion.
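A hypothetical sketch of this pipeline is shown below; the function and module names are placeholders, and the exact pose parameterization used by COMPASS is not spelled out here.

    # Sketch of the VO pipeline described above (names are placeholders).
    import torch

    def estimate_relative_pose(frame_t, frame_t1, flow_net, flow_encoder, pose_head):
        with torch.no_grad():
            flow = flow_net(frame_t, frame_t1)   # off-the-shelf PWC-Net-style flow model
        features = flow_encoder(flow)            # pretrained COMPASS optical flow encoder
        return pose_head(features)               # relative camera motion between the two frames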

Methods         Seq. 09 t_rel  Seq. 09 r_rel  Seq. 10 t_rel  Seq. 10 r_rel
ORB-SLAM2       15.3           0.26           3.71           0.3
DVSO            0.83           0.21           0.74           0.21
D3VO            0.78           –              0.62           –
VISO2-M         4.04           1.43           25.2           3.8
DeepVO          N/A            N/A            8.11           8.83
Wang et al.     8.04           1.51           6.23           0.97
TartanVO        6.00           3.11           6.89           2.73
UnDeepVO        N/A            N/A            10.63          4.65
GeoNet          26.93          9.54           20.73          9.04
COMPASS (ours)  2.79           0.98           2.41           1.00
Table 2: Comparison of translation and rotation errors on the KITTI dataset. The first section includes three SLAM methods, while the others are VO approaches. \( t_{rel} \): average translational RMSE drift (%) over trajectory lengths of 100-800 m. \( r_{rel} \): average rotational RMSE drift (°/100 m) over trajectory lengths of 100-800 m.
Trajectory plots of different approaches on the KITTI dataset.
Figure 7: Comparison of the KITTI trajectories predicted by different VO approaches. TartanVO is a learning-based VO method (relying on only two frames, same as ours), and ORB-SLAM2 is a geometry-based SLAM system (which includes multi-frame optimization).

COMPASS can adapt to real-world scenarios. In this experiment, we finetune the model on sequences 00-08 of KITTI and test it on sequences 09 and 10. For a comprehensive investigation, we compare COMPASS with both SLAM methods and visual odometry methods. The results are shown in Table 2, where we list the relative pose error (RPE), the same metric used in the KITTI benchmark. Using the pretrained flow encoder from COMPASS within this VO pipeline achieves better results than several other VO methods and is even comparable to SLAM methods. Figure 7 shows the predicted trajectories of sequences 09 and 10 compared to ground truth. For clarity, we plot one representative model each from the geometry-based and learning-based approaches. We can see that, although pretrained purely on simulation data, COMPASS adapts well when finetuned on real-world scenarios.

Vehicle Racing

The goal here is to enable autonomous vehicles to drive in a competitive Formula racing environment. The simulated environment contains visual distractors such as advertising signs, tires, grandstands, and fences, which help add realism and increase task difficulty. Given RGB images from the environment as input, the control module must predict the steering wheel angle for a car to successfully maneuver around the track and avoid obstacles.
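This is the same finetuning pattern sketched earlier: for example, the hypothetical DownstreamPolicy wrapper above could be reused with a single-output head for steering (the encoder variable below is a placeholder).

    # Hypothetical reuse of the DownstreamPolicy sketch for steering prediction.
    steering_model = DownstreamPolicy(pretrained_encoder=compass_rgb_encoder, out_dim=1)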

Model     Seen env.      Unseen env.
SCRATCH   0.085 ± 0.025  0.120 ± 0.009
CPC       0.037 ± 0.012  0.101 ± 0.017
CMC       0.039 ± 0.013  0.102 ± 0.012
JOINT     0.055 ± 0.016  0.388 ± 0.018
DISJOINT  0.039 ± 0.017  0.131 ± 0.016
COMPASS   0.041 ± 0.013  0.071 ± 0.023
Table 3: Steering prediction results for vehicle racing in seen and unseen environments (lower is better).
Line plots comparing training and validation performance of several approaches on the vehicle racing task.
Figure 8: Training (a) and validation (b) loss curves on the vehicle racing task.

COMPASS can generalize to unseen environments. We hypothesize that better perception, enabled by pretraining, improves generalization to unseen environments. To test this, we evaluate models in two settings: 1) trained and evaluated on all nine scenarios (“seen”); 2) trained on eight scenarios and evaluated on the held-out scenario (“unseen”). Table 3 shows that the performance degradation in the unseen environment is relatively marginal with COMPASS, which suggests its effectiveness compared to the other pretraining approaches.

COMPASS can benefit from a multimodal training regime. We investigate the effectiveness of pretraining on multimodal data by analyzing loss curves from different pretrained models on the same “unseen” environment. Figure 8(b) compares the validation loss curves of COMPASS, RGB, and Scratch, where RGB is a model pretrained only on RGB images. By pretraining on multimodal data, COMPASS achieves the best performance overall. Both pretrained models also show a large gap over the model trained from scratch (Scratch). Comparing Figure 8(a) with Figure 8(b), we see that Scratch suffers more from overfitting than the other two models.

Conclusion

We introduce COntrastive Multimodal pretraining for AutonomouS Systems (COMPASS), a ‘general’ pretraining framework that learns multimodal representations to tackle various downstream autonomous system tasks. In contrast to existing task-specific approaches in autonomous systems, COMPASS is trained entirely agnostic to any downstream task, with the primary goal of extracting information that is common to multiple scenarios. COMPASS learns to associate multimodal data with respect to their properties, allowing it to encode the spatiotemporal nature of data commonly observed in autonomous systems. We demonstrated that COMPASS generalizes well to different downstream tasks, namely drone navigation, vehicle racing, and visual odometry, even in unseen environments, in real-world environments, and in the low-data regime.
