Unlocking the future of computing: The Analog Iterative Machine’s lightning-fast approach to optimization 

Picture a world where computing is not limited by the binary confines of zeros and ones but is instead free to explore the vast possibilities of continuous-value data. Over the past three years, a team of Microsoft researchers has been developing a new kind of analog optical computer that uses photons and electrons to process continuous-value data, unlike today’s digital computers, which use transistors to crunch through binary data. This innovative machine has the potential to surpass state-of-the-art digital technology and transform computing in the years to come.

The Analog Iterative Machine (AIM) is designed to solve difficult optimization problems, which form the foundation of many industries, such as finance, logistics, transportation, energy, healthcare, and manufacturing. However, traditional digital computers struggle to crack these problems in a timely, energy-efficient and cost-effective manner. This is because the number of possible combinations explodes exponentially as the problem size grows, making it a massive challenge for even the most powerful digital computers. The Traveling Salesman Problem is a classic example. Imagine trying to find the most efficient route for visiting a set of cities just once before returning to the starting point. With only five cities, there are 12 possible routes – but for a 61-city problem, the number of potential routes surpasses the number of atoms in the universe.
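
To make that combinatorial explosion concrete, here is a quick Python check of the route counts (the comparison uses the commonly cited figure of roughly 10⁸⁰ atoms in the observable universe):

```python
from math import factorial

def tsp_route_count(n_cities: int) -> int:
    """Distinct closed tours visiting every city exactly once.

    Fixing the starting city and ignoring tour direction leaves
    (n - 1)! / 2 routes.
    """
    return factorial(n_cities - 1) // 2

print(tsp_route_count(5))    # 12
print(tsp_route_count(61))   # about 4.2 x 10^81, exceeding ~10^80 atoms
```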

AIM addresses two simultaneous trends. First, it sidesteps the diminishing growth of computing capacity per dollar in digital chips – or the unraveling of Moore’s Law. Second, it overcomes the limitations of specialized machines designed for solving optimization problems. Despite over two decades of research and substantial industry investment, such unconventional hardware-based machines have a limited range of practical applications, because they can only address optimization problems with binary values. This painful realization within the optimization community has driven the team to develop AIM, with a design that combines mathematical insights with cutting-edge algorithmic and hardware advancements. The result? An analog optical computer that can solve a much wider range of real-world optimization problems while operating at the speed of light, offering potential speed and efficiency gains of about a hundred times.

Today, AIM is still a research project, but the cross-disciplinary team has recently assembled the world’s first opto-electronic hardware for mixed – continuous and binary – optimization problems. Though presently operating on a limited scale, the initial results are promising, and the team has started scaling up its efforts. This includes a research collaboration with the UK-based multinational bank Barclays to solve an optimization problem critical to the financial markets on the AIM computer. Separate engagements are aimed at gaining more experience in solving industry-specific optimization problems. In June 2023, the team launched an online service that provides an AIM simulator to allow partners to explore the opportunities created by this new kind of computer.

The technology 

Photons possess a remarkable property: they do not interact with one another. This property has underpinned the internet era, enabling large amounts of data to be transmitted via light across vast distances. However, photons do interact with the matter through which they propagate, and this interaction enables linear operations such as addition and multiplication, which form the basis for optimization applications. For instance, when light falls on the camera sensor of a smartphone, the sensor adds up the incoming photons and generates a proportional amount of current – addition in the optical domain. Similarly, the fiber-based data transmission that brings internet connectivity to homes and businesses relies on encoding zeroes and ones onto light by programmatically controlling its intensity; such scaling of light intensity by a specific value through light-matter interaction amounts to multiplication in the optical domain. Beyond optical technologies for linear operations, various other electronic components prevalent in everyday technologies can perform the non-linear operations that are also critical for efficient optimization algorithms.

Analog optical computing thus involves constructing a physical system using a combination of analog technologies – both optical and electronic – governed by equations that capture the required computation. This can be very efficient for specific application classes where linear and non-linear operations are dominant. In optimization problems, finding the optimal solution is akin to discovering a needle in an inconceivably vast haystack. The team has developed a new algorithm that is highly efficient at such needle-finding tasks. Crucially, the algorithm’s core operation involves performing hundreds of thousands or even millions of vector-matrix multiplications – the vectors represent the problem variables whose values need to be determined while the matrix encodes the problem itself. These multiplications are executed swiftly and with low energy consumption using commodity optical and electronic technologies, as shown in Figure 1.
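
The post does not spell out the algorithm itself, but its cost structure can be sketched digitally. In the toy loop below – an illustrative sketch with assumed sizes and an arbitrary non-linearity, not AIM's actual algorithm – every iteration is dominated by one O(n²) vector-matrix product, which is precisely the operation AIM performs in a single pass of light:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500                                        # toy number of problem variables
M = rng.standard_normal((n, n)) / np.sqrt(n)   # matrix encoding the problem
x = rng.standard_normal(n)                     # vector of problem variables

for _ in range(10_000):   # real runs use hundreds of thousands of iterations
    x = np.tanh(M @ x)    # one vector-matrix multiplication, then a cheap non-linearity
```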

Figure 1: Illustration of the AIM computer, which implements massively parallel vector-matrix multiplication using commodity optical technologies (in the back) and non-linearity applied using analog electronics (front). The vector is represented using an array of light sources, the matrix is embedded into the modulator array (shown in grayscale) and the result is collected into the camera sensor.

Figure 2: The second-generation AIM computer, with 48 variables, is a rack-mounted appliance.

Thanks to the miniaturization of all these components onto tiny centimeter-scale chips, the entire AIM computer fits into a small rack enclosure – as shown in Figure 2. As light travels incredibly fast – 5 nanoseconds per meter – each iteration within the AIM computer is significantly faster and consumes less electricity than running the same algorithm on a digital computer. Importantly, since the entire problem is embedded into the modulator matrix inside the computer itself, AIM does not require the problem to be transferred back and forth between storage and compute locations. And unlike synchronous digital computers, AIM’s operation is entirely asynchronous. These architectural choices circumvent key historical bottlenecks for digital computers. 

Finally, all technologies used in AIM are already prevalent in consumer products with existing manufacturing ecosystems, which paves the way for a viable computing platform at full scale, if the team can tame all the technical challenges.

The importance of optimization problems

Optimization problems are mathematical challenges that require finding the best possible solution from a set of feasible alternatives. The modern world relies heavily on efficient solutions to these problems – from managing electricity in our power grids and streamlining goods delivery across sea, air, and land, to optimizing internet traffic routing.

Effectively and efficiently solving optimization problems can significantly improve processes and outcomes across many other industries. Take finance, for example, where portfolio optimization involves selecting the ideal combination of assets to maximize returns while minimizing risks. In healthcare, optimizing patient scheduling can enhance resource allocation and minimize waiting times in hospitals.

For many larger problems, even the world’s biggest supercomputer would take years or even centuries to find the optimal solution. A common workaround is heuristic algorithms – problem-solving techniques that provide approximate solutions by employing shortcuts or “rules of thumb,” an example of which is sketched below. Although these algorithms might not guarantee the discovery of an optimal solution, they are the most practical and efficient methods for finding near-optimal solutions in reasonable timeframes. Now, imagine the immense impact of a computer that could deliver better solutions in a significantly shorter timeframe for these critical problems. In some instances, solving these problems in real time could create a domino effect of positive outcomes, revolutionizing entire workflows and industries.
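
As a concrete example of such a rule of thumb, the classic nearest-neighbour heuristic for the Traveling Salesman Problem simply visits the closest unvisited city at each step – fast, but with no guarantee of optimality:

```python
import math

def nearest_neighbour_tour(cities):
    """Greedy TSP heuristic: always travel to the closest unvisited city.

    Runs in O(n^2) time but may return a tour far from optimal.
    """
    unvisited = set(range(1, len(cities)))
    tour = [0]                        # start (and implicitly end) at city 0
    while unvisited:
        last = cities[tour[-1]]
        nxt = min(unvisited, key=lambda i: math.dist(last, cities[i]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

# Example: five cities on a plane
print(nearest_neighbour_tour([(0, 0), (2, 1), (5, 0), (3, 3), (1, 4)]))
```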

QUMO: a world beyond QUBO

For years, researchers, both in industry and academia, have built impressive specialized machines to efficiently solve optimization problems using heuristic algorithms. This includes an array of custom hardware, such as field programmable gate arrays (FPGAs), quantum annealers, and electrical and optical parametric oscillator systems. However, all of them rely on mapping difficult optimization problems to the same binary representation, often referred to as Ising, Max-Cut or QUBO (quadratic unconstrained binary optimization). Unfortunately, none of these efforts have provided a practical alternative to conventional computers. This is because it is very hard to map real-world optimization problems at scale to the binary abstraction, a common theme in the team’s engagement with practitioners across industry and academia.

With AIM, the team has introduced a more expressive mathematical abstraction called QUMO (quadratic unconstrained mixed optimization), which can represent mixed – binary and continuous – variables and is compatible with hardware implementation, making it the “sweet spot” for many practical, heavily constrained optimization problems. Discussions with industry experts indicate that scaling AIM to 10,000 variables would put most of the practical problems discussed earlier within reach. A problem with 10,000 variables that maps directly to the QUMO abstraction would require an AIM computer with 10,000 physical variables. In contrast, existing specialized machines would need to scale beyond a million physical variables, well beyond the capabilities of the underlying hardware.
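
The variable blow-up is easy to see with a back-of-the-envelope sketch. Approximating a single continuous variable in a binary (QUBO) formulation requires a fixed-point expansion into several bits – the 7-bit figure below matches the precision quoted in the next paragraph for the first-generation AIM – and encoding constraints as penalty terms inflates the count much further, which is how estimates beyond a million physical variables arise:

```python
# A continuous variable x in [0, 1) approximated with k binary bits:
#     x ~ sum(b[i] * 2**-(i + 1) for i in range(k))
k = 7                        # bits of precision (matching first-generation AIM)
n_qumo = 10_000              # physical variables on a QUMO-native machine
n_qubo_minimum = n_qumo * k  # binary variables from the precision expansion alone
print(n_qubo_minimum)        # 70,000 -- and quadratic couplings grow ~k**2-fold;
                             # penalty terms for constraints add many more variables
```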

AIM also implements a novel and efficient algorithm for solving such QUMO problems that relies on an advanced form of gradient descent, a technique that is also popular in machine learning. The algorithm shows highly competitive performance and accuracy across various industrially inspired problem benchmarks, and it even discovered new best-ever solutions to four problems. The first-generation AIM computer, built last year, solves QUMO optimization problems represented with an accuracy of up to 7 bits. The team, shown in Figure 3, has also demonstrated good quantitative agreement between the simulated and hardware versions of the AIM computer, building confidence in the viability of these efficiency gains as the computer is scaled up. This paper gives more details about the AIM architecture, its implementation, evaluation, and scaling roadmap.
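
The algorithm itself is detailed in the team’s paper; purely as a hedged illustration of the general idea – projected gradient descent over mixed variables, not AIM’s actual method – a quadratic objective over mixed variables could be attacked like this:

```python
import numpy as np

def qumo_gradient_descent(Q, c, is_binary, steps=5_000, lr=0.01, seed=0):
    """Toy projected gradient descent for min x^T Q x + c^T x.

    Continuous variables are clipped to [-1, 1]; binary variables are
    relaxed to the same interval and rounded to +/-1 at the end.
    (Illustrative sketch only -- not AIM's algorithm.)
    """
    x = np.random.default_rng(seed).uniform(-1, 1, len(c))
    for _ in range(steps):
        grad = 2 * Q @ x + c                 # dominated by the vector-matrix product
        x = np.clip(x - lr * grad, -1, 1)    # non-linear projection step
    return np.where(is_binary, np.where(x >= 0, 1.0, -1.0), x)
```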

Figure 3: AIM’s design involves innovation at the intersection of optical and analog hardware, mathematics and algorithms, and software and system architecture, which is typified in the cross-disciplinary nature of the team working hand-in-hand towards the mission of building a computer that solves practical problems. Photo of the AIM team – Front row (left to right): Doug Kelly, Jiaqi Chu, James Clegg, Babak Rahmani. Back row: Hitesh Ballani, George Mourgias-Alexandris, Daniel Cletheroe, Francesca Parmigiani, Lucinda Pickup, Grace Brennan, Ant Rowstron, Kirill Kalinin, Jonathan Westcott, Christos Gkantsidis. (Greg O’Shea and Jannes Gladrow do not appear in this photo.)

Rethinking optimization with QUMO: A more expressive way of reasoning for experts

AIM’s blueprint for co-designing unconventional hardware with an expressive abstraction and a new algorithm has the potential to spark a new era of optimization techniques, hardware platforms, and automated procedures for mapping problems to the QUMO abstraction. This exciting journey has already begun, with promising results from mapping problems from diverse domains like finance and healthcare to AIM’s QUMO abstraction. Recent research has already shown that the increased expressiveness of continuous variables can substantially expand the range of real-world business problems that can be tackled. To the team’s knowledge, however, AIM is the first and only hardware to natively support this abstraction.

As we venture into a new abstraction, we must also adopt new ways of thinking. It is crucial for the team to build a strong community to deeply investigate the benefits of embracing QUMO. We invite people who have previously been deterred by the limitations of binary solvers to consider the new opportunities offered by AIM’s QUMO abstraction. To facilitate this, we are releasing our AIM simulator as a service, allowing selected users to get first-hand experience. The initial users are the team’s collaborators at Princeton University and at Cambridge University. They have helped us identify several exciting problems where the AIM computer and its abstraction is a much more natural fit. We are also actively engaging with thought leaders from internal Microsoft divisions and external companies in sectors where optimization is crucial.

Together, we can drive innovation and unlock the true potential of analog optical computing for solving some of the most complex optimization problems across industries.

Research Focus: Week of June 19, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

NEW RESOURCE

Responsible AI Maturity Model

As the use of AI continues to surge, new government regulations are expected. But the organizations that build and use AI technologies needn’t wait to devise best practices for developing and deploying AI systems responsibly. Many companies have adopted responsible AI (RAI) principles as a form of self-regulation. Yet, effectively translating these principles into practice is challenging.

To help organizations identify their current and desired levels of RAI maturity, researchers at Microsoft have developed the Responsible AI Maturity Model (RAI MM). The RAI MM is a framework containing 24 empirically derived dimensions that are key to an organization’s RAI maturity, and a roadmap of maturity progression so organizations and teams can identify where they are and where they could go next.

Derived from interviews and focus groups with over 90 RAI specialists and AI practitioners, the RAI MM can help organizations and teams navigate their RAI journey, even as RAI continues to evolve.

NEW RESEARCH

FoundWright helps people re-find web content they previously discovered

Re-finding information is a common task—most online search requests involve re-finding information. However, this can be difficult when people struggle to express what they seek. People may forget the exact details of the information they want to re-find, making it hard to craft a query to locate it. People may also struggle to recover information from web repositories, such as bookmarks or history, because these do not capture enough information or support ambiguous queries. As a result, people can feel overwhelmed and cognitively exhausted when faced with a re-finding task.

A new paper from Microsoft researchers, FoundWright: A System to Help People Re-find Pages from Their Web-history, introduces a system to address these problems. FoundWright leverages recent advances in language transformer models to expand people’s ability to express what they seek by defining concepts that can attract documents with semantically similar content. The researchers used FoundWright as a design probe to understand how people create and use concepts; how this expanded ability helps re-finding; and how people engage and collaborate with FoundWright’s machine learning support. The research reveals that this expanded way of expressing re-finding goals complements traditional searching and browsing.
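
As a rough illustration of the underlying idea – not FoundWright’s actual implementation – concept-based re-finding can be thought of as ranking previously visited pages by embedding similarity to a user-defined concept. In this sketch, embed is a toy stand-in for any transformer sentence-embedding model:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a transformer sentence-embedding model.

    Returns a deterministic pseudo-random vector; a real system would
    call an actual embedding model here.
    """
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(384)

def rank_history(concept: str, pages: dict) -> list:
    """Rank visited pages (url -> text) by cosine similarity to a concept."""
    c = embed(concept)
    c /= np.linalg.norm(c)
    def score(url):
        v = embed(pages[url])
        return float(v @ c / np.linalg.norm(v))
    return sorted(pages, key=score, reverse=True)

history = {"a.example": "recipe for sourdough bread",
           "b.example": "transformer models for search"}
print(rank_history("baking at home", history))  # meaningless with toy embeddings
```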


NEW RESEARCH

Trace-Guided Inductive Synthesis of Recursive Functional Programs

In recent years, researchers have made significant advances in the synthesis of recursive functional programs, including progress in inductive synthesis of recursive programs from input-output examples. The latter problem, however, continues to pose several challenges.

In a new paper: Trace-Guided Inductive Synthesis of Recursive Functional Programs, which received a distinguished paper award from the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2023), researchers from Microsoft and Purdue University propose a novel trace-guided approach to tackle the challenges of ambiguity and generalization in synthesis of recursive functional programs from examples. This approach augments the search space of programs with recursion traces consisting of sequences of recursive subcalls of programs. It is based on a new version space algebra (VSA) for succinct representation and efficient manipulation of pairs of recursion traces and programs that are consistent with each other. The researchers implement this approach in a tool called SyRup. Evaluating SyRup on benchmarks from prior work demonstrates that it not only requires fewer examples to achieve a certain success rate than existing synthesizers, but is also less sensitive to the quality of the examples.

These results indicate that utilizing recursion traces to differentiate satisfying programs with similar sizes is applicable to a wide range of tasks.


NEW RESEARCH

Wait-Free Weak Reference Counting

Reference counting is a common approach to memory management. One challenge with reference counting is cycles that prevent objects from being deallocated. Systems such as the C++ and Rust standard libraries introduce two types of reference: strong and weak. A strong reference allows access to the object and prevents the object from being deallocated, while a weak reference only prevents deallocation. A weak reference can be upgraded to provide a strong reference if other strong references to the object exist. Hence, the upgrade operation is partial, and may fail dynamically. The classic implementation of this upgrade operation is not wait-free—it can take arbitrarily long to complete if there is contention on the reference count.
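
The paper’s setting is C++, but the partial, dynamically failing upgrade operation is easy to observe in any language with weak references – for instance, with Python’s weakref module:

```python
import weakref

class Node:
    pass

strong = Node()                # a strong reference keeps the object alive
weak = weakref.ref(strong)     # a weak reference does not

print(weak() is strong)        # True: upgrade succeeds while a strong ref exists
del strong                     # last strong reference dropped; object deallocated
print(weak())                  # None: the upgrade is partial and fails dynamically
```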

In a new paper: Wait-Free Weak Reference Counting, researchers from Microsoft propose a wait-free algorithm for weak reference counting, which requires only the primitive wait-free atomic operations compare-and-swap and fetch-and-add. The paper includes a correctness proof of the algorithm using the Starling verification tool, a full implementation in C++, and a demonstration of the best- and worst-case performance using micro-benchmarks.

The new algorithm is faster than the classic algorithm in the best case, but has an overhead in the worst case. The researchers present a more complex algorithm that effectively combines the classic algorithm and the wait-free algorithm, delivering much better performance in the worst case, while maintaining the benefits of the wait-free algorithm.


NEW RESEARCH

Disaggregating Stateful Network Functions

For security, isolation, metering, and other purposes, public clouds today implement complex network functions at every server. Today’s implementations, in software or on FPGAs and ASICs that are attached to each host, are becoming increasingly complex and costly, creating bottlenecks to scalability.

In a new paper: Disaggregating Stateful Network Functions, researchers from Microsoft present a different design that disaggregates network function processing off the host and into shared resource pools by making novel use of appliances which tightly integrate general-purpose ARM cores with high-speed stateful match processing ASICs. When work is skewed across VMs, such disaggregation can offer better reliability and performance over the state of the art, at a lower per-server cost. The paper, which was published at the 2023 USENIX Symposium on Networked Systems Design and Implementation (NSDI), includes solutions to the consequent challenges and presents results from a production deployment at a large public cloud.


NEW RESEARCH

Industrial-Strength Controlled Concurrency Testing for C# Programs with Coyote

Testing programs with concurrency is challenging because their execution is non-deterministic, making bugs hard to find, reproduce, and debug. Non-determinism can cause flaky tests—which may pass or fail without any code changes—creating a significant engineering burden on development teams. As concurrency is essential for building modern multi-threaded or distributed systems, solutions are required to help developers test their concurrent code for correctness.

Testing concurrent programs comes with two main challenges. First is the problem of reproducibility or control, while the second challenge is the state-space explosion problem. A concurrent program, even with a fixed test input, can have an enormous number of possible behaviors.
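
Coyote itself targets C#, but the core difficulty is language-agnostic. As a minimal sketch of the problem (not of Coyote’s API), a lost-update race produces a result that can differ from run to run:

```python
import threading

counter = 0

def worker():
    global counter
    for _ in range(100_000):
        tmp = counter        # read ...
        counter = tmp + 1    # ... then write; another thread may interleave here

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # often less than the expected 400000, and varies per run
```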

In a new research paper: Industrial-Strength Controlled Concurrency Testing for C# Programs with Coyote, researchers from Microsoft describe the design and implementation of the open-source tool Coyote for testing concurrent programs written in the C# language. This research won a 2023 Best Software Science Paper award from The European Association of Software Science and Technology (EASST).

Collaborators: Renewable energy storage with Bichlien Nguyen and David Kwabi

Episode 141 | June 22, 2023

Transforming research ideas into meaningful impact is no small feat. It often requires the knowledge and experience of individuals from across disciplines and institutions. Collaborators, a new Microsoft Research Podcast series, explores the relationships—both expected and unexpected—behind the projects, products, and services being pursued and delivered by researchers at Microsoft and the diverse range of people they’re teaming up with.

In this episode, Microsoft Principal Researcher Dr. Bichlien Nguyen and Dr. David Kwabi, Assistant Professor of Mechanical Engineering at the University of Michigan, join host Dr. Gretchen Huizinga to talk about how their respective research interests—and those of their larger teams—are converging to develop renewable energy storage systems. They specifically explore their work in flow batteries and how machine learning can help more effectively search the vast organic chemistry space to identify compounds with properties just right for storing waterpower and other renewables for a not rainy day. The bonus? These new compounds may just help advance carbon capture, too.

Transcript

[MUSIC PLAYS UNDER DIALOGUE]

DAVID KWABI: I’m a mechanical engineer who sort of likes to hang out with chemists.

BICHLIEN NGUYEN: I’m an organic chemist by training, and I dabble in machine learning. Bryan’s a computational chemist who dabbles in flow cell work. Anne is, uh, a purely synthetic chemist who dabbles in almost all of our aspects.

KWABI: There’s really interesting synergies that show up just because there’s people, you know, coming from very different backgrounds.

NGUYEN: Because we have overlap, we, we have lower, I’m going to call it an activation barrier, in terms of the language we speak.

[MUSIC]

GRETCHEN HUIZINGA: You’re listening to Collaborators, a Microsoft Research podcast showcasing the range of expertise that goes into transforming mind-blowing ideas into world-changing technologies. I’m Dr. Gretchen Huizinga.

[MUSIC ENDS]


Today I’m talking to Dr. Bichlien Nguyen, a Principal Researcher at Microsoft Research, and Dr. David Kwabi, an Assistant Professor of Mechanical Engineering at the University of Michigan. Bichlien and David are collaborating on a fascinating project under the umbrella of the Microsoft Climate Research Initiative that brings organic chemistry and machine learning together to discover new forms of renewable energy storage. Before we unpack the “computational design and characterization of organic electrolytes for flow batteries and carbon capture,” let’s meet our collaborators.

Bichlien, I’ll start with you. Give us a bit more detail on what you do at Microsoft Research and the broader scope and mission of the Microsoft Climate Research Initiative.

BICHLIEN NGUYEN: Thanks so much, Gretchen, for the introduction. Um, so I guess I’ll start with my background. I have a background in organic electrochemistry, so it’s quite fitting, and as a researcher at Microsoft, really, it’s my job, uh, to come up with the newest technologies and keep abreast of what is happening around me so that I can actually, uh, fuse those different technology streams together and create something that’s, um, really valuable. And so as part of that, uh, the Microsoft Climate Research Initiative was really where a group of researchers came together and said, “How can we use the resources, the computational resources, and expertise at Microsoft to enable new technologies that will allow us to get to carbon negative by the year 2050? How can we do that?” And that, you know, as part of that, um, I just want to throw out that the Microsoft Climate Research Initiative really is focusing on three pillars, right. The three pillars are being carbon accounting, because if you don’t know how much carbon is in the atmosphere, you can’t really, uh, do much to remedy it, right, if you don’t know what’s there. The other one is climate resilience. So how do people get affected by climate change, and how do we overcome that, and how can we help that with technology? And then the third is materials engineering, where, that’s where I sit in the Microsoft Climate Research Initiative, and that’s more of how do we either develop technologies that, um, are used to capture and store carbon, uh, or are used to enable the green energy transition?

HUIZINGA: So do you find yourself spread across those three? You say the last one is really where your focus is, but do you dip your toe in the other areas, as well?

NGUYEN: I love dipping my toe in all the areas because I think they’re all important, right? They’re all important. We have to really understand what the environmental impacts of all the materials, for example, that we’re making are. I mean, it’s so … so carbon accounting is really important. Environmental accounting is very important. Um, and then people are the ones that form the core, right? Why are we, do … why do we do what we do? It’s because we want to make sure that we can enable people and solve their problems.

HUIZINGA: Yeah, when you talk about carbon accounting and why you’re doing it, it makes me think about when you have to go on a diet and the doctor says, “You have to get really honest about what you’re eating. Don’t, don’t fake it.” [LAUGHS] So, David, you’re a professor at the University of Michigan, and you run the eponymous Kwabi Lab there. Tell us about your work in general. What are your research interests? Who do you work with, and what excites you most about what you do?

DAVID KWABI: Yeah, happy to! Thank you for the introduction and, and for having me on, on here today. So, um, uh, so as you said, I run the Kwabi Lab here at the University of Michigan, and the sort of headline in terms of what we’re interested in doing is that we like to design and study batteries that can store lots of renewable electricity, uh, on the grid, so, so that’s kind of our mission. Um, that’s not quite all of what we do, but it’s, uh, it’s how I like to describe it. Um, and the motivation, of course, is … comes back to what Bichlien just mentioned, is this need for us to transition from carbon-intensive, uh, ways of producing energy to renewables. And the thing about renewables is that they’re intermittent, so solar and wind aren’t there all the time. You need to find a way to store all that energy and store it cheaply for us to really make, make a dent, um, in carbon emissions from energy production. So we work on, on building systems or energy storage systems that can meet that goal, that can accomplish that task.

HUIZINGA: Both of you talked about having larger teams that support the work you’re doing or collaborate with you two as collaborators. Do you want to talk about the size and scope of those teams or, um, you know, this collaboration across collaboration?

KWABI: Yeah, so I could start with that. So my group, like you said, we’re in the mechanical engineering department, so we really are, um, we call ourselves electrochemical engineers, and electrochemistry is the science of batteries, but it’s a science of lots of other things besides that, but the interesting thing about energy storage systems or batteries in general is that you need to build and put these systems together, but they’re made of lots of different materials. And so what we like to do in my group is build and put together these systems and then essentially figure out how they perform, right. Uh, try to explore performance limits as a function of different chemistries and system configurations and so on. But the hope then is that this can inform materials chemists and computationalists in terms of what materials we want to make next, sort of, so, so there’s a lot of need for collaboration and interdisciplinary knowledge to, to make progress here.

HUIZINGA: Yeah. Bichlien, how about you in terms of the umbrella that you’re under at Microsoft Research?

NGUYEN: There are so many different disciplines, um, within Microsoft Research, but also with the team that we’re working, you know, with David. So we have actually two other collaborators from two different, I guess, departments. There’s the chemical engineering department, which Bryan Goldsmith is part of, and Anne McNeil, I believe, is part of the chemistry department. And you know, for this particular project, on flow battery electrolytes for energy storage, um, we do need a multidisciplinary team, right? We, we need to go from the atomistic, you know, simulation level all the way to the full system level. And I think that’s, that’s where, you know, that’s important.

HUIZINGA: Now that we’re on the topic of this collaboration, let’s talk about how it came about. I like to call this “how I met your mother.” Um, what was the initial felt need for the project, and who called who and said, “Let’s do some research on renewable climate-friendly energy solutions?” Bichlien, why don’t you go ahead and take the lead on this?

NGUYEN: Yeah. Um, so I’m pretty sure what happened—and David, correct me if I’m wrong—

[LAUGHTER]

HUIZINGA: Pretty sure … !

NGUYEN: I’m pretty sure, but not 100 percent sure—is that, um, while we were formulating how to … uh, what, what topics we wanted to target for the Microsoft Climate Research Initiative, we began talking to many different universities as a way to learn from them, to see what areas of interest and what areas they think are, uh, really important for the future. And one of those universities was the University of Michigan, and I believe David was one of few PIs on that initial Teams meeting. And David gave, I believe … David, was it like a 10-minute presentation? Very quick, right? Um, but it sparked this moment of, “Wow, I think we could accelerate this.”

HUIZINGA: David, how do you remember it?

KWABI: Yeah, I think I remember it. [LAUGHS] This is so almost like a, like a marriage. Like, how did you guys meet? Um, and then, and then the stories have to align in some way or …

HUIZINGA: Yeah, who liked who first?

KWABI: Yeah, exactly. Um, but yeah, I think, I think that’s what I recall. So basically, I’m part of … here at the university, I’m part of this group called the Global CO2 Initiative, uh, which is basically, uh, an institute here at the university that convenes research related to CO2 capture, CO2 utilization, um, and I believe the Microsoft team set up a meeting with the Global CO2 Initiative, which I joined, uh, in my capacity as a member, and I gave a little 10-minute talk, which apparently was interesting enough for a second look, so, um, that, that’s how the collaboration started. There was a follow-up meeting after that, and here we are today.

HUIZINGA: Well, it sounds like you’re compelling, so let’s get into it, David. Now would be a good time to talk about, uh, more detail on this research. I won’t call it Flow Batteries for Dummies, but assume we’re not experts. So what are flow batteries, what problems do they solve, and how are you going after your big research goals methodologically?

KWABI: OK, so one way to think about flow batteries is to, to think first about pumped hydro storage is, is how I like to introduce it. So, so a flow battery is just like a battery, the, the sort of battery that you have in your cell phone or laptop computer or electric vehicle, but it’s a … it has a very different architecture. Um, and to explain that architecture, I like to talk about pumped hydro. So pumped hydro is I think a technology many of us probably appreciate or know about. You have two reservoirs that, that hold water—so upper and lower reservoirs—and when you have extra electricity, or, or excess electricity, you can pump water up a mountain, if you like, from one reservoir to another. And when you need that electricity, water just flows down, spins some turbines, and produces electricity. You’re turning gravitational potential energy into electrical energy, or electricity, is the idea. And the nice thing about pumped hydro, um, is that if you want to store more energy, you just need a bigger reservoir. So you just need more water essentially, um, in, in the two reservoirs to get to longer and longer durations of storage, um, and so then this is nice because more and more water is actually … is cheap. So the, the marginal cost of your … every stored, um, unit of energy is quite low. This isn’t really the case for the source of batteries we have in our cell phones and laptop computers. So if we’re talking about grid storage, you want something like this, something that decouples energy and power, so we have essentially a low cost per unit of electricity. So, so flow batteries essentially mimic pumped hydro because instead of turning gravitational potential energy into electricity, you’re actually changing or turning chemical potential energy, if you like, into electricity. So essentially what you’re doing is just storing energy in, um, in the form of electrons that are sort of attached to molecules. So you have an electron at a really high energy that is like flowing onto another molecule that has a low energy. That’s essentially the transformation that you’re trying to do in a, in a flow battery. But that’s the analogy I like to, I like to draw. It’s sort of a high- and low-reservation, uh, reservoirs you have. High and low chemical, uh, potential energy.

HUIZINGA: So what do these do better than the other batteries that you mentioned that we’re already using for energy storage?

KWABI: So the other batteries don’t have this decoupling. So in the flow battery, you have the, the energy being stored in these tanks, and the larger the tank, the more the energy. If you want to turn that energy into, uh, chemical energy into electricity, you, you, you run it through a reactor. So the reactor can stay the same size, but the tank gets bigger and bigger and you store more energy. In a laptop battery, you don’t have that. If you want more energy, you just want more battery, and that, the cost of that is the same. So there isn’t this big cost advantage, um, that comes from decoupling the energy capacity from the power capacity.

NGUYEN: David, would you, would you also say that, um, with, you know, redox organic flow batteries, you also kind of decouple the source, right, of the, the material, the battery material, so it’s no longer, for example, a rare earth metal or precious metal.

KWABI: Absolutely. So that’s, that’s then the thing. So when you … so you’ve got … you know, imagine these large systems, these giant tanks with, with molecules that store electricity. The question then is what molecules do you choose? Because if it’s like really expensive, then your electricity is also very expensive. Um, the particular type of battery we’re looking at uses organic molecules to store that electricity, and the hope is that these organic molecules can be made very cheaply from very abundant materials, and in the end, that means that this then translates to a really low cost of electricity.

HUIZINGA: Bichlien, I’m glad you brought that up because that is a great comparison in terms of the rare earth stuff, especially lithium mining right now is a huge cost, or tax, on the environment, and the more electric we have, the more lithium we need, so there’s other solutions that you guys are, are digging around for. Were you going to say something else, Bichlien?

NGUYEN: I was just going to say, I mean, another reason why, um, we thought David’s work was so interesting is because, you know, we’re looking at, um, energy, um, storage for renewables, and so to get to this green energy economy, we’ll need a ton of renewables, and then we’ll need a ton of ways to store the energy because renewables are, you know, they’re intermittent. I mean, sometimes the rain rains all the time, [LAUGHS] and sometimes it doesn’t. It’s really dry. Um, I don’t know why I say rain, but I assume … I probably …

HUIZINGA: Because you’re in Seattle, that’s why!

NGUYEN: That’s true. But like the sun shines; it doesn’t shine. Um, yeah, the wind blows, and sometimes it doesn’t.

HUIZINGA: Or doesn’t. Yeah. … Well, let’s talk about molecules, David, um, and getting a little bit more granular here, or maybe I should say atomic. You’re specifically looking for aqueous-soluble redox-active organic molecules, and you’ve noted that they’re really hard to find, um, these molecules that meet all the performance requirements for real-world applications. In other words, you have to swipe left a lot before you get to a good [LAUGHS] match, continuing with the marriage analogy. … So what are the properties necessary that you’re looking for, and why are they so hard to find?

KWABI: So the “aqueous soluble” just means soluble in water. You want the molecule to be able to dissolve into water at really high concentrations. So that’s one, um, property. You need it to last a really long time because the hope is that these flow battery installations are going to be there for decades. You need it to store electrons at the right energy. So, uh, I mentioned you have two tanks: one tank will store electrons at high energy; the other at low energy. So you need those energy levels to be set just right in a sense, so you want a high-voltage battery, essentially. You also want the molecule to be set that it doesn’t leak from one tank to the other through the reactor that’s in the middle of the two tanks, right. Otherwise, you’re essentially losing material, which is not, uh, desirable. And you want the molecule to be cheap. So that, that’s really important, obviously, because if we’re going to do this at, um, really large scale, and we want this to be low cost, that we want something that’s abundant and cheap. And finding a molecule that satisfies all of these requirements at the same time is really difficult. Um, you can find molecules that satisfy three or four or two, but finding something that hits all the, all the criteria is really hard, as is finding a good partner. [LAUGHS]

HUIZINGA: Well, and even in, in other areas, you hear the phrase cheap, fast, good—pick two, right? So, um, yeah, finding them is hard, but have you identified some or one, or I mean where are you on this search?

KWABI: Right now, the state-of-the-art charge-storing molecule, if you like, is based on a rare earth … rare element called vanadium. So the most developed flow batteries now use vanadium, um, to store electricity. But, uh, vanadium is pretty rare in the Earth’s crusts. It’s unclear if we start to scale this technology, um, to levels that would really make an impact on climate, it’s not clear there’s enough vanadium, uh, to, to do the job. So it fulfills a bunch of, a bunch of the criteria that I just mentioned, but not, not the cheap one, which is pretty important. We’re hoping, you know, with this project, that with organic molecules, we can find examples or particular compounds that really can fulfill all of these requirements, and, um, we’re excited because organic chemistry gives us, uh … there’s a wide design space with organic molecules, and you’re starting from abundant elements, and, you know, the hope is that we can really get to something that, that, you know, we can swipe left on … is it swipe left or right? I’m not sure.

NGUYEN: I have no idea.

HUIZINGA: Swipe left means …

[LAUGHTER]

HUIZINGA: I looked it up. I, I’ve been married for a really long time, so I did look it up, and it is left if you’re not interested and right if you are, apparently on Tinder, but, uh, not to beat that horse …

KWABI: You want to swipe right eventually.

HUIZINGA: Yes. Which leads me to, uh, Bichlien. What does machine learning have to do with natural sciences like organic chemistry? Why does computation play a role here, particularly generative models, for climate change science?

NGUYEN: Yeah so what, you know, the past decade or two, um, in computer science and machine learning have taught us is that ML is really good at pattern recognition, right, being able to, um, take complex datasets and pull out the most type … you know, relevant, uh, trends and information, and, and it’s good at classifying … you know, used as a classification tool. Uh, and what we know about nature is that nature is full of patterns, right. We see repeating patterns all the time in nature, at many different scales. And when we think, for example, of all of the combinations of carbons, carbon organic molecules, that you could make, you see around 10⁶⁰; it’s estimated to be 10⁶⁰. Um, and those are all connected somehow, you know, in this large, you know, space, this large, um, distribution. And we want to, for example, as David mentioned, we want to check the boxes on all these properties. So what is really powerful, we believe, is that generative models will allow us to sample this, this organic chemistry space and allow us to condition the outputs of these models on these properties that David wants to checkmark. And so in a way, it’s allowing us to do more efficient searching. And I like to think about it as like you’re trying to find a needle, right, in the ocean, and the ocean’s pretty vast; needles are really small. And instead of having the size of the ocean as your search space, maybe you have the size of a bathtub and so that we can narrow down the search space and then be able to test and validate some of the, the molecules that come out.

HUIZINGA: So do these models then eliminate a lot of the options, making the pool smaller? Is that how that works to make it a bathtub instead of an ocean?

NGUYEN: I, I wouldn’t say eliminates, but it definitely tells you where you should be … it helps focus where you’re searching.

HUIZINGA: Right, right, right. Well, David, you and I talked briefly and exchanged some email on “The Elements” song by Tom Lehrer, and it’s, it’s a guy who basically sings the entire periodic chart of the elements, really fast, to the piano. But at the end, he mentions the fact that there’s a lot that haven’t been discovered. There’s, there’s blanks in the chart. And so I wonder if, you know, this, this search for molecules, um, it just feels like is there just so much more out there to be discovered?

KWABI: I don’t know if there’s more elements to be discovered, per se, but certainly there’s ways of combining them in ways that produce …

HUIZINGA: Aaahhhhh …

KWABI: … new compounds or compounds with properties that, that we’re looking for, for example, in this project. So, um, that’s, I think, one of the things that’s really exciting about, uh, about this particular endeavor we’re, we’re, we’re engaged in. So, um, one of the ways that people have traditionally thought about finding new molecules for flow batteries is, you know, you go into the lab or you go online and order a chemical that you think is going to be promising [LAUGHS] … some people I know have done this, uh, myself included … but you, you order a chemical that you think is promising, you throw it in the flow battery, and then you figure out if it works or not, right. And if it doesn’t work, you move on to the next compound, or you, um …

NGUYEN: You tweak it!

KWABI: … if it does work, you publish it. Yeah, exactly—you tweak it, for example. Um, but one of the, one of the questions that we get to ask in this project is, well, rather than think about starting from a molecule and then deciding or figuring out whether it works, we, we actually start from the criteria that we’re looking for and then figure out if we can intelligently design, um, a molecule based on the criteria. Um, so it’s, it’s, uh, I think a more promising way of going about discovering new molecules. And, as Bichlien’s already alluded to, with organic chemistry, the possibilities are endless. We’ve seen this already in, like, the pharmaceutical industry for example, um, and lots of other industries where people think about, uh, this combinatorial problem of, how do I get the right structure, the right compound, that solves the problem of, you know, killing this virus or whatever it is. We’re hoping to do something similar for, uh, for flow batteries.

HUIZINGA: Yeah, in fact, as I mentioned at the very beginning of the show, you titled your proposal “The computational design and characterization of organic electrolytes for flow batteries,” so it’s kind of combining all of that together. David, sometimes research has surprising secondary uses. You start out looking for one thing and it turns out to be useful for something else. Talk about the dual purposes of your work, particularly how flow batteries both store energy and work as a sort of carbon capture version of the Ghostbusters’ Ecto-Containment Unit. [LAUGHS]

KWABI: Sure. Yeah, so this is where I sort of confess and say I wasn’t completely up front in the beginning when I said all we do is energy storage, but, um, another, um, application we’re very interested in is carbon capture in my group. And with regard to flow batteries, it turns out that you, you actually can take the same architecture that you use for a flow battery and actually use it to, to capture carbon, CO2 in particular. So the way this would work is, um, it turns out that some of the molecules that we’ve been talking about, some of the organic molecules, when you push an electron onto them—so you’re storing energy and you push an electron onto them—it turns out that some of these molecules also absorb hydrogen ions from water so those two processes sort of happen together. You push an electron onto the molecule, and then it picks up a hydrogen ion from water. Um, and if you remember anything about something from your chemistry classes in high school, that changes the pH of water. If you remove protons, uh, from water, that makes water more basic, or more alkaline. And alkaline electrolytes or alkaline water actually absorbs or reacts with CO2 to make bicarbonate. So that’s a chemical reaction that can serve as a mode, or a mechanism, for removing CO2 from, from the environment, so it could be air, or it could be, uh, flue gas or, you know, exhaust gas from a power plant. So imagine you, you run this process, you push the electron onto the molecule, you change the pH of the solution, you remove CO2 … that can then … you can actually concentrate that CO2 and then run the opposite reaction. So you pull the electron off the molecule; that then dumps protons back into solution, and then you can release all this pure CO2 all of a sudden. So, so now what you can do is take a flow battery that stores energy, but also, uh, use it to separate CO2, separate and concentrate CO2 from a, from a gaseous source. So this is, um, some work that we’ve been pursuing sort of in parallel with our work on energy storage, and the hope is that we can find molecules that, in principle, maybe could do both—could do the energy storage and also help with, uh, with CO2 separation.

HUIZINGA: Bichlien, is that part of the story that was attractive to Microsoft in terms of both storage for energy and getting rid of CO2 in the environment?

NGUYEN: Yeah, absolutely. Absolutely. Of course, the properties of, of, you know, both CO2 capture and the energy storage components are sometimes somewhat, uh—David, correct me if I’m wrong—kind of divergent. It’s, it’s hard to optimize for one and have the other one optimized, too. So it’s really a balance of, of things, and we’re targeting, just right now, for this project, our joint project, the, the energy storage aspect.

HUIZINGA: Yeah. On that note, and either one of you can take this, what do you do with it? I mean, when I, when I used the Ghostbusters’ Ecto-Containment Unit, I was being direct. I mean, you got to put it somewhere once you capture it, whether it’s storing it for good use or getting rid of it for bad use. So how are you managing that?

KWABI: Great question, so … Bichlien, were you going to … were you going to go?

NGUYEN: Oh, I mean, yeah, I was going to say that there are many ways, um, for I’ll call it CO2 abatement, um, once you have it. Um, there are people who are interested in storing it underground, so, uh, mineralizing it in basalt formations, rock formations. There are folks, um, like me, who are interested in, you know, developing new catalysts so that we can convert CO2 to different renewable feedstocks that can be used in different materials like different plastics, different, um, you know, essentially new fuels, things of that nature. And then there’s, you know, commercial applications for pure streams of CO2, as well. Uh, yeah, so I, I would say there’s a variety of things you can do with CO2.

HUIZINGA: What’s happening now? I mean, where does that generally … David we, I, I want to say, we talked about this issue, um, when we met before on some of the downsides of what’s current.

KWABI: Yeah, so currently, um, so there’s, as Bichlien has mentioned, there’s a number of things you could do with it. But right now, of all the sort of large projects that have been set up, large pilot plants for CO2 capture that have been set up, I think the main one is enhanced oil recovery, which is a little bit controversial, um, because what you’re doing with the CO2 there is you’re pumping it underground into an oil field that has become sort of less productive over time. And the goal there is to try to coax a little bit more oil, um, out of this field. So, so you pump the CO2 underground, it mixes in with the oil, and then you … that, that sort of comes back up to the surface, and you separate the CO2 from the oil, and you can, you can go off and, um, use the oil for whatever you use it for. So, so the economically attractive thing there is, there’s, uh, there’s, there’s going to be some sort of payoff. There’s a reason, a commercial incentive, for separating the CO2, uh, but of course the problem is you’re removing oil from the … you’re, you’re extracting more oil that’s going to end up with … in more CO2 emissions. So, um, there are, in principle, many potential options, but there aren’t very many that have both the sort of commercial … uh, where there’s sort of a commercial impact and there’s also sort of the scale to take care of the, you know, the gigatons of CO2 that we’re going to have to draw down, basically, so … .

NGUYEN: Yeah. And I, I think, I mean, you know, to David’s point, that’s true—that, that is what’s happening, you know, today because it provides value, right? The issue, I think, with CO2 capture and storage is that while there’s global utility, there’s no monetary value to it right now. Um, and so it makes it a challenge in terms of being able to industrialize, you know, industries to take care of the CO2. But I, I, I think, you know, as part of the MCRI initiative, you know, we’re very interested in both the carbon capture and the utilization aspect, um, and utilization would mean utilizing the CO2 in productive ways for long-term storage, so think about maybe using CO2, um, converting it electrochemically, for example, into, uh, different monomers. Those monomers maybe could be used in new plastics for long-term storage. Uh, maybe those are recyclable plastics. Maybe they’re plastics that are easily biodegradable. But, you know, one of the issues with using, or manufacturing, is there’s always going to be energy associated with manufacturing. And so that’s why we care a lot about renewables and, and the green energy transition. And, and that’s why, uh, you know, we’re collaborating with David and his team as, as part of that. It’s really full circle. We have to really think about it on a systems level, and the collaboration with David is, is one part of that system.

HUIZINGA: Well, that leads beautifully, and that’s probably an odd choice of words for this question, but it seems like “solving for X” in climate change is a no-lose proposition. It’s a good thing to do. But I always ask what could possibly go wrong, and in this case, I’m thinking about other solutions, some of which you’ve already mentioned, that had seemed environmentally friendly at first, but turned out to have unforeseen environmental impacts of their own. So even as you’re exploring new solutions to renewable energy sources, how are you making sure, or how are you mitigating, harming the environment while you’re trying to save it?

KWABI: That’s a great question. So it’s, it’s something that I think isn’t traditionally, at least in my field, isn’t traditionally sort of part of the “solve for X” when people are thinking about coming up with a new technology or new way of storing renewable electricity. So, you know, in our particular case, one of the things that’s really exciting about the project we’re working on is we’re looking at molecules that are fairly already quite ubiquitous, so, so they’re already being used in the food and textile industry, for example, derivatives of the molecules we’re using. So, you know, thinking about the materials you’re using and the synthetic routes that are necessary to produce them is sort of a pitfall that one can easily sort of get into if you don’t start thinking about this question at the very beginning, right? You might come up with a technology that’s, um, appealing and that works really well, performance-wise, but might not be very recyclable or might have some difficulties in terms of extraction and so on and so forth. So lithium-ion batteries, for example, come to mind. I think you were alluding to this earlier, that, you know, they’re a great technology for electric vehicles, but mining cobalt, extracting cobalt, comes with a lot of, um, just negative impacts in terms of child labor and so on in the Congo, et cetera. So how, how do we, you know, think about, you know, materials that don’t … that sort of avoid this? And I’ll, I’ll just highlight as one of our team members … so Anne McNeil, who’s in the chemistry department here, thinks quite a lot about this, and that’s appropriate because she’s sort of the synthetic chemist on the team. She’s the one who’s thinking a lot about, you know, given we have this molecule we want to make, what’s the most eco-friendly, sustainable route to making that molecule with materials that don’t require, you know, pillaging and polluting the earth to do it in a sense, right. And also materials … and also making it in a way that, you know, at end of life, it can be potentially recycled, right.

HUIZINGA: Right.

KWABI: So thinking about sustainable routes to making these molecules and potential sort of ways of recycling them are things that, um, we’re, we’re trying to, in some sense, to take into consideration. And by we, I mean Anne, specifically, is thinking quite seriously about …

NGUYEN: David … David, can I put words in your mouth?

KWABI: But, yeah. … Yeah, sure, go ahead.

NGUYEN: Um, you’re, you’re thinking of sustainability as being a first design principle for …

KWABI: Yes! I would take those words! Exactly.

NGUYEN: OK. [LAUGHS] Yeah, I mean, that’s really important. I, I agree and second what David said.

HUIZINGA: Bichlien, when we talked earlier, the term co-optimization came up, and I want to dig in here a little bit because whenever there’s a collaboration, each discipline can learn something from the other, but you can also learn about your own in the process. So what are some of the benefits you’ve experienced working across the sciences here for this project? Um, could you provide any specific insights or learnings from this project?

NGUYEN: I mean, I, I think maybe a naive … something that maybe seems naive is that we definitely have to work together in all three disciplines because what we’re also learning from David and Bryan is that there are different experimental and computational timelines that sometimes don’t agree, and sometimes do agree, and we really have to, uh, you know, work together in order to create a unified, I’m not going to call it a roadmap, but a unified research plan that works for everyone. For example, um, it takes much longer to run an experiment to synthesize a molecule … I, I think it takes much longer to synthesize a molecule than, for example, to run a, uh, flow cell, um, experiment. And then on the computational side, you could probably run it, you know, at night, on a weekend, you know, have it done relatively soon, generate molecules. And one of the things that we’re, you know, coming to understand is that the human feedback and the computational feedback, um, take a lot of balancing to make sure that we’re on the same track.

HUIZINGA: What do you think, David?

KWABI: Yeah, I think that’s definitely accurate, um, figuring out how we can work together in a way that sort of acknowledges these timelines is really important. And I think … I’m a big believer in the fact that people from somewhat different backgrounds working together, the diversity of background, actually helps to bring about, you know, really great innovative solutions to things. And there’s various ways that this has sort of shown up in our own work, I think, and in our, in our discussions. Like, you know, we’re currently working on a particular sort of molecular structure for, uh, for a compound that we think will be promising at storing electricity, and the way we, we came about with it is that my, my group, you know, we ran a flow cell and we saw some data that seemed to suggest that the molecule was decomposing in a certain way, and then Anne’s group, or one of Anne’s students, proposed a mechanism for what might be happening. And then Jake, who works with Bichlien, also … and then thought about, “Well, what, what about this other structure?” So that sort of … and then that’s now informing some of the calculations that are going on, uh, with Bryan. So there’s really interesting synergies that show up just because there’s people working from, you know, coming from very different backgrounds. Like I’m a mechanical engineer who sort of likes to hang out with chemists and, um, there’s actual chemists and then there’s, you know …

NGUYEN: But, David, I think …

KWABI: … the people who do computation, and so on …

NGUYEN: I think you’re absolutely right here in terms of the overlap, too, right? Because in a, in a way, um, I’m an organic chemist by training, and I dabble in machine learning. You’re a mechanical engineer who dabbles in chemistry. Uh, Bryan’s a computational chemist who dabbles in flow cell work. Uh, Anne is, uh, you know, a purely synthetic chemist who dabbles in, you know, almost all of our aspects. Because we have overlap, we have lower, I’m going to call it an activation barrier, [LAUGHS] in terms of the language we speak. I think that is something that, you know, we have to speak the same language, um, so that we can understand each other. And sometimes that can be really challenging, but oftentimes, it’s, it’s not.

HUIZINGA: Yeah, David, all successful research projects begin in the mind and make their way to the market. Um, where does this research sit on that spectrum from lab to life, and how fast is it moving as far as you’re concerned?

KWABI: Do you mean the research, uh, in general or this project?

HUIZINGA: This project, specifically.

KWABI: OK, so I’d say we’re, we’re still quite early at this stage. So there’s a system of classification called Technology Readiness Level, and I would say we’re probably on the low end of the scale, I don’t know, maybe like a 1 or 2.

NGUYEN: We just started six months ago!

KWABI: We just started six months ago! So …

[LAUGHTER]

HUIZINGA: OK, that’s early. Wait, how many levels are there? If there’s 1 or 2, what’s the high end?

KWABI: I think we go up to an 8 or so, an 8 or a 9. Um, so, so we’re quite early; we just started. But the, the nice thing about this field is that things can move really quickly. So in a year or two, who knows where we’ll be? Maybe four or five, but things are still early. There’s a lot of fundamental research right now that’s happening …

HUIZINGA: Which is so cool.

KWABI: Proof of concept. Which is necessary, I think, before you can get to the, the point where you’re, um, you’re spinning out a company or, or moving up to larger scales.

HUIZINGA: Right. Which lives very comfortably in the academic world. Bichlien, Microsoft Research is sort of a third space where they allow for some horizon on that scale in terms of how long it’s going to take this to be something that could be financially viable for Microsoft. Is that just not a factor right now? It’s just like, let’s go, let’s solve this problem because this is super-important?

NGUYEN: I guess I’ll say that it takes roughly 20 years or so to get a proof of concept into market at an industrial scale. So, I’m … what I’m hoping that with this collaboration, and with others, is that we can shorten the time for discovery so that we understand the fundamentals and we have a good baseline of what we think can be achieved so that we can go to, for example, a pilot scale, like a test scale, outside of the laboratory, not full industrial scale, but just a pilot scale much faster than we would if we had to hand iterate every single molecule.

HUIZINGA: So the generative models play a huge role in that shortening of the time frame …

NGUYEN: Yes, yes. That’s what we …

KWABI: Yeah, I think …

NGUYEN: Go ahead, David.

KWABI: Yeah. I think the idea of having a platform … so, so rather than, you know, you found this wonderful, precious molecule that you’re going to make a lot of, um … you know, having a platform that can generate molecules, right, I think is, you know, proving that this actually works gives you a lot more shots on goal, basically. And I think that, you know, if we’re able to show that, in the next year or two, that there’s, there’s a proof of concept that this can go forward, then um, then, in principle, we have many more chemistries to work with and play with, than the …

NGUYEN: Yeah, and, um, we might also be able to, you know, with, with this platform, discover molecules that have that dual purpose, right, of both energy storage and carbon capture.

HUIZINGA: Well, as we wrap up, I’d love to know in your fantastical ideal preferred future, what does your work look like … now, I’m going to say five to 10 years, but, Bichlien, you just said 20 years, [LAUGHS] so maybe I’m on the short end of it here. In the “future,” um, how have you changed the landscape of eco-friendly, cost-effective energy solutions?

KWABI: That’s a, that’s a big question. I, I tend to think in more two–, three–year timelines sometimes. [LAUGHS] But I think in, in, in, you know, in like five, 10 years, if this research leads to a company that’s sort of thriving and demonstrating that flow batteries can really make an impact in terms of low-cost energy storage, that would have been a great place to land. I mean that and the demonstration that you, you know, with artificial intelligence, you can create this platform that can, uh, custom design molecules that fulfill these criteria. I think that would be, um, that would be a fantastic outcome.

HUIZINGA: Bichlien, what about you?

NGUYEN: So I think in one to two years, but I also think about the 10-to-20-year timeline, and what I’m hoping for is, again, to demonstrate the value of AI in order to enable a carbon negative economy so that we can all benefit from it. It sounds very … a polished answer, but I, I really think there are going to be accelerations in this space that’s enabled by these new technologies that are coming out.

HUIZINGA: Hmm.

NGUYEN: And I hope so! We have to save the planet!

KWABI: There’s a lot more to AI than ChatGPT and, [LAUGHS] you know, language models and so on, I think …

HUIZINGA: That’s a perfect way to close the show. So … Bichlien Nguyen and David Kwabi, thank you so much for coming on. It’s been delightful—and informative!

NGUYEN: Thanks, Gretchen.

KWABI: Thank you very much.

The post Collaborators: Renewable energy storage with Bichlien Nguyen and David Kwabi appeared first on Microsoft Research.

Improving Subseasonal Forecasting with Machine Learning

This content was previously published by Nature Portfolio and Springer Nature Communities on Nature Portfolio Earth and Environment Community.

Improving our ability to forecast the weather and climate is of interest to all sectors of the economy and to government agencies from the local to the national level. Weather forecasts zero to ten days ahead and climate forecasts seasons to decades ahead are currently used operationally in decision-making, and the accuracy and reliability of these forecasts have improved consistently in recent decades (Troccoli, 2010). However, many critical applications – including water allocation, wildfire management, and drought and flood mitigation – require subseasonal forecasts with lead times in between these two extremes (Merryfield et al., 2020; White et al., 2017).

While short-term forecasting accuracy is largely sustained by physics-based dynamical models, these deterministic methods have limited subseasonal accuracy due to chaos (Lorenz, 1963). Indeed, subseasonal forecasting has long been considered a “predictability desert” due to its complex dependence on both local weather and global climate variables (Vitart et al., 2012). Recent studies, however, have highlighted important sources of predictability on subseasonal timescales, and the focus of several recent large-scale research efforts has been to advance the subseasonal capabilities of operational physics-based models (Vitart et al., 2017; Pegion et al., 2019; Lang et al., 2020). Our team has undertaken a parallel effort to demonstrate the value of machine learning methods in improving subseasonal forecasting.

The Subseasonal Climate Forecast Rodeo

To improve the accuracy of subseasonal forecasts, the U.S. Bureau of Reclamation (USBR) and the National Oceanic and Atmospheric Administration (NOAA) launched the Subseasonal Climate Forecast Rodeo, a yearlong real-time forecasting challenge in which participants aimed to skillfully predict temperature and precipitation in the western U.S. two-to-four weeks and four-to-six weeks in advance. Our team developed a machine learning approach to the Rodeo and a SubseasonalRodeo dataset for training and evaluating subseasonal forecasting systems.

Week 3-4 temperature forecasts and observations for February 5th, 2018. Upper left: Our Rodeo submission. Upper right: Realized temperature anomalies. Bottom left: Forecast of the U.S. operational dynamical model, Climate Forecasting System v2. Bottom right: A standard meteorological forecasting method used as a Rodeo baseline.

Our final Rodeo solution was an ensemble of two nonlinear regression models. The first integrates a diverse collection of meteorological measurements and dynamic model forecasts and prunes irrelevant predictors using a customized multitask model selection procedure. The second uses only historical measurements of the target variable (temperature or precipitation) and introduces multitask nearest neighbor features into a weighted local linear regression. Each model alone outperforms the debiased operational U.S. Climate Forecasting System version 2 (CFSv2), and, over 2011-2018, an ensemble of our regression models and debiased CFSv2 improves debiased CFSv2 skill by 40%-50% for temperature and 129%-169% for precipitation. See our write-up Improving Subseasonal Forecasting in the Western U.S. with Machine Learning for more details. While this work demonstrated the promise of machine learning models for subseasonal forecasting, it also highlighted the complementary strengths of physics- and learning-based approaches and the opportunity to combine those strengths to improve forecasting skill.
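
To make the flavor of the second model concrete, here is a minimal sketch of a weighted local linear regression in which the historical dates most similar to the forecast target receive the largest weights. The features, weighting scheme, and half_life parameter are illustrative assumptions, not the team’s Rodeo implementation.

```python
import numpy as np

def local_linear_forecast(X_hist, y_hist, x_target, half_life=10.0):
    """Weighted local linear regression: training examples whose
    features are closest to the target date's features receive the
    largest weights in a least-squares fit."""
    # Similarity-based weights: nearer historical examples count more.
    dists = np.linalg.norm(X_hist - x_target, axis=1)
    w = np.exp(-dists / half_life)

    # Weighted least squares with an intercept column.
    A = np.hstack([X_hist, np.ones((len(y_hist), 1))])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(sw[:, None] * A, sw * y_hist, rcond=None)
    return np.append(x_target, 1.0) @ coef

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))   # stand-in for lagged temperature anomalies
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)
print(local_linear_forecast(X, y, X[-1]))
```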

Adaptive Bias Correction (ABC)

To harness the complementary strengths of physics- and learning-based models, we next developed a hybrid dynamical-learning framework for improved subseasonal forecasting. In particular, we learn to adaptively correct the biases of dynamical models and apply our novel adaptive bias correction (ABC) to improve the skill of subseasonal temperature and precipitation forecasts.

At subseasonal lead times, weeks 3-4 and 5-6, ABC doubles or triples the forecasting skill of leading operational dynamical models from the U.S. (CFSv2) and Europe (ECMWF).

ABC is an ensemble of three new low-cost, high-accuracy machine learning models: Dynamical++, Climatology++, and Persistence++. Each model trains only on past temperature, precipitation, and forecast data and outputs corrections for future forecasts tailored to the site, target date, and dynamical model. Dynamical++ and Climatology++ learn site- and date-specific offsets for dynamical and climatological forecasts by minimizing forecasting error over adaptively-selected training periods. Persistence++ additionally accounts for recent weather trends by combining lagged observations, dynamical forecasts, and climatology to minimize historical forecasting error for each site.
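
As a rough illustration of the correction step, the sketch below learns per-site additive offsets from trailing forecast errors and adaptively selects the training window that would have minimized error on a held-out recent date. The ABC estimators are more sophisticated; the function names and window candidates here are assumptions for illustration.

```python
import numpy as np

def learn_offsets(fcst, obs, window):
    """Per-site additive correction from the trailing `window`
    forecast errors. fcst, obs: (dates, sites) arrays."""
    return (obs[-window:] - fcst[-window:]).mean(axis=0)

def pick_window(fcst, obs, candidates=(4, 8, 26, 52)):
    """Adaptively select the trailing window whose offsets would have
    minimized error on the most recent held-out date."""
    def holdout_mse(w):
        offs = learn_offsets(fcst[:-1], obs[:-1], w)
        return np.mean((obs[-1] - (fcst[-1] + offs)) ** 2)
    return min(candidates, key=holdout_mse)

rng = np.random.default_rng(0)
obs = rng.normal(size=(104, 10))                           # two years, ten sites
fcst = obs + 0.7 + rng.normal(scale=0.3, size=obs.shape)   # biased raw model
w = pick_window(fcst, obs)
corrected = fcst[-1] + learn_offsets(fcst, obs, w)
print(w, np.abs(corrected - obs[-1]).mean())
```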

ABC can be applied operationally as a computationally inexpensive enhancement to any dynamical model forecast, and we use this property to substantially reduce the forecasting errors of eight operational dynamical models, including the state-of-the-art ECMWF model.

ABC can be applied operationally as a computationally inexpensive enhancement to any dynamical model forecast.

A practical implication of these improvements for downstream decision-makers is an expanded geographic range for actionable skill, defined here as spatial skill above a given sufficiency threshold. For example, we vary the weeks 5-6 sufficiency threshold from 0 to 0.6 and find that ABC consistently boosts the number of locales with actionable skill over both raw and operationally-debiased CFSv2 and ECMWF.
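
Concretely, the “actionable skill” count is simple arithmetic: tally the locales whose skill clears the sufficiency threshold. The skill values below are invented purely for illustration.

```python
import numpy as np

def actionable_locales(skill_by_site, threshold):
    """Number of locales whose forecast skill exceeds the
    sufficiency threshold."""
    return int(np.sum(skill_by_site > threshold))

skill_abc   = np.array([0.35, 0.10, 0.52, 0.41, 0.05])  # illustrative only
skill_cfsv2 = np.array([0.22, 0.02, 0.30, 0.18, 0.01])
for thr in (0.0, 0.2, 0.4, 0.6):
    print(f"threshold {thr}: ABC {actionable_locales(skill_abc, thr)}, "
          f"CFSv2 {actionable_locales(skill_cfsv2, thr)}")
```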

ABC consistently boosts the number of locales with forecasting accuracy above a given skill threshold, an important property for operational decision-making in water allocation, wildfire management, and drought and flood mitigation.

We couple these performance improvements with a practical workflow for explaining ABC skill gains using Cohort Shapley (Mase et al., 2019) and identifying higher-skill windows of opportunity (Mariotti et al., 2020) based on relevant climate variables.

Our “forecast of opportunity” workflow explains ABC skill gains in terms of relevant climate variables observable at forecast time; the accompanying figure shows (a) the impact of hgt_500_pc1 on ABC skill improvement and (b) the forecast with the largest hgt_500_pc1 impact.

To facilitate future deployment and development, we also release our model and workflow code through the subseasonal_toolkit Python package.

The SubseasonalClimateUSA dataset

To train and evaluate our contiguous US models, we developed a SubseasonalClimateUSA dataset housing a diverse collection of ground-truth measurements and model forecasts relevant to subseasonal timescales. The SubseasonalClimateUSA dataset is updated regularly and publicly accessible via the subseasonal_data package. In SubseasonalClimateUSA: A Dataset for Subseasonal Forecasting and Benchmarking, we used this dataset to benchmark ABC against operational dynamical models and seven state-of-the-art deep learning and machine learning methods from the literature. For each subseasonal forecasting task, ABC and its component models provided the best performance.
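
For readers who want to experiment, the dataset is distributed through the subseasonal_data package. The snippet below sketches the intended usage; the distribution name, loader function, and dataset identifier are assumptions to verify against the package documentation.

```python
# pip install subseasonal-data  (distribution name assumed from the post;
# see the project README for the authoritative install and dataset names)
from subseasonal_data import data_loaders

# Illustrative call: load a ground-truth variable as a pandas DataFrame.
# The identifier string below is an assumption, not a verified name.
gt = data_loaders.get_ground_truth("us_tmp2m_1.5x1.5")
print(gt.head())
```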

Percentage improvement in accuracy over operationally-debiased dynamical CFSv2 forecasts. ABC consistently outperforms standard meteorological baselines (Persistence and Climatology) and 7 state-of-the-art machine learning and deep learning methods from the literature.

Online learning with optimism and delay

To provide more flexible and adaptive model ensembling in the operational setting of real-time climate and weather forecasting, we developed three new optimistic online learning algorithms — AdaHedgeD, DORM, and DORM+ — that require no parameter tuning and have optimal regret guarantees under delayed feedback.

Each year, the PoolD online learning algorithms produce ensemble forecasts with accuracy comparable to the best individual model in hindsight, despite receiving only 26 observations per year.

Our open-source Python implementation, available via the PoolD library, provides simple strategies for combining the forecasts of different subseasonal forecasting models, adapting the weights of each model based on real-time performance. See our write-up Online Learning with Optimism and Delay for more details.
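
The setting is easy to sketch: play ensemble weights now, then update them only when a forecast’s loss is finally revealed weeks later. The toy below uses plain exponential weights with a fixed learning rate under delayed feedback; the PoolD algorithms additionally remove the learning-rate tuning and add optimism, so treat this as an illustration of the problem, not their implementation.

```python
import numpy as np
from collections import deque

def delayed_hedge(loss_stream, n_models, delay, eta=0.1):
    """Exponential-weights ensembling under delayed feedback:
    weights are played immediately but updated only when a loss
    vector arrives `delay` rounds later."""
    w = np.ones(n_models) / n_models
    pending = deque()                    # losses awaiting revelation
    for losses in loss_stream:
        yield w.copy()
        pending.append(losses)
        if len(pending) > delay:         # feedback from an old round
            w *= np.exp(-eta * pending.popleft())
            w /= w.sum()

# 26 rounds per year, 3 candidate models, feedback delayed 2 rounds
rng = np.random.default_rng(1)
stream = (rng.random(3) for _ in range(26))
weights = list(delayed_hedge(stream, n_models=3, delay=2))
print("final weights:", np.round(weights[-1], 3))
```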

Looking forward

We’re excited to continue exploring machine learning applied to subseasonal forecasting on a global scale, and we hope that our open-source packages will facilitate future subseasonal development and benchmarking. If you have ideas for model or dataset development, please contribute to our open-source Python code or contact us!

The post Improving Subseasonal Forecasting with Machine Learning appeared first on Microsoft Research.

Accounting for past imaging studies: Enhancing radiology AI and reporting

The use of self-supervision from image-text pairs has been a key enabler in the development of scalable and flexible vision-language AI models, not only in general domains but also in biomedical domains such as radiology. The goal in the radiology setting is to produce rich training signals without requiring manual labels so that the models can learn to accurately recognize and locate findings in the images and relate them to content in radiology reports.

Radiologists use radiology reports to describe imaging findings and offer a clinical diagnosis or a range of possible diagnoses, all of which can be influenced by considering the findings on previous imaging studies. In fact, comparisons with previous images are crucial for radiologists to make informed decisions. These comparisons can provide valuable context for determining whether a condition is a new concern or, if preexisting, whether it is improving, deteriorating, or stable, and they can inform more appropriate treatment recommendations. Despite the importance of comparisons, current AI solutions for radiology often fall short in aligning images with report data because of the lack of access to prior scans. Current AI solutions also typically fail to account for the chronological progression of disease or imaging findings often present in biomedical datasets. This can lead to ambiguity in the model training process and can be risky in downstream applications such as automated report generation, where models may make up temporal content without access to past medical scans. In short, this limits the real-world applicability of such AI models to empower caregivers and augment existing workflows.

In our previous work, we demonstrated that multimodal self-supervised learning of radiology images and reports can yield significant performance improvement in downstream applications of machine learning models, such as detecting the presence of medical conditions and localizing these findings within the images. In our latest study, which is being presented at the 2023 IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), we propose BioViL-T, a self-supervised training framework that further increases the data efficiency of this learning paradigm by leveraging the temporal structure present in biomedical datasets. This approach enables the incorporation of temporal information and has the potential to perform complementary self-supervision without the need for additional data, resulting in improved predictive performance.

Our proposed approach can handle missing or spatially misaligned images and can potentially scale to process a large number of prior images. By leveraging the existing temporal structure available in datasets, BioViL-T achieves state-of-the-art results on several downstream benchmarks. We’ve made both our models and source code open source, allowing for a comprehensive exploration and validation of the results discussed in our study. We’ve also released a new multimodal temporal benchmark dataset, MS-CXR-T, to support further research into longitudinal modeling of medical images and text data.

Connecting the data points

Solving for the static case in vision-language processing—that is, learning with pairs of single images and captions—is a natural first step in advancing the field. So it’s not surprising that current biomedical vision-language processing work has largely focused on tasks that are dependent on features or abnormalities present at a single point in time—what is a patient’s current condition, and what is a likely diagnosis?—treating image-text pairs such as x-rays and corresponding reports in today’s datasets as independent data points. When prior imaging findings are referenced in reports, that information is often ignored or removed in the training process. Further, a lack of publicly available datasets containing longitudinal series of imaging examinations and reports has also challenged the incorporation of temporal information into medical imaging benchmarks.

Thanks to our early and close collaboration with practicing radiologists and our long-standing work with Nuance, a leading provider of AI solutions in the radiology space that was acquired by Microsoft in 2022, we’ve been able to better understand clinician workflow in the radiological imaging setting. That includes how radiology data is created, what its different components are, and how routinely radiologists refer to prior studies in the context of interpreting medical images. With these insights, we were able to identify temporal alignment of text across multiple images as a clinically significant research problem. To ground, or associate, report information such as “pleural effusion has improved compared to previous study” with the imaging modality requires access to the prior imaging study. We were able to tackle this challenge without gathering additional data or annotations.

As an innovative solution, we leveraged the metadata from de-identified public datasets like MIMIC-CXR. This metadata preserves the original order and intervals of studies, allowing us to connect various images over time and observe disease progression. Developing smarter, more data-efficient solutions is especially important in the healthcare space, where data sources are scarce, if we want to develop meaningful AI solutions.

Figure 1: The proposed self-supervised training framework BioViL-T leverages pairs of radiology reports and sequences of medical images. The training scheme does not require manual expert labels and can scale to a large amount of radiology data to pretrain image and text models required for downstream clinical applications.

Addressing the challenges of longitudinal analysis

With current and prior images now available for comparison, the question became, how can a model reason about images coming from different time points? Radiological imaging, especially with planar techniques like radiographs, may show noticeable variation. This can be influenced by factors such as the patient’s posture during capture and the positioning of the device. Notably, these variations become more pronounced when images are taken with longer time gaps in between. To manage variations, current approaches to longitudinal analysis, largely used for fully supervised learning of image models only, require extensive preprocessing, such as image registration, a technique that attempts to align multiple images taken at different times from different viewpoints. In addition to better managing image variation, we wanted a framework that could be applied to cases in which prior images weren’t relevant or available and the task involved only one image.

We designed BioViL-T with these challenges in mind. Its main components are a multi-image encoder, consisting of both a vision transformer and a convolutional neural network (CNN), and a text encoder. As illustrated in Figure 1, in the multi-image encoder, each input image is first encoded with the CNN model to independently extract findings, such as opacities, present in each medical scan. Here, the CNN counteracts the large data demands of transformer-based architectures through its efficiency in extracting lower-level semantic features.

At the next stage, the features across time points are matched and compared in the vision transformer block, then aggregated into a single joint representation incorporating both current and historical radiological information. It’s important to note that the transformer architecture can adapt to either single- or multi-image scenarios, thereby better handling situations in which past images are unavailable, such as when there’s no relevant image history. Additionally, a cross-attention mechanism across image regions reduces the need for extensive preprocessing, addressing potential variations across images.

In the final stage, the multi-image encoder is jointly trained with the text encoder to match the image representations with their text counterparts using masked modeling and contrastive supervision techniques. To improve text representations and model supervision, we utilize the domain-specific text encoder CXR-BERT-general, which is pretrained on clinical text corpora and built on a clinical vocabulary.
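
For readers who think in code, here is a schematic PyTorch sketch of the multi-image encoder described above. The toy CNN, embedding sizes, time embeddings, and mean-pooled fusion are all illustrative assumptions; this is not the released BioViL-T implementation.

```python
import torch
import torch.nn as nn

class MultiImageEncoderSketch(nn.Module):
    """Schematic sketch of a BioViL-T-style multi-image encoder:
    a CNN extracts per-image features, and a transformer fuses
    current and (optional) prior features across time and space."""

    def __init__(self, dim=128):
        super().__init__()
        # Lightweight CNN standing in for the per-image backbone.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        fuse_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, batch_first=True)
        self.fuser = nn.TransformerEncoder(fuse_layer, num_layers=2)
        self.time_embed = nn.Embedding(2, dim)  # 0 = current, 1 = prior

    def _tokens(self, img, t):
        feats = self.cnn(img)                        # (B, dim, H', W')
        tokens = feats.flatten(2).transpose(1, 2)    # (B, H'*W', dim)
        return tokens + self.time_embed.weight[t]    # mark the time point

    def forward(self, current, prior=None):
        tokens = self._tokens(current, 0)
        if prior is not None:                        # prior image is optional
            tokens = torch.cat([tokens, self._tokens(prior, 1)], dim=1)
        fused = self.fuser(tokens)                   # cross-time attention
        return fused.mean(dim=1)                     # joint representation

enc = MultiImageEncoderSketch()
cur, pri = torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64)
print(enc(cur, pri).shape, enc(cur).shape)  # works with or without a prior
```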

Figure 2: Example of current (left) and prior (right) chest x-ray scans. The attention maps computed within the vision transformer show (in purple) how the model interprets disease progression by focusing on these image regions. In this example, the airspace disease seen in the left lung lobe has improved since the prior acquisition.

Grounded model prediction

In our work, we found that linking multiple images during pretraining makes for both better language and vision representations, enabling the AI model to better associate information present in both the text and the images. This means that when given a radiology report of a chest x-ray, for example, with the description “increased opacities in the left lower lung compared with prior examination,” a model can more accurately identify, locate, and compare findings, such as opacities. This improved alignment between data modalities is crucial because it allows the model to provide more accurate and relevant insights, such as identifying abnormalities in medical images, generating more accurate diagnostic reports, or tracking the progression of a disease over time.

Two findings were particularly insightful for us during our experimentation with BioViL-T:

  • Today’s language-generating AI models are often trained by masking portions of text and then prompting them to fill in the blanks as a means of encouraging the models to account for context in outputting a prediction. We extended the traditional masked language modeling (MLM) approach to be guided by multi-image context, essentially making the approach multimodal. This, in turn, helped us better analyze whether BioViL-T was learning a progression based on provided images or making a random prediction of the masked words based solely on the text context. We gave the model radiology images and reports with progression-related language, such as “improving,” masked. An example input would be “pleural effusion has been [MASKED] since yesterday.” We then tasked the model with predicting the missing word(s) based on single and multi-image inputs. When provided with a single image, the model was unsuccessful in completing the task; however, when provided with a current and prior image, performance improved, demonstrating that the model is basing its prediction on the prior image.
  • Additionally, we found that training on prior images decreases instances of the generative AI model producing ungrounded outputs that seem plausible but are factually incorrect, in this case, when there’s a lack of information. Prior work into radiology report generation utilizes single input images, resulting in the model potentially outputting text that describes progression without having access to past scans. This severely limits the potential adoption of AI solutions in a high-stakes domain such as healthcare. A decrease in ungrounded outputs, however, could enable automated report generation or assistive writing in the future, which could potentially help reduce administrative duties and ease burnout in the healthcare community. Note that these models aren’t intended for any clinical use at the moment, but they’re important proof points to assess the capabilities of healthcare AI.

Moving longitudinal analysis forward

Through our relationships with practicing radiologists and Nuance, we were able to identify and concentrate on a clinically important research problem, finding that accounting for patient history matters if we want to develop AI solutions with value. To help the research community advance longitudinal analysis, we’ve released a new benchmark dataset. MS-CXR-T, which was curated by a board-certified radiologist, consists of current-prior image pairs of chest x-rays labeled with a state of progression for the temporal image classification task and pairs of sentences about disease progression that are either contradictory or capture the same assessment but are phrased differently for the sentence similarity task.

We focused on chest x-rays and lung diseases, but we see our work as having the potential to be extended into other medical imaging settings where analyzing images over time plays an important part in clinician decision-making, such as scenarios involving MRI or CT scans. However far the reach, it’s vital to ensure that models such as BioViL-T generalize well across different population groups and under the various conditions in which medical images are captured. This important part of the journey requires extensive benchmarking of models on unseen datasets. These datasets should widely vary in terms of acquisition settings, patient demographics, and disease prevalence. Another aspect of this work we look forward to exploring and monitoring is the potential role of general foundation models like GPT-4 in domain-specific foundation model training and the benefits of pairing larger foundation models with smaller specialized models such as BioViL-T.

To learn more and to access our text and image models and source code, visit the BioViL-T Hugging Face page and GitHub.

Acknowledgments

We’d like to thank our co-authors: Shruthi Bannur, Stephanie Hyland, Qianchu Liu, Fernando Pérez-García, Maximilian Ilse, Daniel C. Castro, Benedikt Boecking, Harshita Sharma, Kenza Bouzid, Anja Thieme, Anton Schwaighofer, Maria Wetscherek, and Aditya Nori. We’d also like to thank Hoifung Poon, Melanie Bernhardt, Melissa Bristow, and Naoto Usuyama for their valuable technical feedback and Hannah Richardson for assisting with compliance reviews.

MEDICAL DEVICE DISCLAIMER

BioViL-T was developed for research purposes and is not designed, intended, or made available as a medical device and should not be used to replace or as a substitute for professional medical advice, diagnosis, treatment, or judgment.

The post Accounting for past imaging studies: Enhancing radiology AI and reporting appeared first on Microsoft Research.

AI Frontiers: The future of causal reasoning with Emre Kiciman and Amit Sharma

Episode 140 | June 8, 2023

Powerful new large-scale AI models like GPT-4 are showing dramatic improvements in reasoning, problem-solving, and language capabilities. This marks a phase change for artificial intelligence—and a signal of accelerating progress to come.

In this Microsoft Research Podcast series, AI scientist and engineer Ashley Llorens hosts conversations with his collaborators and colleagues about what these new models—and the models that will come next—mean for our approach to creating, understanding, and deploying AI, its applications in areas such as health care and education, and its potential to benefit humanity.

This episode features Senior Principal Researcher Emre Kiciman and Principal Researcher Amit Sharma, whose paper “Causal Reasoning and Large Language Models: Opening a New Frontier for Causality” examines the causal capabilities of large language models (LLMs) and their implications. Kiciman and Sharma break down the study of cause and effect; recount their respective ongoing journeys with GPT-3.5 and GPT-4—from their preconceptions to where they are now—and share their views of a future in which LLMs help bring together different modes of reasoning in the practice of causal inference and make causal methods easier to adopt.

Transcript

[MUSIC PLAYS]

ASHLEY LLORENS: I’m Ashley Llorens with Microsoft Research. I’ve spent the last 20 years working in AI and machine learning, but I’ve never felt more fortunate to work in the field than at this moment. The development of increasingly powerful large-scale models like GPT-4 is accelerating the advancement of AI. These models are exhibiting surprising new abilities like reasoning, problem-solving, and translation across languages and domains. In this podcast series, I’ll share conversations with fellow researchers about our impressions of GPT-4, the work we’re doing to understand its capabilities and limitations, and ultimately how innovations like these can have the greatest benefit for humanity. Welcome to AI Frontiers.

Today we’re talking with Emre Kiciman and Amit Sharma, two Microsoft researchers who have been studying causal reasoning with AI for many years. Determining cause and effect relationships is critically important across many domains such as law, medicine, and the advancement of science itself. Emre and Amit recently published a paper that explores how large language models can advance the research and application of causal reasoning with AI. Emre joins us from our lab in Redmond, Washington, and Amit is on the line from Microsoft Research India, in Bangalore. 


[MUSIC FADES]

Emre, Amit, let’s jump right in. I’m so excited to speak with you both about causal reasoning. And this is such a timely conversation because we’re living through the rise of generative pretrained models, specifically large language models. And when I’ve engaged with GPT-4 in dialogue, depending on what I ask, it can appear to be doing something resembling causal reasoning. And as a machine learning person myself, I have to say this is not something that I’d expected to see from a neural network that works based on analyzing and generating statistical patterns. Um, you know, this is something that before this time last year, I thought of as a uniquely human skill as I think maybe many others have, as well. Now, both of you do this for a living. You study causal reasoning for a living. Um, and so where I’d like to start is with your first reactions to GPT-4, your first contact. What did you find surprising, and how did you feel, uh, as a researcher in this area? I want to go to Emre first on this. 

EMRE KICIMAN: Sure. Well, um, yeah, I think I went through a process. Um, right now, I am surprised how much I’m depending on functionality from GPT-4 and how much I expect it to work. And yet, I also don’t quite believe that it can do the things that it’s doing. It’s really, um, a weird mind space to be in. I think the, the moment when I was a bit astounded by, like, what might be possible was actually before I got my hands on GPT-4 directly. You know, I’ve been hearing that people were very impressed with what it was doing. But the thing that made me reconsider my preconceptions was actually some of the academic research looking at, um, how transformer models and architectures could actually represent Turing machines, Turing-complete computational machines. And once I saw that the transformer architecture could represent that type of program, that type of thing, then I figured, well, all bets are off. We don’t know whether it’s learning this or not, but if it can represent it, now there really is a chance that it could, that it might be learning that. And so we have to really keep an open mind.

The second moment when I changed my mind again about what GPT-4 might be doing … so I’ll give a little background. So once I saw some of the work that we’ll talk about here, uh, coming into play, where we’re seeing GPT do some sorts of, you know, very interesting causal-related tasks, um, I was like, OK, this is great. We have our causal processes; we’re just going to run through them and this fits in. Someone will come with their causal question; we’ll run through and run our, our causal analysis. And I thought that, you know, this all makes sense. We can do things that we want, what we’ve wanted to do for so, for so long. And it was actually reading, uh, some of the vignettes in Peter Lee’s book where he was quizzing, uh, GPT-4 to diagnose a patient based on their electronic health records, explain counterfactual scenarios, um, think through why someone might have made a misdiagnosis. And, and here, all of a sudden, I realized our conceptualizations of causal tasks that we’ve worked on in the academic fields are kind of boxes where we say we’re doing effect inference or we’re doing attribution or we’re doing discovery. These like very well-circumscribed tasks are, are not enough; they’re not flexible enough. Once you have this natural language interface, you can ask so many more things, so many more interesting questions. And we need to make sure that we can formally answer those … correctly answer those questions. And, and this GPT-4 is basically a bridge to expressing and, you know, meeting people where they want to be. That really opened my eyes the second time. 

LLORENS: Thanks, Emre. Amit, first impressions. 

AMIT SHARMA: Yeah, my experience was back in December—I think it was when a lot of people were talking about ChatGPT—and me, thinking that I worked in causality, uh, I was quite smug, right. I knew that causality requires you to have interventional data. Language models are only built on some observations. So I was quite happy to think that I would beat this topic, right. But it was just that every day, I would see, perhaps on Twitter, people expressing new things that ChatGPT can do that one day, I thought, OK, let me just try it, right. So the first query I thought was an easy query for, uh, GPT models. I just asked it, does smoking cause lung cancer, right? And I was surprised when it gave the right answer. But then I thought maybe, oh, this is just too common. Let me ask the opposite. Does lung cancer cause smoking? Uh, it gave the right answer. No. Uh, and then I was literally struck, and I, and I thought, what else can I test, right? And then I thought of the all the causal relationships that we typically talk about in our field, and I started doing them one by one. And what I found was that the accuracy was just astounding. And it was not just the accuracy, but also the explanation that it gives would sort of almost make you believe that as if it is a causal agent, as if it is doing, uh, something causal. So, so to me, I think those few days in December with slightly sleepless nights on what exactly is going on with these models and what I might add … what am I going to do as a researcher now? [LAUGHS] I think that was, sort of, my initial foray into this. And, and I think the logical next step was then to study it more deeply. 

LLORENS: And stemming from both of your reactions, you began collaborating on a paper, which you’ve recently released, called “Causal Reasoning [and] Large Language Models,” um, and I’ve had the, you know, the pleasure of spending some time with that over these last few days and, and a week here. And one of the things you do in the paper is you provide what I think of as a helpful overview of the different kinds of causality. And so, Emre, I want to go back to you. What is causality, and how can we think about the space of different, you know, kinds of causal reasoning?

KICIMAN: Causality … it’s the study of cause-and-effect relationships, of the mechanisms that, that drive, you know, what we see happening in the world around us. You know, why do things happen? What made something happen? And this is a study that spread out across so many disciplines—computer science, economics, health, statistics. Like, everyone cares about, about causality, to some degree. And so this means that there’s many different kinds of, you know, tools and languages to talk about causality, um, that are appropriate for different kinds of tasks. So that’s one of the first things that we thought we had to lay out in the paper, was kind of a very broad landscape about what causality is. And so we talk about a couple of different axes. One is data-driven causal analysis, and the other is logic-based causal reasoning. These are two very different ways of, of, of thinking about causality. And then the second major axis is whether we’re talking about causal relationships in general, in the abstract, like, uh, does smoking normally cause … or often cause cancer? Versus causality in a very specific context— that’s called actual causality. And this is something like Bob smoked; Bob got lung cancer. Was Bob’s lung cancer caused by Bob’s smoking? It’s a very specific question in this very, you know, in, in a specific instance. And so those are the two axes: data-driven versus logic and then general causality versus actual causality. 

LLORENS: Amit, I want to go to you now, and I want to dwell on this topic of actual causality. And I actually learned this phrase from your paper. But I think this is a kind of causal reasoning that people do quite often, maybe even it’s the thing they think about when they think about causal reasoning. So, Amit, you know, let’s go deeper into what actual causality is. Maybe you can illustrate with some examples. And then I want to get into experiments you’ve conducted in this area with GPT-4. 

SHARMA: Sure. So interestingly, actual causality in research is sort of the less talked about. As Emre was saying, I think most researchers in health sciences, economics often talk about general phenomena. But actual causality talks about events and what might have caused them, right. So think about something happens in the real world. So let’s say … I’ll take an example of, let’s say, you catch a ball and you prevent it from falling down, right. And I think people would reasonably argue that your catching the ball was the cause of preventing it from falling onto the ground. But very quickly, these kinds of determinations become complex because what could have been happening is that there could be multiple other factors at play, uh, and there could also be questions about how exactly you’re even thinking about what is a cause. Should, should you be thinking about necessary causes, or should you be thinking about sufficient causes, and so on. So, so I think actual causality before sort of these language models was kind of a paradox in the sense that the applications were kind of everywhere, going from everyday life to even thinking about computer systems. So if your computer system fails, you want to understand why this failure occurred, right. You’re not really interested in why computer systems fail in general; you’re just interested in answering the specific failure’s causes. And the paradox is that even though these sort of questions were so common, I think what research had to offer, uh, was not immediately systemizable or deployable, uh, because you would often sort of tie yourself in knots in defining exactly what you mean by the cause and also sort of how do you even get that framing without sort of just having a formal representation, right. Most of these tasks were in English, right, or in the case of computer systems, you would just get a debug log. So I think one of the hardest problems was how do you take something in vague language, human language, and convert it into sort of logical framing or logical systems? 

LLORENS: In the paper, you explore briefly, you know, kind of actual causality that deals with responsibility or faults. And, you know, this connects with things like, you know, reasoning in the, in the legal domain. And so I just want to, I want to explore that with you. And I know I’ve jumped to the back of the paper. I just find these particular set … this particular set of topics pretty fascinating. And so tell me about the experiments that you’ve conducted where you ask, you know, the, the algorithm … the model to do this kind of actual causal reasoning around assigning blame or responsibility for something? 

SHARMA: So one of the important challenges in actual causality is determining what’s a necessary cause and what’s a sufficient cause for an event, right. Now if you’re familiar with logic, you can break this down into sort of simple predicates. What we are asking is if an event happened, was some action necessary? It means that if that action did not happen, then that event would not happen, right. So we have a nice ”but for” relationship. Sufficiency, on the other hand, is kind of the complement. So there you’re saying if this action happens, the event will always happen, irrespective of whatever else happens in the world, right. And so, so far, in actual causality, people would use logic-based methods to think about what’s the right answer for any kind of event. So what we did was we looked at all the sort of vignettes or these examples that causality researchers had collected over the past decade. All of these are very challenging examples of situations in English language. And I think their purpose was to kind of elucidate the different kinds of sort of gotchas you get when you try to sort of just use the simple concept for real-world applications. So let me take you through one example in our dataset that we studied and how we’re finding that LLMs are somehow able to take this very vague, ambiguous information in an English-language vignette and directly go from sort of that language to an answer in English, right. So in a sense, they’re kind of sidestepping the logical reasoning, but maybe in the future we can also combine logical reasoning and LLMs. 

So let’s take an example. Uh, it’s like Alice catches a ball. The next part on … the next destination on the ball’s trajectory was a brick wall, which would have stopped it, and beyond that there was a window. So as humans, we would immediately think that Alice was not a cause, right, because even if she had not stopped the ball, it would have hit the brick, and so if you’re asking if Alice was the cause of the window being safe, an intuitive answer might be no. But when you analyze it through the necessary and sufficient lens, you would find that Alice was obviously not a necessary cause because the brick wall would have stopped it, but Alice was a sufficient cause, meaning that if Alice had stopped the ball, even if the brick wall collapsed, even if other things happened in the world, the window would still be safe right. So these are the kind of sort of interesting examples that we tried out. And what we found was GPT-3.5, which is ChatGPT, does not do so well. I think it actually fails to identify correctly these causes, but GPT-4 somehow is able to do that. So it gets about 86 percent accuracy on, on this task. And one of the interesting things we were worried about was maybe it’s just memorizing. Again, these are very popular examples in textbooks, right? So we did this fun thing. We just created our own dataset. So, so now instead of Alice catching a ball, Alice could be, I don’t know, dropping a test tube in a lab, right? So we created this sort of a lab setup—a completely new dataset—and we again found the same results that GPT-4 is able to infer these causes. 
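
To make the necessary/sufficient distinction concrete, here is a minimal sketch of the vignette as a tiny structural model, checking but-for necessity and sufficiency by intervention. This is a simplification; formal definitions of actual causality involve considerably more care.

```python
def window_safe(alice_catches, wall_intact):
    """Tiny structural model of the vignette: the window stays safe
    if Alice catches the ball or the brick wall stops it."""
    return alice_catches or wall_intact

actual = dict(alice_catches=True, wall_intact=True)

# But-for necessity: flip Alice's action, hold everything else fixed.
necessary = window_safe(**actual) and not window_safe(
    alice_catches=False, wall_intact=actual["wall_intact"])

# Sufficiency: Alice's action guarantees the outcome under any
# setting of the other variables.
sufficient = all(
    window_safe(alice_catches=True, wall_intact=w) for w in (True, False))

print(necessary, sufficient)  # False, True: not necessary, but sufficient
```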

LLORENS: Now you’re, you’re getting into experimental results, and that’s great because one of the things that I think required some creativity here was how you actually even structure, you know, a rigorous set of experiments. And so, Emre, can you take … take us through the experiment setup and how you had to approach that with this, you know, kind of unique, unique way of assessing causal reasoning? 

KICIMAN: Well, one of the things that we wanted to make sure we had when we were running these experiments is, uh, construct validity to really make sure that the experiments that we were running were testing what we thought they were testing, or at least that we understood what they actually were testing. Um, and so most of these types of, uh, tests over large language models work with benchmark questions, and the biggest issue with the, with many of these benchmark questions is that often the large language models have seen them before. And there’s a concern that rather than thinking through to get the right answer, they’ve really only memorized the specific answers to these, to these specific questions.

And so what we did was, uh, we actually ran a memorization test to see whether the underlying dataset had been memorized by the large language model before. We developed … some of our benchmark datasets we developed, uh, as novel datasets that, you know, had never been written before so clearly had not been seen or memorized. And then we ran additional tests to help us understand what was triggering the specific answers. Like we would redact words from our question, uh, to see what would lead the LLM to make a mistake. So, for example, if we remove the key word from the question, we would expect the LLM to be confused, right. That’s, that’s fine. If we removed an unimportant word, maybe, you know, a participle or something, then we would expect that, that, that, that should be something that the LLM should recover from. And so this was able to give us a better understanding of what the LLM was, was paying attention to. This led us, for example, to be very clear in our paper that in, for example, our causal discovery experiments—where we are specifically asking the LLM to go back to its learned knowledge and tell us whether it knows something from common sense or domain knowledge, whether it’s memorized that, you know, some, uh, some cause, uh, has a particular effect—we are very clear in our experiments that we are not able to tell you what the odds are that the LLM has memorized any particular fact. But what we can say is, given that it’s seen that fact, is it able to transform it, you know, and combine it somehow into the correct answer in a particular context. And so it’s just, it’s really important to, to know what, uh, what these experiments really are testing. So I, I really appreciated the opportunity to go a little bit deeper into these studies.

LLORENS: I find this concept of construct validity pretty fascinating here, and it’s, you know, you, you stressed the importance of it for doing this kind of black-box testing, where you don’t actually have an explicit model for how the, well, the model is doing what it’s doing. And, you know, you talked about memorization as one important test where you’re, you know, you want to, you want to have a valid construct. But I think even deeper than that, there’s, there’s an aspect of your mental model, your beliefs about, you know, what the algorithm is doing and how relevant the testing you’re doing would be to future performance or performance on future tasks. And so I wonder if we can dwell on this notion of construct validity a little bit, maybe even one level deeper than the memorization, you know, you and your mental model of what’s happening there and why that’s important. 

KICIMAN: My mental model of what the large language model is giving us is that it’s read so much of the text out on the internet that it’s captured the common sense and domain knowledge that we would normally expect only a human to do. And through some process—maybe it’s, maybe it’s probabilistic; maybe it’s some more sophisticated reasoning—it’s able to identify, like Amit said, the most important or relevant relationships for a particular scenario. So it knows that, you know, when we’re talking about a doctor washing his or her hands with soap or not, that infection, uh, in a patient is the next … is something that’s really critical. And maybe if we weren’t talking about a doctor, this would not be, you know, the most important consideration. So it is starting from capturing this knowledge, remembering it somehow in its model, and then recognizing the right moment to recall that fact and put it back out there as part of its answer. Um, that’s, that’s my mental model of what I think it’s doing, and we are able to demonstrate with our, you know, experiments that it is transforming from many different input data formats into, you know, answers to our natural language questions. So we, we have data we think it’s seen that’s in tabular format or in graphical formats. Um, and, you know, it’s, it’s impressive to see that it’s able to generate answers to our questions in various natural language forms. 

LLORENS: I want to go now to a different kind of causality, causal discovery, which you describe in your paper as dealing with variables and their effects on each other. Emre, we’ll stick with you. I also think this is a kind of causal reasoning that is perhaps closer to your day job and to the kinds of models you construct in the problems you deal with. So tell me about causal discovery and what you’re seeing in terms of the capabilities of GPT-4 in your experimentation.

KICIMAN: Yeah. So causal discovery is about looking at observational data—where you’re not necessarily intervening on the system; you’re just watching—and, from that, trying to figure out what the causal relationships are among the factors you’re observing. This is usually done in the context of general causality, trying to learn general relationships between factors, and it’s usually done in a data-based way, looking at the statistical covariances among your observations. There are causal discovery algorithms out there; this is something that’s been studied for decades. Essentially, they test statistical independence relationships: if something isn’t causing something else, then if you hold everything constant, there should be statistical independence between those two factors, or different kinds of independence relationships depending on the type of causal structure among the variables. What these classical algorithms can do is get you down to a set of plausible relationships, but there’s always some point at which they can’t distinguish things based on data alone. There are going to be a couple of relationships in your dataset where they might not know whether A is causing B or B is causing A. This is where a human comes in with their domain knowledge and has to declare what they think the right answer is, based on their understanding of the system’s mechanics. So there’s always this reliance on a human coming in with domain knowledge. What we’re seeing now with LLMs, I think, is that for the first time we have some sort of programmatic access to this common sense and domain knowledge, just like in the actual causality setting. And we can push on this further. If we want, we can run our data analysis first, then look to the LLM to disambiguate the last couple of things that we couldn’t get out of the data. But we can also start from scratch and just ask the LLM to orient all of these causal edges and identify the right mechanisms from the beginning, based solely on common sense and domain knowledge.
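For readers who want to see the data-driven side Kiciman describes, here is a minimal sketch of one conditional independence test, a Fisher z-test on partial correlation, assuming roughly linear-Gaussian data. Classical discovery algorithms such as PC run many such tests to prune and orient edges; this is an illustration, not the algorithms used in the paper.

```python
import numpy as np
from scipy import stats

def partial_corr(x, y, z):
    """Correlation of x and y after regressing out the conditioning set z."""
    if z.shape[1] == 0:
        return np.corrcoef(x, y)[0, 1]
    zz = np.column_stack([z, np.ones(len(x))])  # add intercept
    rx = x - zz @ np.linalg.lstsq(zz, x, rcond=None)[0]
    ry = y - zz @ np.linalg.lstsq(zz, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

def ci_test(x, y, z, alpha=0.05):
    """Fisher z-test: True if we cannot reject 'x independent of y given z'."""
    r = partial_corr(x, y, z)
    n, k = len(x), z.shape[1]
    z_stat = np.sqrt(n - k - 3) * 0.5 * np.log((1 + r) / (1 - r))
    p = 2 * (1 - stats.norm.cdf(abs(z_stat)))
    return p > alpha

# Toy chain A -> B -> C: A and C are dependent, but independent given B.
rng = np.random.default_rng(0)
a = rng.normal(size=5000)
b = a + 0.5 * rng.normal(size=5000)
c = b + 0.5 * rng.normal(size=5000)
print(ci_test(a, c, np.empty((5000, 0))))  # False: marginally dependent
print(ci_test(a, c, b.reshape(-1, 1)))     # True: independent given B
```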

And so that’s what we did in our experiments here. We went through lists of edges and then larger graph structures to see how much we could re-create from just the common sense or domain knowledge captured inside the LLM. And it did quite well, beating the state of the art among data-oriented approaches. Now, to be clear, it’s not doing the same task. If you have some data about a phenomenon that’s never been studied before, isn’t well understood, and has never been named, the large language model is not going to be able to tell you—I don’t think it’s going to be able to tell you—what that causal relationship is. But for the many things that we do already know, it beats looking at the data. It’s quite impressive that way. So we think this is super exciting, because it removes a burden that we’ve placed on the human analyst before: now this whole data-driven process can be built off of common sense the model has already captured, without having to ask a user, a human, to type it all up correctly.

LLORENS: Amit, one of the things I found fascinating about the set of experiments you ran here was the prompt engineering: the effect on the experimental results of different ways of prompting the model. Take us through that experience, and please do get specific about the particular prompts you used and their effects on the outcome.

SHARMA: Sure, yeah, this was an iterative exercise for us as well. As I was mentioning, when I started in December, the prompt I used was pretty simple: does changing A cause a change in B? So if you’re thinking of, let’s say, the relationship between altitude and temperature, it would just translate to a single sentence: does changing the altitude change the temperature? As we moved into working on our paper and saw many different prompt strategies from other works, we started experimenting, and one of the most surprising things—actually shocking for us—was the system prompt. In the GPT-3.5 and GPT-4 class of models, there’s a system prompt through which you can give meta-instructions to the model, and we just added a single line saying, “You are an expert in causal reasoning.” It was quite shocking that just that gave us a 5-percentage-point boost in accuracy on the datasets we were testing. So there’s something there about conditioning the model to generate text more attuned to causality, which we found interesting. It also suggests that maybe the language model is not the model here; maybe it’s the prompt plus the language model, meaning that GPT-4 with a great prompt could give you great answers, but there’s a question of the robustness of the prompt as well. The prompt we finally went with was an iteration on this, where instead of asking two questions—because for each pair we can ask, does A cause B, or does B cause A—we made it one prompt: here are two variables, let’s say altitude and temperature; which direction is more likely? We gave it two options, or three in the case where no direction exists. There were two benefits to this. One, this somehow increased the accuracy even more, perhaps because choosing between options becomes easier; you can compare which one is more likely. But we could also now ask the LLM to explain its reasoning. We would literally ask it to explain step by step, following chain-of-thought reasoning, and its answers were very instructive. For example, some of the domains we tested, we don’t know anything about. There was one neuropathic pain dataset with nodes called radiculopathy, DLS, lumbago. We have no idea. But just by looking at the responses from the LLM, you can get a peek into what it’s doing at some high level, and also understand the concepts and think for yourself whether the reasoning makes sense. And of course, we are not experts, so we may be fooled; we might think it’s doing something it isn’t. But imagine a doctor using it, or some other expert. They can get some auxiliary insight, and these explanations also help them debug it. If the explanation seems off or doesn’t make sense, that’s a nice way of knowing when to trust the model and when not to.
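As a concrete illustration of the pairwise prompt Sharma describes, here is a minimal sketch against the OpenAI Python client; the wording of the question and the options is invented for illustration, not the paper’s verbatim prompt.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def causal_direction(var_a: str, var_b: str) -> str:
    """Ask the model to pick the more likely causal direction for a pair."""
    question = (
        f"Here are two variables: {var_a} and {var_b}.\n"
        "Which causal direction is more likely?\n"
        f"(A) {var_a} causes {var_b}\n"
        f"(B) {var_b} causes {var_a}\n"
        "(C) no causal relationship exists\n"
        "Explain your reasoning step by step, then give your final answer."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            # The single system line that gave a roughly 5-point accuracy boost.
            {"role": "system", "content": "You are an expert in causal reasoning."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(causal_direction("altitude", "temperature"))
```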

KICIMAN: One of the things we noticed with these prompts is that there’s more to do in this space, too. The kinds of mistakes the model is making right now are things we think might be resolved, at least in part, with additional prompting or thinking strategies. For example, one of its mistakes was about the relationship between ozone levels and radiation levels, where it didn’t give the answer expected in the benchmark. But it turns out there’s ambiguity in the question: the relationship between ozone and radiation goes one direction if you’re talking about ozone at ground level in a city, and the other direction if you’re talking about ozone in the stratosphere. And you can ask it: is there any ambiguity here? Is there any additional information that would change the direction of the causal mechanism you’re suggesting? And it’ll tell you; it’ll say, if we’re talking about the stratosphere, it’s this; if it’s at ground level, it’s that. So I think we’re going to see some really fun strategies for improving the performance further by digging into these types of interrogations.

LLORENS: You know, the model is a kind of generalist in a way that most people are not, or (I’m just going to go for it) in a way that no person is, with all this knowledge of law and culture and economics and code and so many other things. And I could imagine that showing up and getting a little bit of a primer, a briefing on “here’s why you’re here and what you’re doing,” is helpful for a person; as we see, it’s helpful for these general-purpose reasoners, too. And of course, mechanistically, what we’re doing through the context is inducing a different probability distribution over the tokens. That’s what’s happening here: this is the primer the model gets before it steps into the room and does the Q&A or gives the talk, as we do. But I want to get now into where you see this going from here, for the field and for you as a researcher in the field. Let’s stick with you, Emre. Where do we go from here? What are some of the exciting frontiers?

KICIMAN: What I’m most excited about is the opportunity that’s opening up right now to go fluidly and flexibly back and forth between these different modes of causality, from logic-based reasoning to data-based reasoning, and to go beyond the set tasks that are well defined for us in our field right now. There’s a fun story I heard when I was visiting a university a couple of months ago. We were talking about actual causality and its connections to data-based causality, and this person brought up a scenario where they were an expert witness in a case in which a hedge fund was suing a newspaper. The newspaper had run an exposé of some kind on the hedge fund, scared off all of its investors, and the hedge fund went belly-up. The hedge fund blamed the newspaper and wanted compensation. But at the same time, this was in the middle of a financial crisis, so there’s the question of whether the hedge fund would have failed anyway; a lot of other hedge funds did. Plus there’s the question of how much of an effect newspaper stories like this usually have: could one possibly have killed the hedge fund? And then there are all the questions of normality and morality: maybe this is what the newspaper is supposed to be doing anyway, and it’s not at fault for the consequences. Now you can imagine asking this question starting off in a logical framing of the problem; then, when you get down to the sub-element of what happened to all the other hedge funds—what would have happened to this hedge fund if the newspaper hadn’t written the story?—we can go look at the data on what happened to all the other hedge funds, run the data analysis, and come back. We can go back and forth so much. That kind of flexibility is something I’m really going to be excited to see us able to automate in some fashion.

LLORENS: Amit, what do you think? Where do we go from here? 

SHARMA: Yeah, I think I’m also excited about the practical aspects of how this might transform causal practice. For example, Emre and I have worked a lot on the problem of estimating causal effects, and one challenge that has been in the field for a long time is that we have great methods for estimating a causal effect once we have the graph established, but getting that graph is often a really challenging process. You need domain expertise and human involvement, and often that means a lot of causal analysis simply doesn’t get done, because the upfront cost of building a graph is too high or the system is too complex. The flip side is that it’s also hard to verify. Suppose you assume a graph and then do your analysis; you get some effect (this policy is better, let’s say). It’s very hard to evaluate how good your graph was, and to figure out what robustness checks you can run to validate the conclusion.

And so what I feel the opportunity here is that LLMs are really complementary to what we are already good at in causal inference. We’re good at, given a graph, getting you an estimate using statistics. What the LLMs can come in and do is help domain experts build the graph much, much faster. So now, instead of thinking, “What is my system? What do I need to do?”, maybe there’s documentation of your system somewhere that you just feed into an LLM, and it provides you a candidate graph to start with. And at the same time, on the back end, once you have estimated something, a hard challenge that researchers like us face is finding good robustness checks. One example is a negative control, where you try to think of something that would definitely not cause the outcome; you know it from your domain knowledge. You run your analysis assuming that was the action variable, and your analysis should always give an answer of zero. But figuring out what such variables are is more of an art than a science. In the preliminary experiments we are doing, the LLMs can also help you there: you can give it your graph and your data description, and the LLM can suggest, “Hey, these might be the variables you can use for your robustness check.” So I’m most excited about this possibility of more and more adoption of causal methods, because LLMs can substitute for, or at least help, people in standing up these analyses much faster.
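A minimal sketch of this workflow using the open-source DoWhy library, which Sharma and Kiciman created: the toy dataset, variable names, and graph below are invented, and the placebo refuter shown is one automated version of the “effect should be zero” robustness check described above (this assumes a DoWhy version that accepts DOT-format graph strings).

```python
import numpy as np
import pandas as pd
from dowhy import CausalModel

# Toy dataset (invented for illustration): does a drug affect recovery?
rng = np.random.default_rng(42)
n = 2000
age = rng.normal(50, 10, n)
drug = (rng.random(n) < 1 / (1 + np.exp(-(age - 50) / 10))).astype(int)
recovery = 0.5 * drug - 0.02 * age + rng.normal(0, 1, n)
df = pd.DataFrame({"age": age, "drug": drug, "recovery": recovery})

# A graph like this could come from a domain expert -- or, as discussed
# above, be drafted by an LLM from a description of the system.
model = CausalModel(
    data=df,
    treatment="drug",
    outcome="recovery",
    graph="digraph { age -> drug; age -> recovery; drug -> recovery; }",
)
estimand = model.identify_effect()
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")
print("estimated effect:", estimate.value)

# Robustness check in the spirit of a negative control: replace the
# treatment with a placebo; the re-estimated effect should be near zero.
refutation = model.refute_estimate(
    estimand, estimate, method_name="placebo_treatment_refuter"
)
print(refutation)
```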

LLORENS: Thank you both for this fascinating discussion. Understanding cause-and-effect relationships is such a fundamental part of how we apply human intelligence across so many different domains. I’m really looking forward to tracking your research, and the possibilities for more powerful causal reasoning with AI.

The post AI Frontiers: The future of causal reasoning with Emre Kiciman and Amit Sharma appeared first on Microsoft Research.


Research Focus: Week of June 5, 2023

Microsoft Research Focus 17 | Week of June 5, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

PODCAST 

The GPT-x Revolution in Medicine, with Peter Lee 

Microsoft Research’s Peter Lee recently sat down with physician-scientist Eric Topol on Topol’s Ground Truths podcast to discuss the impact of GPT-4 and large language models in medicine. Drawing from Lee’s recent book, The AI Revolution in Medicine, the conversation covers his early experimentation with GPT-4 and his views on its potential as well as its weaknesses. 

For example: 

  • GPT-4 excels at evaluating and reviewing content, insightfully spotting inconsistencies and missing citations, and perceiving a lack of inclusivity and diversity in terminology 
  • GPT-4 can help reduce medical errors and coach physicians to consider different diagnoses and show greater empathy to patients 
  • GPT-4 has the potential to empower patients with new tools and to democratize access to expert medical information 
  • AI needs appropriate regulation, particularly in the field of medicine 

NEW RESEARCH 

SoK: Let the Privacy Games Begin! A Unified Treatment of Data Inference Privacy in Machine Learning 

Deploying machine learning models in production may allow adversaries to infer sensitive information about training data. Inference risks range from membership inference to data reconstruction attacks. Inspired by the success of games in cryptography for studying security properties, some authors describe privacy inference risks in machine learning using a similar game-based formalism. However, adversary capabilities and goals are often stated in subtly different ways from one presentation to the next, which makes it hard to relate and compose results. 

In a new research paper, SoK: Let the Privacy Games Begin! A Unified Treatment of Data Inference Privacy in Machine Learning, researchers from Microsoft present a game-based framework to systematize the body of knowledge on privacy inference risks in machine learning. In the paper, which was presented at the 2023 IEEE Symposium on Security and Privacy, the authors use this framework to (1) provide a unifying structure for definitions of inference risks, (2) formally establish known relations among definitions, and (3) uncover hitherto unknown relations that would have been difficult to spot otherwise. 
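To give a flavor of the game-based formalism (this is a generic membership-inference game from the literature, not the paper’s exact definitions): a challenger trains a model on a dataset that either does or does not include a target record, and the adversary must guess which. A minimal sketch, where `train_fn` and `adversary_fn` are placeholder callables:

```python
import random

def membership_inference_game(train_fn, adversary_fn, data_pool, trials=100):
    """Generic membership-inference game: the adversary wins if it can tell
    whether the challenger's model was trained with the target record."""
    wins = 0
    for _ in range(trials):
        target = random.choice(data_pool)
        rest = [x for x in data_pool if x is not target]
        b = random.randint(0, 1)               # challenger's secret bit
        train_set = rest + ([target] if b == 1 else [])
        model = train_fn(train_set)            # challenger trains the model
        guess = adversary_fn(model, target)    # adversary sees model + target
        wins += int(guess == b)
    return wins / trials  # 0.5 = random guessing; higher = privacy leakage

# The adversary's advantage is often reported as 2 * (win_rate - 0.5).
```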


NEW RESEARCH 

Analyzing Leakage of Personally Identifiable Information in Language Models

Language models (LMs) are widely deployed to perform many different downstream tasks. However, they have been shown to leak information about training data through sentence-level membership inference and reconstruction attacks. The risk of LMs leaking personally identifiable information (PII) has received comparatively little attention. Dataset curation techniques such as scrubbing reduce, but do not prevent, the risk of PII leakage—in practice, scrubbing is imperfect and must balance the trade-off between minimizing disclosure and preserving the utility of the dataset. It is also unclear to what extent algorithmic defenses such as differential privacy, designed to guarantee sentence- or user-level privacy, prevent PII disclosure.  

In a new research paper, Analyzing Leakage of Personally Identifiable Information in Language Models, researchers from Microsoft introduce rigorous game-based definitions for three types of PII leakage via black-box extraction, inference, and reconstruction attacks with only API access to an LM. In the paper, which was presented at the 2023 IEEE Symposium on Security and Privacy, they empirically evaluate the attacks against GPT-2 models fine-tuned with and without defenses in three domains: case law, health care, and e-mail.  

Their findings show that differential privacy can largely, but not completely, mitigate PII leakage. Traditional data curation approaches such as PII scrubbing are still necessary to achieve sufficient protection. The authors advocate for the design of less aggressive PII scrubbing techniques that account for the protection afforded by DP and achieve a better privacy/utility trade-off. 
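As a toy illustration of what scrubbing involves and why it is imperfect (the patterns and tags below are invented, not the paper’s pipeline):

```python
import re

# Toy scrubber: regex patterns for a few common PII formats. Real scrubbers
# use NER models and still miss PII that lacks a fixed format (names,
# addresses, rare identifiers) -- which is why scrubbing reduces but does
# not prevent leakage.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub(text: str) -> str:
    for tag, pattern in PATTERNS.items():
        text = pattern.sub(f"[{tag}]", text)
    return text

print(scrub("Contact Jane Roe at jane.roe@example.com or 555-867-5309."))
# -> "Contact Jane Roe at [EMAIL] or [PHONE]."  (the name slips through)
```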


NEW RESEARCH 

Automatic Prompt Optimization with “Gradient Descent” and Beam Search

Large Language Models (LLMs) have shown impressive performance as general-purpose agents, but their abilities remain highly dependent on hand-written prompts, which require onerous trial-and-error work. Automatic or semiautomatic procedures would help people write the best prompts while reducing manual effort. In a recent research paper, Automatic Prompt Optimization with “Gradient Descent” and Beam Search, researchers from Microsoft propose a simple and nonparametric solution to this problem. Automatic Prompt Optimization (APO) is inspired by numerical gradient descent to automatically improve prompts, assuming access to training data and an LLM API. The algorithm uses minibatches of data to form natural language “gradients” that criticize the current prompt. The gradients are then “propagated” into the prompt by editing it in the opposite semantic direction of the gradient. These gradient descent steps are guided by a beam search and bandit selection procedure which significantly improves algorithmic efficiency. Preliminary results across three benchmark NLP tasks and the novel problem of LLM jailbreak detection suggest that APO can outperform prior prompt editing techniques and improve an initial prompt’s performance by up to 31%, by using data to rewrite vague task descriptions into more precise annotation instructions. 
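A compressed sketch of the APO loop as described above, with hypothetical `llm` and `score` callables standing in for the real model API and evaluation benchmark; beam search and bandit selection are collapsed into a simple best-of-k choice here:

```python
def apo_step(prompt, minibatch, llm, score, beam_width=4):
    """One 'textual gradient descent' step, simplified. `llm(text) -> str`
    and `score(prompt, examples) -> float` (accuracy in [0, 1]) are assumed."""
    errors = [ex for ex in minibatch if score(prompt, [ex]) == 0]

    # "Gradient": a natural language critique of the current prompt,
    # derived from the examples it gets wrong.
    gradient = llm(
        f"Prompt: {prompt}\nThis prompt fails on: {errors}\n"
        "Describe what is wrong with the prompt."
    )

    # "Propagate" the gradient: edit the prompt in the opposite semantic
    # direction of the critique, producing several candidate rewrites.
    candidates = [
        llm(
            f"Prompt: {prompt}\nCritique: {gradient}\n"
            "Rewrite the prompt to fix the critique."
        )
        for _ in range(beam_width)
    ]

    # Selection: keep the best-scoring candidate (the paper uses beam
    # search plus a bandit procedure to limit evaluation cost).
    return max(candidates + [prompt], key=lambda p: score(p, minibatch))
```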

The post Research Focus: Week of June 5, 2023 appeared first on Microsoft Research.


3D telemedicine brings better care to underserved and rural communities, even across continents

Introduction

Providing healthcare in remote or rural areas is challenging, particularly specialized medicine and surgical procedures. Patients may need to travel long distances just to get to medical facilities and to communicate with caregivers. They may not arrive in time to receive essential information before their medical appointments and may have to return home before they can receive crucial follow-up care at the hospital. Some patients may wait several days just to meet with their surgeon. This is a very different experience from that of urban or suburban residents or people in more developed areas, where patients can get to a nearby clinic or hospital with relative ease.

In recent years, telemedicine has emerged as a potential solution for underserved remote populations. The COVID-19 pandemic, which prevented many caregivers and patients from meeting in person, helped popularize virtual medical appointments. Yet 2D telemedicine (2DTM) fails to fully replicate the experience of a face-to-face consultation.

To improve the quality of virtual care, researchers from Microsoft worked with external partners in Scotland to conduct the first validated clinical use of a novel, real-time 360-degree 3D telemedicine system (3DTM). This work produced three studies beginning in 2020, in which 3DTM based on Microsoft’s Holoportation™ communication technology outperformed a 2DTM equivalent. Building on the success of this research, the collaborators conducted a follow-up trial in 2022 with partners in Ghana, where they demonstrated the first intercontinental use of 3DTM. This research provides critical progress toward increasing access to specialized healthcare for rural and underserved communities.

3DTM beats 2DTM in Scotland trials

The dramatic expansion of virtual medicine helped fill a void created by COVID restrictions, but it also underscored the need for more realistic remote consultations. While 2DTM can extend the reach of specialized medicine, it fails to provide doctors and surgeons with the same quantity and quality of information they get from an in-person consultation. Previous research efforts had theorized that 3DTM could raise the bar, but the advantages were purely speculative. Until now, real-time 3DTM had been proposed within a research setting only, because of constraints on complexity, bandwidth, and technology.

In December 2019, researchers from Microsoft began discussing the development of a 3DTM system leveraging Microsoft Holoportation™ communication technology with collaborators from the Canniesburn Plastic Surgery Unit in Glasgow, Scotland, and Korle Bu Teaching Hospital (KBTH) in Accra, Ghana.

With the emergence of COVID-19 in early 2020, this effort accelerated as part of Microsoft Research’s COVID response, with the recognition that it would allow patients, including those with weakened immune systems, to visit a specialist remotely from the relative safety of a local physician’s office, rather than having to travel to the specialist at a hospital with all the concurrent risk of infection.

The initial research included a deployment in Scotland, with 10 specialized cameras capturing patient images, combining them into a 3D model, and transmitting the 3D image to a medical professional. The patient could view the same images as their doctor, which allowed them to discuss them in real time—almost as if they were in the same room.

Figure 1: A patient participates in a consultation with doctors using the 3D Telemedicine system. The screen allows the patient to view the same images as the clinician.

This work produced three separate studies: a clinician feedback study (23 clinicians, November–December 2020), a patient feedback study (26 patients, July–October 2021), and a study focusing on safety and reliability (40 patients, October 2021–March 2022).

Participatory testing demonstrated improved patient metrics with 3DTM versus 2DTM. Although patients still prefer face-to-face visits, 3DTM was rated significantly higher than 2DTM. Overall patient satisfaction increased to 88 percent with 3DTM from 51 percent with 2DTM; realism, or “presence,” rated higher at 80 percent for 3DTM versus 53 percent for 2DTM; and quality as measured by a Telehealth Usability Questionnaire came in at 85 percent for 3DTM compared with 77 percent for 2DTM. Safety and clinical concordance of 3DTM with a face-to-face consultation were 95 percent – equivalent to or exceeding estimates for 2DTM.

Figure 2: In three studies produced during a trial in Scotland, 3D telemedicine outperformed 2D telemedicine in satisfaction, realism and quality, with a direct correlation between realism and satisfaction.

One of the ultimate goals of telemedicine is to bring the quality of remote consultations closer to face-to-face experiences. This data provides the first evidence that Microsoft’s Holoportation™ communication technology moves 3DTM closer to this goal than a 2D equivalent.

“We showed that we can do it using off-the-shelf components, making it affordable. And we can deploy it and make it reliable enough so that a doctor or a clinical team could use it to conduct consultations,” said Spencer Fowers, Principal Researcher at Microsoft Research.

Ghana study: 3DTM brings doctors and patients closer

After the successful deployment in Scotland, the team turned its focus to Ghana. The research team visited KBTH in February 2022, beginning the collaboration on the next phase of the project and the installation of the first known 3D telemedicine system on the African continent.

Ghana has a population of 31 million people but only 16 reconstructive surgeons, 14 of whom work at KBTH, one of the largest hospitals in West Africa and the country’s main hospital for reconstructive surgery and burn treatment. Traveling to Accra can be difficult for people who live in rural areas of Ghana; it may require a 24-hour bus ride just to get to the clinic. Some patients can’t stay long enough to receive follow-up care or adequate pre-op preparation and counseling. Many people in need of surgery never receive treatment, and those who do may receive incomplete or sub-optimal follow-up care: they show up, have surgery, and go home.

“As a doctor, you typically take it for granted that a patient will come back to see you if they have complications. These are actually very complex operations. But too often in Ghana, the doctors may never see the patient again,” said Steven Lo, a reconstructive surgeon at the Canniesburn Plastic Surgery and Burns Unit in Scotland. Lo has worked for years with KBTH and was the project’s clinical lead in Glasgow.

The researchers worked with surgical team members in Scotland and Ghana to build a portable system with enhanced lighting and camera upgrades compared to the original setup deployed in Scotland. This system would enable patients to meet in 3D with doctors in Scotland and in Ghana, both before and after their surgeries, using Microsoft Holoportation™ communication technology.

Figure 3: As part of a multidisciplinary team (MDT), doctors in Glasgow visit with patients virtually both before and after their in-person visits at the clinic in Accra. Clinicians in Accra manage follow-up care on site.

The results were multiple successful multidisciplinary team (MDT) engagements—both pre-operative and post-operative—supporting surgeries led by visiting doctors from Scotland at KBTH. The 3DTM system using Microsoft Holoportation™ communication technology helped doctors communicate to patients precisely what their surgery would entail ahead of time and then ensure that patients had access to any necessary follow-up procedures and post-operation therapy. The medical team in Glasgow used Microsoft Holoportation™ communication technology to manipulate and mark up 3D images of their patients. Patients watching from Accra could visualize the procedure, including the exact locations where the surgical incisions would occur.

Figure 4: Compared with the traditional approach, the international 3D workflow adds a pre-visit 3D MDT meeting before the on-site clinic and post-operative virtual MDT meetings, rather than relying solely on local follow-up. This enables better planning, safety, and integration among the international team, plus better patient education and follow-up care.

For a patient who came to KBTH to address a chronic problem with his jaw, this visualization gave him a much better understanding than he had had with previous surgeries, said Levi Ankrah, a reconstructive surgeon at KBTH who participated in the remote consultations and the surgeries in Ghana.

“These are quite complex things to explain. But when the patient could actually see it for himself from the outside, that helped him feel more involved with his care and his follow-up plan,” Ankrah said.

Figure 5: A 3D consultation between a patient in Ghana using “the rig” and doctors in Scotland, who can see the patient and transmit details about his upcoming surgery.

Conclusion

One of the ultimate goals of telemedicine is for remote consultations to approach the quality of face-to-face experiences. The data presented in this research suggests significant progress toward that goal, which is particularly relevant to specialties with a strong 3D focus, such as reconstructive surgery.

Nothing can replace the authenticity and confidence that come from a face-to-face visit with a doctor. But 3DTM shows great promise as a potential state-of-the-art solution for remote telemedicine, replacing current 2DTM virtual visits and driving better access and outcomes for patients.

Acknowledgments

We would like to acknowledge the following contributors to this project: Andrea Britto; Thiago Spina; Ben Cutler; Chris O’Dowd; Amber Hoak; Spencer Fowers; David Tittsworth; Whitney Hudson; Steven Lo, Canniesburn Regional Plastic Surgery and Burns Unit, Glasgow; Kwame Darko, Levi Ankrah, and Opoku Ampomah, National Reconstructive Plastic Surgery and Burns Center, Korle Bu Teaching Hospital, Accra. 

Additional thanks to: Korle Bu Teaching Hospital, NHS Scotland West of Scotland Innovation Hub, Canniesburn Plastic Surgery and Burns Unit.

Two pictures showcasing participants of the project. In the first picture, six men and a woman are captured. The group consists of Microsoft staff members, representatives from the Korle Bu Teaching Hospital, and the surgical team from NHS Glasgow. The second picture features members of the Microsoft team, three men and a woman, alongside a doctor from the Korle Bu Teaching Hospital, all attired in surgical garments.
Figure 6: Two views of medical team members. On the left (from left to right): Daniel Dromobi Nii Ntreh, Thiago Spina, Spencer Fowers, Chris O’Dowd, Steven Lo, Arnold Godonu, Andrea Britto. On the right, in medical gear (from left to right): Chris O’Dowd, Kwame Darko, Thiago Spina, Andrea Britto and Spencer Fowers.

The post 3D telemedicine brings better care to underserved and rural communities, even across continents appeared first on Microsoft Research.


Research Focus: Week of May 22, 2023

Microsoft Research Focus 16 | Week of May 22, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

NEW RESEARCH

Causal Reasoning and Large Language Models: Opening a New Frontier for Causality

Emre Kıcıman, Robert Ness, Amit Sharma, Chenhao Tan

Recent advances in scaling large language models (LLMs) have led to breakthroughs in AI capabilities, including writing code in programming languages, generating stories, poems, essays, and other texts, and strong performance in certain reasoning tasks. LLMs can even create plausible explanations for their outputs, and update their conclusions given new evidence.

At the same time, LLMs can make absurd claims and basic errors of logic, mathematics, and complex reasoning, which raises questions about their applicability in societally impactful domains such as medicine, science, law, and policy.

In a new paper: Causal Reasoning and Large Language Models: Opening a New Frontier for Causality, researchers from Microsoft examine the causal capabilities of LLMs. They find that LLMs, on average, can outperform state-of-the-art causal algorithms in graph discovery and counterfactual inference, and can systematize nebulous concepts like necessity and sufficiency of cause by operating solely on natural language input. They show that by capturing commonsense and domain knowledge about causal mechanisms, LLMs open new frontiers for advancing the research, practice, and adoption of causality. The researchers envision pairing LLMs alongside existing causal methods to reduce the required manual effort that has been a major impediment to widespread adoption of causal analysis. 

NEW RESEARCH

DNA storage in thermoresponsive microcapsules for repeated random multiplexed data access

The world generates more and more data, but data storage capacity has not kept pace. Traditional long-term storage media such as hard disks or magnetic tape have limited durability and storage density. DNA, by contrast, has an intrinsic capacity for information storage, durability, and high information density.

In DNA data storage, a large amount of data is stored together, and it is important to perform random access: selective retrieval of individual data files. This is achieved using polymerase chain reaction (PCR), a molecular process that can exponentially amplify a target file. However, this process can damage the data and cause errors, and PCR amplification of multiple files simultaneously creates serious undesired DNA crosstalk. Currently, one can read only a single file at a time, not a subset of files in a larger set.

In a recent paper, DNA storage in thermoresponsive microcapsules for repeated random multiplexed data access, researchers from Microsoft and external colleagues report on their work to develop microcapsule-based PCR random access. By encapsulating one file per capsule, DNA files were physically separated, reducing undesired crosstalk. This enabled the simultaneous reading of all 25 files in the pool without significant errors. The use of microcapsules also allowed DNA files to be recovered after random access, addressing the destructive-reads problem and potentially making DNA data storage more economical.


MICROSOFT RESEARCH TALK

Human-centered AI with Ben Shneiderman, Distinguished University Professor—University of Maryland Department of Computer Science

A new synthesis is emerging that integrates AI technologies with human-computer interaction (HCI) to produce human-centered AI (HCAI). Advocates of HCAI seek to amplify, augment, and enhance human abilities, so as to empower people, build their self-efficacy, support creativity, recognize responsibility, and promote social connections. Researchers, developers, business leaders, policy makers, and others are expanding the technology-centered scope of AI to include HCAI ways of thinking.

In this recent Microsoft Research Talk: Human-Centered AI: Ensuring Human Control While Increasing Automation, Ben Shneiderman discusses his HCAI framework, design metaphors, and governance structures, and other ideas drawn from his award-winning new book Human-Centered AI. The talk by Shneiderman, a Distinguished University Professor in the University of Maryland Department of Computer Science, is hosted by Mary Czerwinski, Partner Researcher and Research Manager with Microsoft Research.


OPPORTUNITIES

AI and the New Future of Work – call for proposals

The Microsoft New Future of Work Initiative is now accepting proposals to fund academic projects that help maximize the impact of LLMs and related AI systems on how work gets done. This call for proposals targets work that specifically supports the use of LLMs in productivity scenarios. The program plans to distribute five $50,000 USD unrestricted awards to support creative research that redefines what work might mean in various contexts. 

For example: How can we ensure these new technologies truly accelerate productivity rather than having only marginal effects? How can LLMs achieve these gains by augmenting human labor? What is the future of a ‘document’ in a world where natural language can be so easily remixed and repurposed?  

Proposals will be accepted through June 5, 2023.

The post Research Focus: Week of May 22, 2023 appeared first on Microsoft Research.


REACT — A synergistic cloud-edge fusion architecture

This research paper was accepted by the eighth ACM/IEEE Conference on Internet of Things Design and Implementation (IoTDI), a premier venue for research on the Internet of Things. The paper describes a framework that leverages cloud resources to execute large, high-accuracy deep neural network (DNN) models in order to improve the accuracy of models running on edge devices.


Leveraging the cloud and edge concurrently

The internet is evolving towards an edge-computing architecture to support latency-sensitive DNN workloads in the emerging Internet of Things and mobile computing domains. However, unlike cloud environments, the edge has limited computing resources and cannot run large, high-accuracy DNN models. As a result, past work has focused on offloading some of the computation to the cloud to get around this limitation. However, this comes at the cost of increased latency.

For example, in edge video analytics use cases, such as road traffic monitoring, drone surveillance, and driver-assist technology, one can transmit occasional frames to the cloud to perform object detection—a task ideally suited to models hosted on powerful GPUs. The edge, meanwhile, handles the intermediate frames by interpolating them through object tracking—a relatively inexpensive computational task performed using general-purpose CPUs, a low-powered edge GPU, or other edge accelerators (e.g., the Intel Movidius Neural Compute Stick). However, for most real-time applications, processing data in the cloud alone is infeasible due to strict latency constraints.

In our research paper, REACT: Streaming Video Analytics On The Edge With Asynchronous Cloud Support, we propose and demonstrate a novel architecture that leverages both the edge and the cloud concurrently to perform redundant computations at both ends. This helps retain the low latency of the edge while boosting accuracy with the power of the cloud. Our key technical contribution is in fusing the cloud inputs, which are received asynchronously, into the stream of computation at the edge, thereby improving the quality of detection without sacrificing latency.

Fusing edge and cloud detections

Figure 1(a): Orange and green boxes indicate detections from the edge and the cloud, respectively. Edge detections are available immediately, while cloud detections arrive after some delay. Intermediate frames rely on object tracking, whose performance degrades with every frame, indicated by the fading shades of blue.
Figure 1(b): On dashcam footage, REACT uses asynchronous cloud detections to correct box labels and detect more objects.

We illustrate our fusion approach in REACT for object detection in videos. Figure 1 shows the result of object detection using a lightweight edge model, which suffers from both missed objects (e.g., cars in Frame 1 are not detected) and misclassified objects (e.g., the van on the right of the frame is misclassified as a car).

To address the challenges of limited edge computation capacity and the drop in accuracy from using edge models, we follow a two-pronged approach. First, since the sequence of video frames is spatiotemporally correlated, it suffices to run edge object detection only once every few frames. As illustrated in Figure 1(a), edge detection runs every fifth frame; to cover the intermediate frames, we employ the comparatively lightweight operation of object tracking. Second, to improve the accuracy of inference, select frames are asynchronously transmitted to the cloud for inference. Depending on network delay and the availability of cloud resources, cloud detections reach the edge device only after a few frames. Next, the newer cloud detections—previously undetected—are merged with the current frame. To do this, we feed the cloud detection, which was made on an old frame, into another instance of the object tracker to “fast forward” to the current time. The newly detected objects can then be merged into the current frame, so long as the scene does not change abruptly. Figure 1(b) shows a visual result of our approach on a dashcam video dataset.

Here’s a more detailed description of how REACT combines the edge and cloud detections. Each detection contains objects represented by a ⟨class_label, bounding_box, confidence_score⟩ tuple. Whenever we receive a new detection (either edge or cloud), we purge from the current list the objects that were previously obtained from the same detection source. Then we form a zero matrix of size (c, n), where c and n are the numbers of detections in the current list and from the new source, respectively. We populate each matrix cell with the Intersection over Union (IoU) value of the corresponding current and new detections, if it is greater than 0.5. We then perform a linear sum assignment, which matches pairs of objects with the maximum overlap. For overlapped objects, we update the confidence values, bounding box, and class label based on the new detection’s source. Specifically, our analysis reveals that edge detection models can correctly localize objects but often have false positives, assigning class labels incorrectly; in contrast, cloud detections have higher localization error but lower error in class labels. Finally, newer (unmatched) objects are added to the list of current objects with the returned confidence values, bounding boxes, and class labels. Thus, REACT’s fusion algorithm must consider multiple cases, such as misaligned bounding boxes and class label mismatches, to consolidate the edge and cloud detections into a single list.
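A condensed sketch of this matching step (the per-source reconciliation of labels and confidences is simplified to taking the new detection’s fields; the IoU threshold follows the description above):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def fuse(current, new):
    """Match current and new detections by IoU; merge matched pairs, then
    append unmatched new objects. Each detection is a dict with keys
    'box', 'label', 'score'."""
    m = np.zeros((len(current), len(new)))
    for i, c in enumerate(current):
        for j, d in enumerate(new):
            overlap = iou(c["box"], d["box"])
            if overlap > 0.5:          # only consider substantial overlaps
                m[i, j] = overlap
    rows, cols = linear_sum_assignment(-m)  # maximize total IoU
    matched = set()
    for i, j in zip(rows, cols):
        if m[i, j] > 0:  # matched pair: simplified to take the new fields
            current[i].update(new[j])
            matched.add(j)
    # Newly detected (unmatched) objects join the current list.
    current.extend(d for j, d in enumerate(new) if j not in matched)
    return current
```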

Detector       Backbone      Where  #params
Faster R-CNN   ResNet50-FPN  Cloud  41.5M
RetinaNet      ResNet50-FPN  Cloud  36.1M
CenterNet      DLA34         Cloud  20.1M
TinyYOLOv3     DN19          Edge   8.7M
SSD            MobileNetV2   Edge   3.4M
Table 1: Models used in our evaluation

In our experimentation, we leveraged state-of-the-art computer vision algorithms for obtaining object detections at the edge and the cloud (see Table 1). Further, we use mAP@0.5 (mean average precision at 0.5 IoU), a metric popular in the computer vision community for measuring the performance of object detectors. To evaluate the efficacy of REACT, we looked at two datasets:

  1. VisDrone: drone-based surveillance
  2. D2City: dashcam-based driver assist

Based on our evaluation, we observed that REACT outperforms baseline algorithms by as much as 50%. Also, we noted that edge and cloud models can complement each other, and overall performance improves due to our edge-cloud fusion algorithm.

As already noted, the object detector runs only once every few frames, and lightweight object tracking is performed on the intermediate frames. Running detection redundantly at both the edge and the cloud allows an application developer to flexibly trade off the frequency of edge versus cloud executions while achieving the same accuracy, as shown in Figure 2. For example, if the edge device experiences thermal throttling, we can pick a lower edge detection frequency (say, once every 20 frames) and complement it with cloud detection once every 30 frames to get an mAP@0.5 of around 22.8. If there are fewer constraints at the edge, we can increase the edge detection frequency to once every five frames and reduce cloud detections to once every 120 frames to get similar performance (mAP@0.5 of 22.7). This provides a playground for fine-grained programmatic control.

Figure 2: mAP@0.5 values for varying cloud and edge detection frequencies on the D2-City dataset; similar shading corresponds to similar mAP@0.5. Higher detection rates yield higher accuracy, and the heatmap highlights the trade-off: to maintain accuracy, one can increase cloud detection frequency while reducing edge detection frequency, and vice versa.

Further, one can amortize the cost of using cloud resources over multiple edge devices by having them share the same cloud-hosted model. Specifically, if an application can tolerate a median latency of up to 500 ms, we can support over 60 concurrent devices at a time using a V100 GPU (Figure 3).

Figure 3: Median (50th percentile) response time versus the number of edge devices concurrently sharing a cloud GPU, shown for four GPU types. Response time increases slowly until a knee point, after which it rises sharply.

Conclusion

REACT represents a new paradigm of edge + cloud computing that leverages the resources of each to improve accuracy without sacrificing latency. As we have shown above, the choice between offloading and on-device inference is not binary; redundant execution at cloud and edge locations lets the two complement each other when carefully employed. While we have focused on object detection, we believe this approach could be employed in other contexts, such as human pose estimation and instance and semantic segmentation, to get the “best of both worlds.”

The post REACT — A synergistic cloud-edge fusion architecture appeared first on Microsoft Research.
