Abstracts: November 14, 2024

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Microsoft Senior Researcher Tong Wang joins guest host Bonnie Kruft, partner and deputy director of Microsoft Research AI for Science, to discuss “Ab initio characterization of protein molecular dynamics with AI2BMD.” In the paper, which was published by the scientific journal Nature, Wang and his coauthors detail a system that leverages AI to advance the state of the art in simulating the behavior of large biomolecules. AI2BMD, which is generalizable across a wide range of proteins, has the potential to advance solutions to scientific problems and enhance biomedical research in drug discovery, protein design, and enzyme engineering.

Transcript

[MUSIC]

BONNIE KRUFT: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES] 

I’m Bonnie Kruft, partner and deputy director of Microsoft Research AI for Science and your host for today. Joining me is Tong Wang, a senior researcher at Microsoft. Tong is the lead author of a paper called “Ab initio characterization of protein molecular dynamics with AI2BMD,” which has just been published by the top scientific journal Nature. Tong, thanks so much for joining us today on Abstracts!


TONG WANG: Thank you, Bonnie.

KRUFT: Microsoft Research is one of the earliest institutions to apply AI in biomolecular simulation research. Why did the AI for Science team choose this direction, and—with this work specifically, AI2BMD—what problem are you and your coauthors addressing, and why should people know about it?

WANG: So as Richard Feynman famously said, “Everything that living things do can be understood in terms of the jigglings and the wigglings of atoms.” To study the mechanisms behind the biological processes and to develop biomaterials and drugs requires a computational approach that can accurately characterize the dynamic motions of biomolecules. When we review the computational research for biomolecular structure, we can get two key messages. First, in recent years, predicting the crystal, or static, protein structures with methods powered by AI has achieved great success and just won the Nobel Prize in Chemistry in the last month. However, characterizing the dynamic structures of proteins is more meaningful for biology, drug, and medicine fields but is much more challenging. Second, molecular dynamics simulation, or MD, is one of the most widely used approaches to study protein dynamics, which can be roughly divided into classical molecular dynamics simulation and quantum molecular dynamics simulation. Both approaches have been developed for more than a half century and won Nobel Prize. Classical MD is fast but less accurate, while quantum MD is very accurate but computationally prohibitive for the protein study. However, we need both the accuracy and the efficiency to detect the biomechanisms. Thus, applying AI in biomolecular simulation can become the third way to achieve both ab initio—or first principles—accuracy and high efficiency. In the winter of 2020, we have foreseen the trend that AI can make a difference in biomolecular simulations. Thus, we chose this direction.

KRUFT: It took four years from the idea to the launch of AI2BMD, and there were many important milestones along the way. First, talk about how your work builds on and/or differs from what’s been done previously in this field, and then give our audience a sense of the key moments and challenges along the AI2BMD research journey.

WANG: First, I’d like to say applying AI in biomolecular simulation is a novel research field. For AI-powered MD simulation for large biomolecules, there is no existing dataset, no well-designed machine learning model for the interactions between the atoms and the molecules, no clear technical roadmap, no mature AI-based simulation system. So we face various new challenges every day. Second, there are some other works exploring this area at the same time. I think a significant difference between AI2BMD and other works is that other works require to generate new data and train the deep learning models for any new proteins. So it takes a protein-specific solution. As a contrast, AI2BMD proposes a generalizable solution for a wide range of proteins. To achieve it, as you mentioned, there are some key milestones during the four-year journey. The first one is we proposed the generalizable protein fragmentation approach that divides proteins into the commonly used 20 kinds of dipeptides. Thus, we don’t need to generate data for various proteins. Instead, we only need to sample the conformational space of such dipeptides. So we built the protein unit dataset that contains about 20 million samples with ab initio accuracy. Then we proposed ViSNet, the graph neural network for molecular geometry modeling as the machine learning potential for AI2BMD. Furthermore, we designed AI2BMD simulation system by efficiently leveraging CPUs and GPUs at the same time, achieving hundreds of times simulation speed acceleration than one year before and accelerating the AI-driven simulation with only ten to a hundred millisecond per simulation step. Finally, we examined AI2BMD on energy, force, free energy, J coupling, and many kinds of property calculations for tens of proteins and also applied AI2BMD in the drug development competition. All things are done by the great team with science and engineering expertise and the great leadership and support from AI for Science lab.

KRUFT: Tell us about how you conducted this research. What was your methodology?

WANG: As exploring an interdisciplinary research topic, our team consists of experts and students with biology, chemistry, physics, math, computer science, and engineering backgrounds. The teamwork with different expertise is key to AI2BMD research. Furthermore, we collaborated and consulted with many senior experts in the molecular dynamics simulation field, and they provided very insightful and constructive suggestions to our research. Another aspect of the methodology I’d like to emphasize is learning from negative results. Negative results happened most of the time during the study. What we do is to constantly analyze the negative results and adjust our algorithm and model accordingly. There’s no perfect solution for a research topic, and we are always on the way.

KRUFT: AI2BMD got some upgrades this year, and as we mentioned at the top of the episode, the work around the latest system was published in the scientific journal Nature. So tell us, Tong—what is new about the latest AI2BMD system? 

WANG: Good question. We posted a preliminary version of AI2BMD manuscript on bioRxiv last summer. I’d like to share three important upgrades through the past one and a half year. The first is hundreds of times of simulation speed acceleration for AI2BMD, which becomes one of the fastest AI-driven MD simulation system and leads to perform much longer simulations than before. The second aspect is AI2BMD was applied for many protein property calculations, such as enthalpy, heat capacity, folding free energy, pKa, and so on. Furthermore, we have been closely collaborating with the Global Health Drug Discovery Institute, GHDDI, a nonprofit research institute founded and supported by the Gates Foundation, to leverage AI2BMD and other AI capabilities to accelerate the drug discovery processes.

KRUFT: What significance does AI2BMD hold for research in both biology and AI? And also, what impact does it have outside of the lab, in terms of societal and individual benefits?

WANG: Good question. For biology, AI2BMD provides a much more accurate approach than those used in the past several decades to simulate the protein dynamic motions and study the bioactivity. For AI, AI2BMD proves AI can make a big difference to the dynamic protein structure study beyond AI for the protein static structure prediction. Raised by AI2BMD and other works, I can foresee there is a coming age of AI-driven biomolecular simulation, providing binding free-energy calculation with quantum simulation accuracy for the complex of drug and the target protein for drug discovery, detecting more flexible biomolecular conformational changes that molecular mechanics cannot do, and opening more opportunities for enzyme engineering and vaccine and antibody design.

KRUFT: AI is having a profound influence on the speed and breadth of scientific discovery, and we’re excited to see more and more talented people joining us in this space. What do you want our audience to take away from this work, particularly those already working in the AI for Science space or looking to enter it?

WANG: Good question. I’d like to share three points from my research experience. First is aim high. Exploring a disruptive research topic is better than doing 10 incremental works. In the years of research, our organization always encourages us to do the big things. Second is persistence. I remembered a computer scientist previously said about 90% of the time during research is failure and frustration. The rate is even higher when exploring a new research direction. In AI2BMD study, when we suffered from research bottlenecks that cannot be tackled for several months, when we received critical comments from reviewers, when some team members wanted to give up and leave, I always encourage everyone to persist, and we will make it. More importantly, the foundation of persistence is to ensure your research direction is meaningful and constantly adjust your methodology from failures and critical feedback. The third one is real-world applications. Our aim is to leverage AI for advancing science. Proposing scientific problems is a first step, then developing AI tools and evaluating on benchmarks and, more importantly, examining its usefulness in the real-world applications and further developing your AI algorithms. In this way, you can close the loop of AI for Science research.

KRUFT: And, finally, Tong, what unanswered questions or unsolved problems remain in this area, and what’s next on the agenda for the AI2BMD team?

WANG: Well, I think AI2BMD is a starting point for the coming age of AI-driven MD for biomolecules. There are lots of new scientific questions and challenges coming out in this new field. For example, how to expand the simulated molecules from proteins to other kinds of biomolecules; how to describe the biochemical reactions during the simulations; how to further improve the simulation efficiency and robustness; and how to apply it for more real-world scenarios. We warmly welcome any people from both academic and industrial fields to work together with us to make the joint efforts to push the frontier of this new field moving forward.

[MUSIC]

KRUFT: Well, Tong, thank you for joining us today, and to our listeners, thanks for tuning in. If you want to read the full paper on AI2BMD, you can find a link at aka.ms/abstracts, or you can read it on the Nature website. See you next time on Abstracts!

[MUSIC FADES]

The post Abstracts: November 14, 2024 appeared first on Microsoft Research.

Toward modular models: Collaborative AI development enables model accountability and continuous learning

Today, development of generalizable AI models requires access to sufficient data and compute resources, which may create challenges for some researchers. Democratizing access to technology across the research community can advance the development of generalizable AI models. By applying the core software development concept of modularity to AI, we can build models that are powerful, efficient, adaptable, and transparent. 

Until recently, AI models were primarily built using a monolithic architecture. Though powerful, these models can be challenging to customize and edit compared to modular models with easily interpretable functional components. Today, developers employ modularity to make services more reliable, faster to refine, and easier for multiple users to contribute to simultaneously. One promising research direction that supports this involves shifting AI development towards a modular approach, which could enhance flexibility and improve scalability.

One such approach is to use numerous fine-tuned models designed for specific tasks, known as expert models, and coordinate them to solve broader tasks (see “Towards Modular LLMs by Building and Reusing a Library of LoRAs” and “Learning to Route Among Specialized Experts for Zero-Shot Generalization”). These expert models can be developed in a decentralized way. Similar to the benefits of using a microservice architecture, this modular AI approach can be more flexible, cheaper to develop, and more compliant with relevant privacy and legal policies. However, while substantial research has been done on training optimization, coordination methods remain largely unexplored.

Our team is exploring the potential of modular models by focusing on two themes: i) optimizing the training of expert models and ii) refining how expert models coordinate to form a collaborative model. One method for coordinating expert models is to adaptively select the most relevant independently developed expert models for specific tasks or queries. This approach, called MoErging, is similar to Mixture-of-Experts (MoE) approaches but differs in that the routing mechanism is learned after the individual experts are trained. As an initial step, we contributed to creating a taxonomy for organizing recent MoErging methods with the goal of helping establish a shared language for the research community and facilitating easier and fairer comparisons between different methods. 

Assessing existing MoErging methods

Most MoErging methods were developed within the past year, so they don’t reference each other and are difficult to compare. To enable comparison of MoErging methods, we recently collaborated on a survey that establishes a taxonomy for comparing methods and organizes MoErging design choices into three steps:

  • Expert design: Identifies and uses expert models trained asynchronously by distributed contributors. 
  • Routing design: Routes tasks to the appropriate expert models. 
  • Application design: Applies the merged models to specific tasks or domains. 

Each step is broken down into more detailed choices. For example, in expert design, expert training can be custom or standard, and training data can be private or shared. Custom training means the MoErging method requires experts to be trained with a specific procedure, while standard training does not. Similarly, shared data means that the training data must be accessible for routing. Otherwise, the training data is considered private.

The benefits of modular models discussed below assume that training data doesn’t need to be shared. However, a review of current MoErging methods finds that some approaches do require sharing training data, making certain benefits no longer applicable. 

The survey evaluates 29 different MoErging methods using its taxonomy, which categorizes the design choices into two expert design choices, five routing design choices, and two application design options, shown in Figure 1.

Figure 1: Taxonomy of model MoErging design choices. References in the leaf nodes link to sections of specific papers that implement each choice. We omit references to methods where a particular choice is not applicable. 

One takeaway from the survey is that most MoErging methods can be grouped into four categories based on their routing design choices:

  1. Classifier-based routing: Methods that train the router as a classifier using expert datasets or unseen data. 
  2. Embedding-based routing: Methods that compute embeddings of expert training sets and compare them to a query embedding for routing. 
  3. Nonrouter methods: Methods that do not explicitly train a router but instead initialize the router in an unsupervised manner.  
  4. Task-specific routing: Methods that learn a task-specific routing distribution over the target dataset to improve performance on a specific task. 

While the differences within each category are minor, the differences across categories are significant because they determine the level of data access required for implementation. As a result, data access is a primary factor in determining which methods are applicable and feasible in various settings. 
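To make the embedding-based routing category more concrete, here is a minimal sketch in Python. It is an illustration only, not the implementation of any method in the survey: the function names (`mean_embedding`, `route`), the use of a mean embedding per expert, the cosine similarity measure, and the toy two-dimensional vectors are all assumptions. Note that building each expert's centroid requires access to embeddings of that expert's training set, which is exactly the kind of data-access requirement discussed above.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_embedding(vectors: List[Sequence[float]]) -> List[float]:
    """Average the embeddings of one expert's training examples (assumed summary)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def route(query: Sequence[float],
          expert_centroids: Dict[str, Sequence[float]],
          top_k: int = 1) -> List[str]:
    """Return the names of the top_k experts whose centroid is most similar to the query."""
    ranked = sorted(
        expert_centroids,
        key=lambda name: cosine(query, expert_centroids[name]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy embeddings, made up for illustration.
centroids = {
    "math_expert": mean_embedding([[1.0, 0.0], [0.8, 0.2]]),
    "code_expert": mean_embedding([[0.0, 1.0], [0.2, 0.8]]),
}
selected = route([0.9, 0.1], centroids)
```

In this toy setup the query is routed to `math_expert`. A classifier-based router would replace the similarity ranking with a trained model, which changes what data the router's developer needs to see.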

Our taxonomy also covers recent approaches to building agentic systems, which could be viewed as specific types of MoErging methods where experts are full language models and routing decisions are made on a step-by-step or example-by-example basis. The optimal level for MoErging may vary depending on the task and the computational resources available to each stakeholder. 

Potential benefits and use cases of modular models 

Modular models can unlock new benefits and use cases for AI, offering a promising approach to addressing challenges in current AI development. Moving forward, further substantial research is needed to validate this potential and assess feasibility.  

Modular AI may: 

  • Allow privacy-conscious contributions.  Teams with sensitive or proprietary data, such as personally identifiable information (PII) and copyrighted content, can contribute expert models and benefit from larger projects without sharing their data. This capacity can make it easier to comply with data privacy and legal standards, which could be valuable for healthcare teams that would benefit from general model capabilities without combining their sensitive data with other training data. 
  • Drive model transparency and accountability. Modular models allow specific expert models to be identified and, if necessary, removed or retrained. For example, if a module trained on PII, copyrighted, or biased data is identified, it can be removed more easily, eliminating the need to retrain the entire model and helping ensure compliance with privacy and ethical standards. 
  • Facilitate model extensibility and continual improvement. Modularity supports continual improvements, allowing new capabilities from expert models to be integrated as they are available. This approach is akin to making localized edits, allowing for continuous, cost-effective improvement. 
  • Lower the barrier to AI development for those with limited compute and data resources. Modular AI can reduce the need for extensive data and compute by creating a system where pretrained experts can be reused, benefiting academics, startups, and teams focused on niche use cases. For example, an AI agent tasked with booking flights on a specific website with limited training data could leverage general navigation and booking skills from other trained AI experts, enabling generalizable and broadly applicable skills without requiring domain-specific training data. We explore this process of transferring skills across tasks in our paper “Multi-Head Routing For Cross-Task Generalization.” 
  • Support personalization.  Modular models make it possible to equip AI agents with experts tailored to individual users or systems. For instance, AI designed to emulate five-time World Chess Champion Magnus Carlsen could enhance a player’s preparation to play a match against him. Experiments suggest that storing knowledge or user profiles in on-demand modules can match or surpass the performance of retrieval-augmented generation (RAG), potentially reducing latency and improving the user’s experience in custom AI applications. 
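The accountability and extensibility points above can be pictured with a small data-structure sketch. This is a hypothetical illustration, not an API from any MoErging method: the `Expert` and `ExpertLibrary` classes and their fields are invented here, and each expert is reduced to a plain Python callable. The point is only that experts are separate entries, so one can be added or removed without touching the others.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Expert:
    """One independently developed expert model (hypothetical representation)."""
    name: str
    contributor: str
    predict: Callable[[str], str]
    data_is_private: bool = True  # training data never leaves the contributor


@dataclass
class ExpertLibrary:
    """A collection of experts that a router can choose from."""
    experts: Dict[str, Expert] = field(default_factory=dict)

    def add(self, expert: Expert) -> None:
        # New capabilities are integrated as a localized edit.
        self.experts[expert.name] = expert

    def remove(self, name: str) -> None:
        # A problematic expert is dropped; the remaining experts are unchanged.
        self.experts.pop(name, None)

    def names(self) -> List[str]:
        return sorted(self.experts)


library = ExpertLibrary()
library.add(Expert("booking", "team_a", lambda q: "booking: " + q))
library.add(Expert("navigation", "team_b", lambda q: "navigation: " + q))
library.remove("booking")  # for example, after a data-compliance issue is found
remaining = library.names()
```

After the removal, only the `navigation` expert remains, and it was never retrained. In a real system the router would also need to be updated or re-initialized, which this sketch leaves out.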

Current limitations and looking forward 

In this blog post, we focused on a type of modular approach that involves training foundation models, which requires substantial compute power and large amounts of data. Despite the advantages of modularity, such as increased flexibility, efficiency, and adaptability, the development of foundation models remains resource-intensive, necessitating high-performance computing and robust datasets to support fine-tuning.

Recent work has begun to address these challenges by distributing the pretraining process of foundation models. Looking ahead, a promising research direction focuses on exploring how to create a minimal dataset for training “empty foundation models” while shifting most of their capabilities to external pluggable modules.

Modular methods are evolving rapidly, and we’re excited by their potential. Modularity has the capacity to democratize AI development, improve model accountability, and support efficient continuous learning. With the MoErging taxonomy, we aim to establish a shared language that fosters engagement within the research community. This research is in the early stages, and we welcome community collaboration. If you’re interested in working with us, please reach out to ModularModels@microsoft.com.

Acknowledgements

We would like to thank paper collaborators: Prateek Yadav, Colin Raffel, Mohammed Muqeeth, Haokun Liu, Tianlong Chen, Mohit Bansal, Leshem Choshen, Edoardo Ponti, Zhan Su, Matheus Pereira, Nicolas Le Roux, Nabil Omi, Siddhartha Sen, Anurag Sarkar, Jordan T. Ash, Oleksiy Ostapenko, and Laurent Charlin.

The post Toward modular models: Collaborative AI development enables model accountability and continuous learning appeared first on Microsoft Research.

Research Focus: Week of November 11, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Look Ma, no markers: holistic performance capture without the hassle

Motion-capture technologies used in film and game production typically focus solely on face, body, or hand capture, requiring complex and expensive hardware and lots of manual intervention from skilled operators. While machine-learning-based approaches can overcome these challenges, they usually only support a single camera, often operate on a single part of the body, do not produce precise world-space results, and rarely generalize outside specific contexts.

In a recent paper, “Look Ma, no markers: holistic performance capture without the hassle,” researchers from Microsoft introduce a technique for marker-free, high-quality reconstruction of the complete human body, including eyes and tongue, without requiring any calibration, manual intervention or custom hardware. This approach produces stable world-space results from arbitrary camera rigs while also supporting varied capture environments and clothing. The researchers achieve this through a hybrid approach that leverages machine learning models trained exclusively on synthetic data and powerful parametric models of human shape and motion. They evaluate their method on a number of body, face, and hand reconstruction benchmarks and demonstrate state-of-the-art results that generalize on diverse datasets.


Building AI Agents for Autonomous Clouds: Challenges and Design Principles

Using AI agents for operational resilience of cloud services, which currently require significant human effort and domain knowledge, is a high-impact application. Interest is growing in AI for IT Operations (AIOps), which aims to automate complex operational tasks like fault localization and root cause analysis, thereby reducing human intervention and customer impact. However, achieving the vision of autonomous and self-healing clouds through AIOps is hampered by the lack of standardized frameworks for building, evaluating, and improving AIOps agents.

In a recent paper, “Building AI Agents for Autonomous Clouds: Challenges and Design Principles,” researchers from Microsoft lay the groundwork for such a framework by first framing the requirements and then discussing design decisions that satisfy them. The researchers also propose AIOpsLab, a prototype implementation leveraging an agent-cloud-interface that orchestrates an application, injects real-time faults using chaos engineering, and interfaces with an agent to localize and resolve the faults. The paper sets the stage for a modular and robust framework for building, evaluating, and improving agents for autonomous clouds.

Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming

AI-assisted programming offers great promise, but also raises concerns around the trustworthiness of AI-generated code. Proof-oriented languages like F* enable authoring programs backed by machine-checked proofs of correctness. Using AI to generate code and proofs in proof-oriented languages helps mitigate these concerns, while also making proof-oriented programming more accessible to people.

In a recent preprint, “Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming,” researchers from Microsoft and external colleagues explore using AI to automate the construction of proof-oriented programs. The researchers curate a dataset of 940,000 lines of open-source F* programs and proofs, including software used in production systems ranging from Windows and Linux to Python and Firefox. The dataset includes around 54,000 top-level F* definitions, each representing a type-directed program and proof synthesis problem. A program fragment checker queries F* to check the correctness of candidate solutions. With this dataset, the researchers explore using AI to synthesize programs and their proofs in F*, finding the performance of fine-tuned smaller language models to compare favorably with LLMs, at much lower computational cost.


One-to-many testing for code generation from (just) natural language

The mostly basic Python programs (MBPP) dataset is commonly used for evaluating natural language models on the task of code generation. Despite its popularity, the original MBPP has two major problems: it relies on providing test cases to generate the right signature and there is poor alignment between “what is asked” and “what is evaluated” using the test cases. 

To address these challenges, in their recent “One-to-many testing for code generation from (just) natural language” paper, researchers from Microsoft introduce the “mostly basic underspecified Python programs” or MBUPP dataset. This dataset adapts MBPP to emphasize the natural language aspect by allowing for some syntactic ambiguity (like not specifying the return type of a function) and evaluating generated code on multiple sets of assertions (like each set covering a different return type). Besides iteratively inspecting LLM results to extend the assertion sets, the researchers carefully remove poor alignment from the instructions (like a specific algorithm to use) and perform a majority vote over slightly paraphrased instructions to improve the quality of the dataset. The researchers compare popular open and closed weight models on the original MBPP and adapted MBUPP datasets to highlight the effect of paraphrasing and new test cases on code generation evaluation. The MBUPP dataset is publicly available to encourage its use in evaluating code generation models.
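The idea of evaluating generated code against multiple sets of assertions can be sketched as follows. This is an illustration under assumptions, not the MBUPP evaluation harness: the helper name `passes_any`, the rule that a candidate passes if it satisfies every assertion in at least one set, and the toy "is the number even?" task are all made up for this example.

```python
from typing import Callable, List

# One assertion set is a list of checks; each check takes the candidate function.
AssertionSet = List[Callable[[Callable], bool]]


def passes_any(candidate: Callable, assertion_sets: List[AssertionSet]) -> bool:
    """Accept the candidate if it satisfies all checks of at least one set (assumed rule)."""
    for assertions in assertion_sets:
        try:
            if all(check(candidate) for check in assertions):
                return True
        except Exception:
            # A crash under one interpretation should not rule out the others.
            continue
    return False


# Underspecified instruction: "check whether a number is even".
# The return type is not stated, so two interpretations are accepted.
assertion_sets = [
    [lambda f: f(4) is True, lambda f: f(3) is False],   # boolean result
    [lambda f: f(4) == "Yes", lambda f: f(3) == "No"],   # string result
]

returns_bool = lambda n: n % 2 == 0
returns_str = lambda n: "Yes" if n % 2 == 0 else "No"
wrong = lambda n: n % 2 == 1

results = [passes_any(c, assertion_sets) for c in (returns_bool, returns_str, wrong)]
```

Both reasonable interpretations pass and the incorrect candidate fails, whereas a single fixed assertion set would reject one of the two valid solutions.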


The post Research Focus: Week of November 11, 2024 appeared first on Microsoft Research.

Preventing side-channels in the cloud

Cloud computing delivers scalable and cost-effective compute resources to a wide range of customers. The ability for cloud providers to share components of the hardware stack across customers, or tenants, is essential for running efficient cloud systems. For example, modern central processing units (CPUs) pack hundreds of physical hardware threads sharing terabytes of dynamic random-access memory (DRAM), which can be flexibly assigned to many independent virtual machines (VMs).

Preventing tenants from snooping on others who share the same hardware requires security mechanisms. Microsoft Azure provides strong protection via comprehensive architectural isolation through access control mechanisms implemented across the cloud platform, including the hardware and the hypervisor. Confidential computing powered by trusted execution environments further hardens architectural isolation via hardware memory encryption to protect tenants even against privileged attackers.

A changing threat landscape

Even with perfect architectural isolation, sharing microarchitectural resources, such as CPU caches and DRAM row buffers, can leak small amounts of information, because interference (due to sharing) leads to variations in the latency of memory accesses. This gives rise to so-called microarchitectural side-channel attacks, where a malicious tenant can learn information about another tenant, in the worst case their cryptographic keys.

Microsoft Azure protects tenants and critical infrastructure against currently practical side-channel attacks. For example, side-channels in on-core resources (e.g., buffers, predictors, private caches) are comprehensively mitigated by Hyper-V HyperClear via core scheduling, microarchitectural flushing and scrubbing, and virtual-processor address space isolation; and our cryptographic libraries are carefully hardened to prevent any secrets from being leaked via microarchitectural side-channels.

However, the threat landscape is changing. First, side-channel attacks are becoming increasingly sophisticated: For example, recent academic research has shown that even cache-coherence directories can be exploited to leak information across cores. Second, future CPUs are likely to employ increasingly sophisticated microarchitectural optimizations, which are prone to new kinds of attacks: For example, the recently introduced data-dependent prefetchers have already been found to leak information.

In Azure Research’s Project Venice, we are investigating principled defenses, to be prepared in case such emerging attacks start posing a risk to Azure customers.

Preventing microarchitectural side-channels with resource-exclusive domains

In a research paper, which received a distinguished paper award at the ACM Conference on Computer and Communications Security (ACM CCS’24), we present a system design that can prevent cross-VM microarchitectural side-channels in the cloud. Our design provides what we call resource-exclusive domains, which extend the architectural abstraction of private physical threads and private memory to the microarchitectural level. That is, resource-exclusive domains guarantee isolation even against powerful attackers that try to mount side-channel attacks on shared microarchitectural resources.

Our approach builds on isolation schemes, a novel abstraction of the way a CPU shares microarchitectural structures between its physical threads. Isolation schemes can be used by the hypervisor and host operating system to assign physical threads and physical memory pages, eliminating the risk of information leakage across resource-exclusive domains. Technically, for a given assignment of physical threads to resource-exclusive domains, the isolation scheme partitions each microarchitectural resource that is shared between domains (as sharing it would leak information), but without partitioning resources that are private to a domain (as partitioning them would affect performance). We achieve this using hardware mechanisms, if available, and multi-resource memory coloring, if not.
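To make the idea of multi-resource memory coloring more concrete, here is a minimal Python sketch. It is only an illustration under simplifying assumptions: the mapping functions `cache_color` and `channel_color`, the number of colors, and the domain names are hypothetical placeholders, not the reverse-engineered mappings of any real CPU and not the algorithms from the papers. Each page frame gets one color per shared resource, and a domain may only use frames whose color it owns in every resource.

```python
from collections import defaultdict


def cache_color(pfn: int, n_colors: int = 4) -> int:
    # Hypothetical mapping: low page-frame-number bits pick a cache partition.
    return pfn % n_colors


def channel_color(pfn: int, n_channels: int = 2) -> int:
    # Hypothetical mapping: a higher bit slice picks the DRAM channel.
    return (pfn >> 2) % n_channels


def partition_pages(pfns, domain_colors):
    """Give each frame to the domain that owns its color in *every* resource.

    domain_colors maps a domain name to (cache color set, channel color set).
    Frames whose colors are split across domains are left unassigned.
    """
    owned = defaultdict(list)
    for pfn in pfns:
        for domain, (cache_set, chan_set) in domain_colors.items():
            if cache_color(pfn) in cache_set and channel_color(pfn) in chan_set:
                owned[domain].append(pfn)
    return dict(owned)


def is_isolated(owned):
    """Return True if no two domains share a color in any single resource."""
    domains = list(owned)
    for i, a in enumerate(domains):
        for b in domains[i + 1:]:
            for color_of in (cache_color, channel_color):
                colors_a = {color_of(p) for p in owned[a]}
                colors_b = {color_of(p) for p in owned[b]}
                if colors_a & colors_b:
                    return False
    return True


# Hypothetical assignment: disjoint color sets per resource for two domains.
DOMAIN_COLORS = {
    "vm_a": ({0, 1}, {0}),
    "vm_b": ({2, 3}, {1}),
}

if __name__ == "__main__":
    owned = partition_pages(range(16), DOMAIN_COLORS)
    print(owned)
    print("isolated:", is_isolated(owned))
```

In this toy setup, frames whose cache color belongs to one domain and whose channel color belongs to the other cannot be handed to either domain. Choosing colorings that satisfy several resources at once without wasting too much memory is what makes the multi-resource case harder than coloring a single cache.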

In a complementary research paper (appearing at ACM CCS’24), we provide the theoretical foundations and practical algorithms for computing such multi-resource memory coloring schemes for existing microarchitectures, as well as design patterns for future microarchitectures to support a large number of resource-exclusive domains.

We have implemented our approach in a research prototype based on Microsoft Hyper-V for a modern cloud chiplet-based CPU, AMD EPYC 7543P, that supports VM-level trusted execution environments. Using a collection of microbenchmarks and cloud benchmarks, we demonstrate that our approach eliminates all identified side-channels and incurs only small performance overheads. For example, when allocating resources at chiplet and channel granularity (i.e., coupling a chiplet with one of the local DRAM channels) we observe an overhead of less than 2%; and only up to 4% when allocating resources at chiplet granularity and coloring with 2MB pages.

Co-designing cloud platforms for future microarchitectural isolation

To validate the effectiveness and practicality of our approach, we inferred isolation schemes for a single CPU by reverse-engineering its microarchitecture. This approach is incomplete and does not scale to the diverse hardware fleet available in the cloud. We are working with CPU vendors to develop isolation schemes for future CPUs, which will then be exposed via the hardware interface for consumption by the hypervisor’s hardware abstraction layer. In this way, we will be able to reap the benefits of microarchitectural performance optimizations while continuing to provide strong security guarantees to cloud tenants. 

Additional Contributors

Cédric Fournet, Senior Principal Researcher
Jana Hofmann, Researcher
Oleksii Oleksenko, Senior Researcher

The post Preventing side-channels in the cloud appeared first on Microsoft Research.

Collaborators: Prompt engineering with Siddharth Suri and David Holtz

Illustrated images of Siddharth Suri and David Holtz. “Collaborators: A Microsoft Research Podcast” runs along the bottom.

Transforming research ideas into meaningful impact is no small feat. It often requires the knowledge and experience of individuals from across disciplines and institutions. Collaborators, a Microsoft Research Podcast series, explores the relationships—both expected and unexpected—behind the projects, products, and services being pursued and delivered by researchers at Microsoft and the diverse range of people they’re teaming up with.

How significant will prompt engineering be as generative AI models continue to advance? After previous successful collaborations, Siddharth Suri, a Microsoft senior principal researcher, and David Holtz, an assistant professor at the University of California, Berkeley and a former intern of Suri’s, reunited to address the debate with data. In this episode, they discuss their study of how prompting approaches change as models advance. They share how the work required finding a variety of additional perspectives in what they describe as an Ocean’s Eleven-style recruitment effort; why mastering chain-of-thought prompting and other specialized methods might not be a prerequisite for getting what you want from a model; and, for aspiring researchers, what some butterflies can tell you about the types of challenges you’re pursuing. Suri and Holtz’s work is part of the Microsoft Research initiative AI, Cognition, and the Economy, or AICE, and is supported by the Microsoft Research initiative Accelerate Foundation Models Research, or AFMR.

Transcript

[TEASER] [MUSIC PLAYS UNDER DIALOGUE]

SIDDHARTH SURI: So, it’s, like, just before Thanksgiving 2020. My manager came to me, and she was like, Sid, we need somebody to understand, what are the effects of AI on society? And I was like, “Oh, yeah, small question! Yeah, I can do that by myself! Yeah. I’ll get you an answer by Tuesday,” OK? I felt like I was dropped in outer space, and I had to find Earth. And I didn’t even … I couldn’t even see the sun. Like, I … there was this entirely new system out there. No one knew how to use it. What are the right questions to ask? We were using the system to study how people use the system? Like, what the heck is going on?

DAVID HOLTZ: And I remember thinking, this seems like the most important thing that a person could be working on and studying right now. Like, anything else that I’m working on seems unimportant in comparison to the impact that this technology is poised to have on so many different facets of, you know, life and the economy and things like that.

[TEASER ENDS]

GRETCHEN HUIZINGA: You’re listening to Collaborators, a Microsoft Research Podcast showcasing the range of expertise that goes into transforming mind-blowing ideas into world-changing technologies. I’m Dr. Gretchen Huizinga.


[MUSIC FADES]

Today I’m talking to Dr. Siddharth Suri, also known as Sid, who’s a computational social scientist and a senior principal researcher at Microsoft Research. With him is Dr. David Holtz, an assistant professor in the Haas School of Business at the University of California, Berkeley. Sid and David are co-leading a team of researchers who are exploring the fascinating world of prompt engineering as part of the AI, Cognition, and the Economy, or AICE, initiative at Microsoft Research. I can’t wait to get into the meat of this research, but before we do, let’s meet our researchers. Sid, you first!

SIDDHARTH SURI: Hey, Gretchen, thanks for having me.

HUIZINGA: Tell us about yourself. At what intersection do your research interests lie, and what path led you to what you’re doing at Microsoft Research today?

SURI: So I got to where I am now through a very long and circuitous route, and I’ll give you the sort of CliffsNotes version of it, if you will. If you start back in grad school, my dream was to become a theoretical computer scientist. And what that basically means is writing algorithms. And what that basically means is pushing Greek symbols around a page. [LAUGHTER] And it turns out I’m good at that, but I’m not great at that. And towards the end of grad school, I was working with another professor, and he was doing these experiments that involved humans, and what we would do is we bring undergraduates into a lab. They were sitting in front of a computer using our software. We’d arrange them in different networks, so you’re trying to solve a problem with the people who are next to you in this network. And then we would change the structure of that network and have them solve the problem again. And we would try to understand, how does the structure of this network affect their ability to solve this problem? And I remember analyzing this data. I just was swimming around in this data and having a grand old time. I … nights, weekends … I remember riding the bus to school in Philadelphia, and I was trying to think about new analyses I could do. And it was just so … it was fun. I couldn’t get enough. And I remember my adviser talking to me one day, and he’s like, Sid, you’re really good at this. And I responded with, really good at what? I’m just doing the obvious thing that anybody would do. And he was like, bro, this is not obvious. Like, you know, you got a knack for this. And then that, sort of, set me on this path, and then, just to make a little long story short, I don’t have tons of self-awareness. So it took me like 10 full years to go from, like, deciding to hang up being a theoretical computer scientist and understanding humans, human behavior, and using technology to understand human behavior. 
And that’s, kind of, where I ended up as a computational social scientist. I’ve sort of gone all in in that space, as a computational social scientist. And that’s how David and I met. He’s a rising star in that space, as well. He became my intern. And that’s how we met. I’ll let him share his origin story with you.

HUIZINGA: Well, let’s do, David. I noticed you have a strong science background, but now you’re an assistant professor in a business school. So you got to do a little dot-connecting here. How did a guy with a degree in physics and astronomy—and should I also mention theater and dance? I’m so intrigued—um, how did that guy wind up working with MBAs and economists?

DAVID HOLTZ: Yeah, thanks for having me, Gretchen. Similar to Sid, my path to where I am today is also long and circuitous, and I will try to give you the CliffsNotes version. When I was young, I was always super interested in physics, and I think what drew me to physics was the way that it combined math, which I was very good at when I was younger, and the ability to answer big existential questions. Where does the universe come from? What’s the universe made out of? Is it growing? Is it shrinking? Things like that. And so when I went to college, I didn’t think too deeply about what I was going to study. I just, sort of, you know, always wanted to do physics. I’m going to do physics. And so I majored in physics. And then … I did my undergrad at Princeton, and there’s something about the physics department at Princeton where it’s almost just assumed everyone’s going to go get their PhD. And so there was a lot of “ambient pressure” to apply to graduate school. And so I actually started my physics PhD at Johns Hopkins. And as a PhD student, I was working on these large telescopes that look at remnant light from right after the Big Bang and try to characterize, you know, tiny fluctuations in this field of light that fills the night sky in a wavelength-like range that is not visible to the human eye. And by, sort of, characterizing those fluctuations in the light field, you can learn things about what the universe is made out of and how it’s evolving and all these types of things. It all sounds very cool. But the teams that conduct this research at this point are really big. It’s like you’re in a company, essentially. So there’s a hundred people working on building this telescope, analyzing these telescopes, so on and so forth. And so the actual day to day of my life as a physics PhD student was really far removed from the big existential questions that I was actually really interested in. 
My PhD dissertation probably would have been developing a system that moved a mirror in exactly this way so that light polarization appears, you know, in the experimental apparatus. You’re basically doing an engineering degree. And on top of all that, like Sid, I was good at physics, but I think I realized I was not great at physics. And I saw a lot of people around me in my classes and in my labs that were great at physics and moreover were having a really hard time finding a job as a physics professor after they graduated despite being great at physics. And so I started having these realizations during graduate school and had never done anything really except physics and so took a leave of absence and actually came out to the Bay Area and started working out here in advertising, which is not something that I’m necessarily super excited about—and as a product manager, which is not what I do. But it was kind of the hop that I needed to try something different. And after some amount of time, moved from doing product management to doing data science. This was right when the data science boom was starting. I think the year that I came to the Bay Area, DJ Patil, who used to be the chief data scientist for the US, had written this very famous HBR article about, you know, how data science was the sexiest job of the 21st century …

HUIZINGA: Right!

HOLTZ: … so I, kind of, took my physics credentials and became a data scientist and eventually also moved out of advertising and went and worked at Airbnb, which at the time was growing really quickly and, you know, was sort of a young company where a lot of exciting things were happening. You know, I loved working at Airbnb. I learned a lot. I met a lot of interesting people. I learned a lot working in ad tech, as well, and eventually just found myself feeling pulled back to academia. Like, I really liked the questions that I was working on, the types of work that I was doing. Similar to Sid, I found that I was really good at analyzing data. I didn’t feel like I was doing anything particularly crazy, but people around me were saying, no man, you’re really good at this! And so I started looking for PhD programs where I could do the type of work that I was doing as a data scientist at Airbnb but in a more academic environment. And that, sort of, naturally led me to PhD programs in business schools. I didn’t know what a PhD in a business school entailed, but there were professors in those departments that were doing the research that I wanted to do. And so that’s how I ended up there. And so my research when I started out as a PhD student was, I think, relative to a lot of people, I didn’t start from, like, first principles. I don’t know that I necessarily had this one little thing that I was super interested in. I was really interested in solving applied problems and, in particular, I think some of the applied problems that I had seen out in the world working in tech. And over time, I think I found that I’m just really interested in new technologies and how those technologies affect, you know, the flow of information, how people collaborate, what happens to the economy, so on and so forth. 
And so I sort of started by just trying to answer a few problems that were in front of me and discovered this was kind of, you know, sort of the unifying theory of the things that I was interested in studying. And I think … you know, in hindsight, I think one thing that is true that has kind of guided, you know, my path—and this connects back to the theater and dance, you know, minor that you had alluded to earlier—is I’ve always been a really social person. I’ve always been really interested in humans and how they interact. I think that type of storytelling is really at the crux of, you know, theater and music and things like that. And when I was younger, for sure, I spent a lot of time writing music, playing music, doing improv comedy, performing on stage. And as a physicist, that itch wasn’t necessarily getting scratched, both because I was just studying, you know, extremely small particles and was doing it in a pretty lonely lab. And a nice thing about being a computational social scientist is that I’m studying humans, which is really interesting. I think it plugs into something that I’m really passionate about. And a cool thing about getting to do that in particular in a business-school setting, I think, is that, you know, I’m talking often to people at companies and, you know, lecturing to MBA students, who are really outgoing, gregarious people. And so it presents a really nice opportunity to, kind of, fuse, you know, my interest in science and information and technology with that other interest in humans and connection and, you know, the opportunity to, sort of, interact with people.

HUIZINGA: Yeah, yeah. Well, escaping from middle management in physics is probably a good thing … Well, before we get into the details of your collaboration on prompt engineering, let’s make sure everyone knows what we’re talking about. Sid, when we talked before, I told you, to be honest, when I first heard the phrase “prompt engineer” a couple years ago, I laughed because I thought it was a joke, like sanitation engineer. Then when I heard it was a real job, I laughed a little bit less. And then when I heard it was not only a real job but one that, if you were good at it, could pay six figures, I stopped laughing altogether and started paying attention. So I’d like you, Sid, to give us a brief history of prompt engineering. What is it, when and how did it become a thing, and why is it different from anything I’d do in garden-variety internet search?

SURI: So generative AI wants to do just that. It wants to generate something for you. But how do you express what you want? What do you want the system to give you? And the answer is a prompt. So I’ll give you an example. Whenever there’s a new model out there, especially one that generates images, a prompt I use—you might laugh at this—is, “Show me a picture of Bruno Mars on the surface of Mars eating a Mars bar.” [LAUGHTER] And the reason why I use that prompt is because Mars bars aren’t in the training data. There’s not a lot of pictures of Mars in the training data. And everybody knows who Bruno Mars is. So that’s me describing to the model what I want. That is a prompt. Show me a picture with these elements in it, OK? But this is where the hard part starts. It sends you something. Oh. I didn’t want Mars to be that color of red. Could you change it to a deeper red or more of an orange? OK. Now, could you put a little dust in the atmosphere? OK. Well, I want a moon in the background. I didn’t know I wanted a moon in the background, but now I do. Where’s the sun in this image? I don’t know. And then the whole thing, kind of, becomes much more rich and a much bigger exploration compared to, say, putting keywords into a search engine. It’s a really much more rich space to explore. Now you asked me … a part of your question was, why is prompt engineering difficult? It’s difficult for a number of reasons. Number one, you don’t always know what you want.

HUIZINGA: Yeah …

SURI: And so it’s that conversation with the system to figure that out. Number two, you might not be expressing what you want as clearly as you think you are.

HUIZINGA: Right …

SURI: Number three, the problem could be on the receiver end. These models are new. You might be expressing it clearly, but they might not be understanding what you’re saying as clearly as you would hope. And then the fourth reason is the one I just said, which is, like, what you’re asking for is not just like, “Give me a document relevant to these keywords,” or “Give me some information relative to these keywords,” as you would do in traditional search. You’re asking for something much more rich. And to get that richness that you were hoping for requires this prompt. And that requires an exploration of the idea in your head and an expression of that idea in the real world. So that’s what prompt engineering is, and that’s why it’s hard.

HUIZINGA: OK, and when would you say it became a thing? I mean, prompt engineer is an actual job, but it was a thing first, right? It didn’t start out to be a job; it started out to be something you did, so …

SURI: So when these models came out, you know, what was it, late, around 2020, late 2020, I think, when they first started becoming popular. So prompting had been around in academia a few years prior to that, but it first hit the mainstream when these models, sort of, first came out around 2020, and why … why this job? Why this six-figure salary? Why all … what’s all the hoopla about it? And like I said before, these systems are new. No one knew how to use them. No one knew how to express what they want, A. B, there’s a lot of arcane ways to prompt that aren’t obvious at the beginning. Like, I’ll give you a few examples. One way to prompt is to give the system examples of what you’re looking for. Say you want something to classify an email as spam or not spam. You might give it a few emails that are spam and a few emails that are not spam and say, hey, if it’s more like this, call it spam; if it looks more like that, call it not spam. And so that’s one example. Another example would be like, OK, I’m a small-business owner. I need some advice. This is the problem I’m facing. Give me some advice to solve this problem as if you were Bill Gates.

HUIZINGA: Oh …

SURI: That’s, like, adopting a persona. That’s another example. A third example would be, like, OK, you have a math problem. You’re trying to solve this math problem, and to get it done correctly, some of these systems need what’s known as chain-of-thought prompting, which is tell me all the steps you’re going through to solve this problem. Don’t just give me the answer 17. Give me all the steps you needed to get to 17. And that helps the system guide it, more likely, towards a correct answer. And so these are all arcane, esoteric methodologies to getting one of these models to give you the right answer, the answer you want. And being a prompt engineer means you’re an expert in these things and you’re more likely to get these correct answers than maybe someone off the street who isn’t familiar with these techniques.
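The three techniques Suri describes (giving examples, adopting a persona, and chain-of-thought prompting) can be sketched as plain prompt templates. The Python below is only an illustration: the function names and the exact wording of each template are assumptions made for this example, not prompts used in the study, and no model is called.

```python
def few_shot_prompt(examples, query):
    """Few-shot prompting: show labeled examples, then the new input."""
    lines = ["Classify each email as spam or not spam.", ""]
    for text, label in examples:
        lines += [f"Email: {text}", f"Label: {label}", ""]
    # Leave the last label blank so the model fills it in.
    lines += [f"Email: {query}", "Label:"]
    return "\n".join(lines)


def persona_prompt(persona, problem):
    """Persona prompting: ask the model to answer in a given role."""
    return f"Answer as if you were {persona}.\n\nProblem: {problem}\nAdvice:"


def chain_of_thought_prompt(question):
    """Chain-of-thought prompting: ask for the steps, not just the answer."""
    return (
        f"{question}\n"
        "Show every step of your reasoning before giving the final answer."
    )


if __name__ == "__main__":
    examples = [
        ("Win a free cruise now!", "spam"),
        ("Meeting moved to 3 pm", "not spam"),
    ]
    print(few_shot_prompt(examples, "Claim your prize today"))
    print(persona_prompt("Bill Gates", "My small business is losing customers."))
    print(chain_of_thought_prompt("What is 8 + 9?"))
```

Each function only builds a string, so the same templates could be sent to any model. Whether a given technique actually helps depends on the model.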

HUIZINGA: Right, right, right. Well, we’re going to talk a lot more about technique and the research that you did. And you’ve alluded to, at the beginning here, a visual, like describing … I heard graphic designers hearing the client when you were talking about it: “I didn’t want that red. Maybe put the moon in …” [LAUGHS]

SURI: Yeah, exactly!

HUIZINGA: Can you just tell me what you want to begin with? No, apparently not. But you’re also talking about verbal prompts and writing and so on. So we’ll get into that in a bit. But I want to go over and talk a little bit more about this research and why it’s where it is. This episode is the latest in our “series within a series” on AI, Cognition, and the Economy at Microsoft Research. And so far, we’ve talked about the impacts of AI on both cognition with Abi Sellen and the economy with Mert [Demirer] and Brendan [Lucier]. You can look up those episodes, fantastic episodes. This topic is a little less obvious, at least to me. So, David, maybe you could shed some light on how research for prompt engineering became part of AICE and why it’s an important line of research right now.

HOLTZ: So I think this project relates to both cognition and the economy. And let me lay out for you the argument for both. So first, you know, I’m not a cognitive scientist, but I think there are some interesting questions around how people, and in particular common people who are not computer scientists, conceive of and interact with these models, right. So how do they learn how to prompt? Do they think about different generative models as all being the same, or are they sort of developing different prompting strategies for different models? What are the types of tricks that they discover or use when they’re prompting models? And at the time that we started working on this project, there wasn’t a lot of research on this and there wasn’t a lot of data on this. You know, the data that existed typically is on the servers of big companies like Microsoft. It’s not really available to the public or to many researchers. And then the research is all, you know, sort of disproportionately focused on these esoteric prompting strategies that Sid mentioned, like chain-of-thought prompting, which are useful but are not things that, you know, my family members that are not scientists are going to be using when they’re trying to interact with, you know, the latest large language model that has been launched. So that was one draw of the project. The other thing that I think is interesting—and the reason that this project was well-suited to the AICE program—is that around the time that we were starting to work on this project, a bunch of research was coming out, and I’ve contributed to some of this research on a different project, on the impacts that generative AI can have on different economic outcomes that we care about. So things like productivity and job performance. And one interesting pattern that has emerged across numerous different studies trying to answer those types of questions is that the benefits of generative AI are often not uniform. 
Usually, generative AI really helps some workers, and there are other workers that it doesn’t help as much. And so there’s some interesting questions around why is it that some people are able to unlock big productivity gains using generative AI and others can’t. And one potential reason for this is the ways that people prompt the models, right. So I think understanding how people are actually interacting with these models when they’re trying to do work is a big part of understanding the potential impact that these models can have on the economy.

HUIZINGA: OK, it’s “how I met your mother” time. Let’s talk for a minute about how you two came to be working, along with what you’ve referred to as a “crack team” of researchers, on this study. So, Sid, why don’t you tell us, as you remember it, who called who, how the conversation went down, and who’s all involved. And then David can confirm, deny, or add color from his perspective.

SURI: OK, I need you to mentally rewind back to, like, November 2020. So it’s, like, just before Thanksgiving 2020. My manager came to me, and she was like, Sid, we need somebody to understand, what are the effects of AI on society? And I was like, “Oh, yeah, small question! Yeah, I can do that by myself! Yeah. I’ll get you an answer by Tuesday,” OK? Like, what the heck, man? That was like one of the biggest questions of all time. The first thing I did was assemble a team. We write an agenda; we start going forward from there. You know, Scott Counts is a colleague of mine; he was on that team. Not long after that … as I had mentioned before, David was my intern, and he and I started brainstorming. I don’t remember who called who. Maybe David does. I don’t remember that. But what I do remember is having several fun, productive brainstorming conversations with him. I remember vividly, I was, like, sort of walking around my house, you know, upstairs, kind of, trying to bounce ideas off of him and get the creative juices flowing. And one of the things we were talking about was, I just felt like, again, this is early on, but prompting is the thing. Like, everybody’s talking about it; nobody knows how to do it; people are arguing. So David and I were brainstorming, and then we came up with this idea of studying prompting and how prompting changes as the models get better and better, which they are, at a torrid rate. And so that was our, sort of, key question. And then David actually was primarily involved in assembling the crack team, and he’s going to talk more about that. But as a side note, it’s really cool for me to see David, kind of, grow from being, you know, just a great, sort of, individual scientist to, like, the leader of this team, so that was, kind of, a cool thing for me to see.

HUIZINGA: Hmm. You know, you tell that story … Peter Lee, who’s the president of Microsoft Research, tells a similar story where a certain CEO from a certain company came and dropped him in the middle of the AI and healthcare ocean and said find land. So did it have that same sort of “overwhelmed-ness” to it when you got asked to do this?

SURI: Overwhelmed would be an understatement! [LAUGHTER] It was overwhelming to the point where I was borderline afraid.

HUIZINGA: Oh, dear!

SURI: Like, you know, Peter has this analogy you mentioned, you know, “dropped in the ocean, find land.” I felt like I was dropped in outer space and I had to find Earth. And I didn’t even … I couldn’t even see the sun. Like, I … there was this entirely new system out there. No one knew how to use it. What are the right questions to ask? We were using the system to study how people use the system? Like, what the heck is going on? This was, like, stress levels were on 12. It was a sort of wild, white-knuckle, anxiety-inducing, fun, intense ride. All of those emotions wrapped up together. And I’m happy it’s over [LAUGHS] because, you know, I don’t think it was sustainable, but it was an intensely productive, intensely … again, just in case there’s any budding scientists out there, whenever you’re like swimming around in a problem and your gut is a little scared, like, I don’t know how to do this. I don’t know if I’m doing this right. You’re probably working on the right problem. Because if you know how to do it and you know how to do it right, it’s probably too easy.

HUIZINGA: Yeah!

SURI: And in this moment, boy, my gut was telling me that nobody knows how to do this and we got to figure this out.

HUIZINGA: Right. David, from your theater background, did you have some of these same emotions?

HOLTZ: Yeah, I think so. I think Sid and I, it’s interesting, we have different perspectives on this kind of interesting generative AI moment. And to use the theater analogy, I think being, you know, like, a researcher at Microsoft, Sid has kind of been able, the whole time, to see behind the curtain and see everything that’s going on. And then as someone that is, you know, a researcher in academia, I’ve sort of been in the audience to some extent. Like, I can see what’s coming out onto the stage but haven’t seen all the craziness that was happening behind the curtain. And so I think for me, the way that I would tell the story of how this project came together is, after I had finished my internship and Sid and I—and a number of coauthors—had this very successful remote work paper, we just kept in touch, and every few weeks we’d say, hey, you know, want to chat, see what we’re both working on, swap research ideas?

HUIZINGA: Yeah …

HOLTZ: And for me, I was always looking for a way to work together with Sid. And if you look around at, you know, the history of science, there’s these Kahneman and Tversky, like, Watson and Crick. Like, there are these teams that stay together over long periods of time and they’re able to produce really amazing research, and so I realized that one thing that I should prioritize is trying to find people that I really like working together, that I really click with, and just trying to keep on working with those people. Because that’s one of the keys to having a really successful career. At the same time, all this generative AI stuff was happening, and I went to a few talks. One of them was on the Berkeley campus, and it was a talk by someone at Microsoft Research, and it was about, sort of, early signs of how amazing, you know, GPT-4 was. And I remember thinking, this seems like the most important thing that a person could be working on and studying right now. Like, anything else that I’m working on seems unimportant in comparison to the impact that this technology …

HUIZINGA: Wow …

HOLTZ: … is poised to have on so many different facets of, you know, life and the economy and things like that. And so I think things kind of came together nicely in that there was this opportunity for Sid and I to work together again and to work together again on something that we both agreed was just so incredibly important. And I think we realized this is really important. We really want to work on this problem. But we’re also both super busy people, and we don’t necessarily have all the skills that we need to do this project. And given how important this question is and how quickly things are moving, we can’t afford to have this be a project where it’s like, every now and then … we come back to it … maybe we’ll have a paper in, like, three years. You know, like, things needed to happen really quickly. And so that’s where we got to thinking, OK, we need to put together a team. And that’s kind of where this, like, almost, like, Ocean’s Eleven, sort of, scene emerged [LAUGHTER] where we’re like, we’re putting together a team. We need a set of people that all have very particular skills, you know, and I’m very lucky that I did my PhD at MIT in this sort of community that is, I would say, one of the highest concentrations of really skilled computational social scientists in the world, basically.

HUIZINGA: Wow.

HOLTZ: And so I, sort of, went to, you know, to that community and looked for people. I reached out to people that I had met during the PhD admissions program that were really promising, you know, young PhD students that might want to work on the project and, sort of, put the team together. And so this project is not just Sid and I. It’s six other people: Eaman Jahani, Ben Manning, Hong-Yi TuYe, Joe Zhang, Mohammed Alsobay, and Christos Nicolaides. And everyone has brought something unique and important to the project. And it’s really kind of crazy when you think about it because on the one hand, you know, sometimes, when we’re talking, it’s like, wow, eight people. It’s really a lot of people to have on a paper. But at the same time, you, kind of, look at the contributions that every single person made to the project and you, kind of, realize, oh, this project actually could not have happened if any one of these people were not involved. So it’s been a really interesting and fun project in that way.

SURI: One thing I just wanted to add, Gretchen, is, I’m a little bit older than David, and when I look back at my career and my favorite projects, they all have that property that David was alluding to. If you knocked one of the coauthors off that project, it wouldn’t have been as good. To this day, I can’t figure out why is that so important, but it is. It’s just this notion that everyone contributed something and that something was unique that no one else would have figured out.

HUIZINGA: Well, and the allusion to Ocean’s Eleven is exactly that. Like, they have to get someone who can crack a safe, and they have to get someone who’s a contortionist and can fit into a box that no one can see, and blah, blah, blah. And I don’t know if you’ve argued about which one of you is George Clooney and which one of you is Brad Pitt, but we’ll leave that for a separate podcast.

SURI: Well, actually … [LAUGHTER] it’s not even a question because Eaman Jahani is by far the most handsome one of us, so he’s Brad Pitt. It’s not even close. [LAUGHS]

HUIZINGA: David’s giggling!

HOLTZ: Yeah, I think Sid … I’d agree with that. I think Sid is probably George Clooney.

SURI: I’ll take it. I’ll take it!

HUIZINGA: Anytime! Well, we’ll talk about some more movies in a minute, but let’s get into the details of this research. And, Sid, I was looking at some of the research that you’re building on from your literature, and I found some interesting papers that suggest there’s some debate on the topic. You’ve just alluded to that. But let’s talk about the titles: AI’s hottest job: Prompt engineer, and, like, Tech’s hottest new job: AI whisperer. No coding required. But then there’s this Harvard Business Review article titled AI prompt engineering isn’t the future. And that left me wondering who’s right. So I suspect this was part of the “prompting” for this research. Tell us exactly what you did and how you did it.

SURI: Sure, so where we came to this question was, we came at it from a couple directions. One is what you just said. There’s this conversation going on in the public sphere, which is on the one hand, there’s these jobs; there’s this notion that prompting, prompt engineering, is a super important thing; it’s paying six figures. On the other hand, there’s also this notion that these models are getting better and better. They’re more able to figure out what you needed and guess what you needed and so maybe we’re not going to need prompting going forward.

HUIZINGA: Right.

SURI: And David and I were like, this is perfect. One of my mentors, Duncan Watts, I always joke with him that every introduction of our paper is the same. It’s “There’s this group of people that say x, and there’s this group of people that say the opposite of x. So we did an experiment to figure it out.” And the reason why every introduction of one of my papers is the same is because you can never say at the end it was obvious. If it was so obvious, then how come there’s two groups of people disagreeing on what the outcome’s going to be? So what we did in the experiment—it’s very simple to explain—is we gave people a target image, and then they randomly either got DALL-E 2 or DALL-E 3. And we said, “OK, write a prompt to generate this target image that we’ve given you,” and we give them 10 tries. “And you can iterate; you can improve; you can experiment. Do whatever you want.” And the notion was, as models progress, what is the relationship between people’s ability to prompt them to get to the target?

HUIZINGA: That’s the end of it. [LAUGHS]

SURI: Yeah. [LAUGHS]

HUIZINGA: That’s the most succinct explanation of a research study that I’ve ever heard. Congratulations, Sid Suri! So I have a question, and this is like … you’ve talked a bit already about how you iterate to get to the target image. My experience is that it can’t remember what I told it last time. [LAUGHTER] So if I put something in and then I say, well, I want you to change that, it starts over, and it doesn’t remember what color red it put in the first image. Is that part of the process, or are these models better than what I’ve done before?

SURI: The models are changing, and that is … and, sort of, the history, the context, the personalization is what you’re referring to. That is coming online in these models already and in the near future. Maybe at the time we did the study, it wasn’t so common. And so they were suffering the same issue that you just alluded to. But going forward, I do expect that to, sort of, fade away a little.

HUIZINGA: OK. Well, David, Sid’s just given us the most beautifully succinct description of people trying to get the model to give them the target image and how many tries they got. What did you find? What were the big takeaways of this research?

HOLTZ: So let me start out with the most obvious finding that, you know, like, Sid was saying, ideally, you know, you’re, kind of, answering a question where it makes sense that people are on both sides of this argument. One thing that we looked at that you’d be surprised if there was someone on the other side of the argument is, OK, do people do a better job when we give them the better model? If we give them DALL-E 3 instead of DALL-E 2, do they do a better job of re-creating the target image? And the answer is unsurprisingly, yes. People do a better job when we give them the better model. The next thing that we looked at—and this is where I think the results start to get interesting—is why do they do better with the better model? And there’s a couple of different reasons why this can be the case. The first could be that they’re writing the exact same prompts. They interact with the model exactly the same, whether it’s DALL-E 2 or DALL-E 3, and it’s just the case that DALL-E 3 is way better at taking that input and translating it into an image that is the image that you had in mind with that prompt. So, you know, sort of, imagine there’s two different artists. One is like a boardwalk caricature artist; the other one is Vincent van Gogh. Like, one of them is probably going to be better at taking your input and producing a really high-quality image that’s what you had in mind. The other possibility is that people, sort of, pick up on the fact that one of these models is different than the other. Maybe it’s more expressive. Maybe it responds to different types of input differently. And as you start to figure that out, you’re going to actually prompt the model, kind of, differently. And so I think the analogy I would draw here is, you know, imagine that you’re driving a couple of different cars maybe, like, one has really nice power steering and four-wheel drive and things like that. The other one doesn’t have all these cool features. 
You know, you’re probably going to actually handle that car a little bit differently when you take it out on the road relative to a really simple car. And what we find when we actually analyze the data is that both of these factors contribute to people doing better with the higher-quality model. And they actually both contribute equally, right. So insofar as people do better with DALL-E 3, half of that is because DALL-E 3 is just a better model at, like, taking the same input and giving you, like, an image that’s closer to what you had in mind. But the other half is due to the fact that people, sort of, figure out on their own, oh, this model is different. This model is better. It can maybe respond to my inputs a little bit more expressively. And they start prompting differently. And one thing that’s really neat and interesting about the study is we didn’t tell people whether they were given DALL-E 2 or DALL-E 3. So it’s not even like they said, oh, you gave me the good model. OK, let me start prompting differently. They kind of just figure this out by interacting with the tool and kind of, you know, realizing what it can do and what it can’t do. And specifically when we look at what people are doing differently, they’re, kind of, writing longer prompts; they’re writing more descriptive prompts. They have way more nouns and verbs. They’re kind of doing less feeling around in the dark and kind of finding, like, a way of interacting with the model that seems to work well. And they’re kind of doubling down on that way of interacting with the model. And so that’s what we saw. And so when it connects back to your question of, you know, OK, prompt engineering, like, is it here to stay, …

HUIZINGA: Yeah.

HOLTZ: … or is prompt engineering going away? I think one way that we think about interpreting these results is that the prompts do matter, right. Like, if you didn’t think about how to prompt different models and you just wrote the same prompts and left that prompt “as is” for, you know, months or years, you’d be missing out on tons of the gains that we stand to experience from these new, more powerful models because you need to update the prompts so that they take advantage of the new model capabilities. But on the flip side, it’s not like these people needed to, you know, go read the literature on all these complicated, esoteric prompting strategies. They kind of figured it out on their own. And so it seems like prompting is important, but is it necessarily prompt engineering, where it’s this really, you know, heavy-duty, like, thing that you need to do or you maybe need to go take, like, a class or get a master’s degree? Maybe not. Maybe it’s just a matter of people interacting with the models and, kind of, learning how to engage with them.

HUIZINGA: Well, David, I want to ask you another question on that same line, because AI is moving so fast on so many levels. And it’s still a relatively new field. But now that you’ve had some time to reflect on the work you just did, is there anything that’s already changed in the conversation around prompt engineering? And if so, what are you thinking about now?

HOLTZ: Yeah. Thanks for the question. Definitely things are changing. I mean, as Sid mentioned, you know, more and more the way that people interact with these models, the models have some notion of history. They have some notion of context. You know, I think that informs how people are going to write prompts. And also, the types of things that people are trying to do with these models is constantly changing, right. And so I think as a result, the way that we think about prompting and, sort of, how to construct prompts is also evolving. So I think the way that we think about this study is that it’s by no means, you know, the definitive study on prompt engineering and how people learn to prompt. I think everyone on our team would agree there’s so much more to do. But I think the thing that struck us was that this debate that we mentioned earlier, you know, is prompting important? Will prompt engineering stay? Maybe it doesn’t matter? It was really a debate that was pretty light on evidence. And so I think the thing that we were excited to do is to sort of, you know, start to chip away at this big question with data and with, you know, an experiment and just try to start developing some understanding of how prompting works. And I think there’s tons more to do.

HUIZINGA: Right, right, right.

SURI: Just to add to that …

HUIZINGA: Yeah, please.

SURI: Again, if there’s any sort of young scientists out there, one of the things I hate doing with other scientists is arguing about what’s the answer to this question. So what I always do when there’s an argument is I just shift the argument to instead of arguing about is this question going to be yes or no, is what’s the data we need to answer the question? And that’s where David and I, sort of, came in. There was this argument going on. Instead of just arguing between the two of us about what we think it’s going to be, we just shifted the conversation to, OK dude, what data do we need to gather to figure out the answer to this question? And then boom, this project was off and running.

HUIZINGA: You know, that could solve so many arguments, you know, in real life, just like, you don’t know and I don’t know, why are we arguing? Let’s go find out.

SURI: Yeah, so instead of arguing about who knows what, let’s argue about what’s the data we need so that we’ll be convinced!

HUIZINGA: Well, on that line, Sid, another paper in the literature that you looked at was called The prompt report: A systematic survey of prompting techniques. And we’ve talked a little bit about what those techniques involve. But what has your research added to the conversation? Specifically, I’m interested to know, I mean, we did talk about tricks, but is there coaching involved or is this just sort of feel-your-way-in-the-dark kind of thing? And how fine is the line between what you referred to as alchemy and chemistry in this field?

SURI: The alchemy and chemistry analogy was David’s brilliant analogy, and what he was saying was, way back when, there was alchemy, and then out of that grew chemistry. And at the moment, there’s these, sort of, niche, esoteric ways of prompting—chain-of-thought, embody a persona, this kind of thing. And how are those going to get propagated out into the mainstream? That’s how we go from alchemy to, sort of, chemistry. That was his brilliant analogy. And there’s several punchlines of our work, but one of the punchlines is, people can figure out how to take advantage of the new capabilities of these models on their own, even when they don’t know the model changed. So that’s a great democratization argument.

HUIZINGA: Hmm …

SURI: That, OK, you don’t need to be the six-figure Silicon Valley hotshot to figure this out. That maybe, maybe everyone in the world who has access—who has internet access, electricity, and access to one of these models—they can sort of pick themselves up by their own bootstraps, learn how to use these things on their own. And I want to go back to an analogy you said a while ago, which was the analogy to traditional internet search, …

HUIZINGA: Yeah.

SURI: OK? People forgot this, but we’ve learned how to search over the course of about 30 years. I’m 45 years old, so I remember the early search engines like AltaVista, Lycos, things like that. And basically, getting anything useful out of them was pretty much impossible. I really wanted to swear right there, but I didn’t. [LAUGHTER] And what people forgot, people forgot that they didn’t know how to ride a bike, OK? And they forgot that we didn’t actually know … these systems didn’t work that well; we didn’t know how to query them that well; we didn’t know how to get anything useful out of them. And then 30 years later, no one thinks about searching the internet as a thing we do. It’s like turning on the faucet. You just do it. It’s taken for granted. It’s part of our workflows. It’s part of our daily life. We do it without thinking about it. Right now, we’re back in those AltaVista/Lycos days, like, where, you know, it’s still esoteric. It’s still niche. We’re still not getting what we need out of these models. The models are going to change. People are going to get better at it. And part of what we’re arguing in our paper is that people can get better at it on their own. All they need is access and a few tries and they figure it out.

HUIZINGA: Right. You know what’s really funny is, I was trying to find some information about a paper on Sparks. That’s the Sparks paper. And I was doing some internet search, and I wasn’t getting what I wanted. And then I moved over to ChatGPT and put basically the same question, but it was a little more question-oriented instead of keywords, and it gave me everything I was looking for. And I thought, wow, that’s a huge leap from even … that I could use ChatGPT like a search engine only better. So … well, listen, anyone who’s ever listened to my podcast knows I’m borderline obsessed with thinking about unintended consequences of technical innovation, so I always ask what could possibly go wrong if you got everything right. But as I’ve said on this series before, one of the main mandates of AICE research is to identify unintended consequences and try to get ahead of them. So, David, rather than talking about the potential pitfalls of prompt engineering, instead talk about what we need to do to keep up with or keep ahead of the speeding train of generative AI. And by we, I mean you.

HOLTZ: Yeah, I mean, I think the thing to keep in mind, and I think this has come up a couple of times in this conversation already, is at least right now, and presumably for the foreseeable future, you know, generative AI is moving so fast and is also not a monolith, right. Like, I think we tend to talk about generative AI, but there’s different types of models, even within a particular class of models. There’s so many different models that are floating around out there. And so I think it’s important to just keep on sort of revisiting things that we think we already know, seeing if those things remain true. You know, I think from a research perspective, like, kind of, answering the same questions over and over with different models over time and seeing if the results stay the same. And I think that’s one of the big takeaways from, like, sort of, a policy or applications perspective from our research, as well, is that just generative AI is moving really quickly. These models are evolving, and the way that we interact with them, the way that we prompt them, needs to change. So if you think about it, you know, there are many tech companies, many startups, that are building products or building entire, you know, companies on, basically, on top of API calls to OpenAI or to Anthropic or something like that. And behind the scenes, those models are changing all the time, whether it’s, you know, sort of a publicly announced shift from GPT-3.5 to GPT-4 or whether it’s the fact that maybe, you know, GPT-4 is kind of being tweaked and adjusted, you know, every couple of weeks based on things that are happening internally at the company. And one of the takeaways from our research is that, you know, all those tweaks are actually pretty meaningful. The prompts that you wrote two weeks ago might not be as effective, you know, today if they aren’t as well suited to the newest, latest, greatest model. 
And so I think just being really cognizant of that moving target, of the fact that we are living through, sort of, like, very exciting, unprecedented, crazy times and kind of just staying alert and staying on our toes is I think probably the most important thing.

HUIZINGA: Yeah. You know, when I was thinking about that question, I, my mind went to the Wallace & Gromit … I don’t know if you’re familiar with those animations, but there’s a scene where they’re on a toy train track chasing a criminal penguin, and they run out of track and then Gromit miraculously finds spare track. He starts laying it as the train is going. And it sort of feels like there’s a little bit of that in your research! [LAUGHS] I usually ask my guests on Collaborators where their research is on the spectrum from lab to life. But you’ve actually completed this particular study, and it leans more toward policy than product. And again, we’ve talked about a lot of this. Sometimes there seems to be a Venn diagram overlap with my questions. But, Sid, I want to know from your perspective, what would be a good outcome for this particular study, in your mind?

SURI: So AI systems are more and more being embedded in the workflows of companies and institutions. It used to just be all software, but now it’s specifically custom-built software, AI systems, and their prompts. I see it all the time here at Microsoft. It’s part of our workflows. It’s part of our products. It’s part of our day-to-day life. And as the models are getting better and better and these prompts are sort of embedded in our systems, someone’s got to pay attention to those prompts to make sure they’re still behaving the way we thought they were because they were written for an older version, the model changed, and now is that new model interpreting that prompt in the same way? That’s one question. The second question is, well, the new model has new capabilities, so now can you boost these prompts to take advantage of those new capabilities, to get the full economic gain, the full productivity gain of these new models? So you want to get your value for your money, so you need to adjust your prompts in response to those new models to get the full value. And part of the point of this paper is that that’s actually not that big a deal. That, as the models get better and better, even when people don’t know about it, they can still take advantage of the new affordances, the new capabilities, even when they aren’t made aware that, hey, it does a different thing right now.

HUIZINGA: Interesting.

SURI: But the point we’re making with this paper is, you have to pay attention to that.

HUIZINGA: OK, it’s last word time and I want to go a little off script with you two for this show. NVIDIA’s co-founder and CEO Jensen Huang recently said, and I paraphrase Willie Nelson here, “Mamas don’t let your babies grow up to be coders.” In essence, he’s predicting that AI is going to do that for us in the future and people would be better served pursuing different educational priorities. So that’s a bold claim. Do you guys want to make a bold claim? Here’s your chance to make a pithy prediction from your perch in research. What’s something you think will be true some years out? You don’t have to say how many years, but that you might have been reluctant to say out loud for fear that it wouldn’t age well. Remember, this is a podcast, not a paper, so no one’s going to hold you to your word, but you might end up being prophetic. Who knows? David, you go first, and then Sid can close the show. Tell us what’s going to happen in the future.

HOLTZ: I’m not sure how bold of a prediction this is, but I think there’s a lot of concern right now about the impact that AI will have in various creative domains, right. As generative AI gets better and AI can produce images and music and videos, you know, what will happen to all of the people that have been making a living creating this type of content? And my belief is that, if anything, as we just get flooded with more and more AI-generated content, people are going to place a really heavy premium on content that is produced by humans. Like, I think so much of what people value about art and creative output is the sort of human connection and the idea that something sort of emerged from someone’s lived experiences and hardships. I mean, this is why people really like reading, you know, the curator’s notes when they go to a museum, so that they can kind of understand what’s behind, you know, behind the image. And so I think generative AI is going to be really amazing in a lot of ways, and I think it will have really big impacts that we’ll need to deal with as a society in terms of how it affects work and things like that. But I don’t think that we’re moving towards a future where, you know, we’re all just consuming AI-generated, you know, art all the time and we don’t care at all about things being made by people.

HUIZINGA: You know, there’s a podcast called Acquired, and they talked about the brand Hermès, which is the French luxury leather company, and saying that to get a particular kind of bag that’s completely handmade, it’s an artifact from a human, that’s why you pay tens of thousands of dollars for those instead of a bag that comes off a factory line. So I like that. Sid, what do you think?

SURI: So I’m going to make two points. David made the argument about AI affecting the creative space. I want to zoom in on the knowledge workspace.

HUIZINGA: Hmm …

SURI: And one of the big issues in knowledge work today is it’s incredibly difficult still to get insights out of data. To give you an example, in the remote work study that David and I did, it took a handful of PhDs, tons of data, two years, sophisticated statistical techniques to make sense of what is the effect of remote work on information workers, OK? And I feel, where I see knowledge work going is there’s going to be this great democratization on how to get insights out of data. These models are very good at classifying things, summarizing things, categorizing things. Massive amounts of data. In the old days, you had to like basically be an advanced statistician, be an advanced machine learning person, train one of these models. They’re very esoteric. They’re very arcane. They’re very hard to use. And then unleash it on your data. Now if you just know how to prompt a little bit, you can get these same insights as a professional statistician would a few years ago in a much, much shorter time, you know, one-tenth of the time. So I feel like there’s going to be this great democratization of getting insights out of data in the knowledge workspace. That’s prediction number one. And then the second point I wanted to make, and I want to give a little credit to some of the academics who’ve inspired this notion, which is Erik Brynjolfsson and David Autor, and that is this: I think a lot of people are looking for the impact of AI in kind of the wrong way. Rewind in your mind back to the time when, like, the internal combustion engine was invented. OK, so we used to get around with horses; now we have cars. OK, horses went 20 miles an hour; cars go 40 miles an hour. OK, big deal. What no one foresaw was there’s going to be an entire aviation industry that’s going to make it possible to do things we couldn’t do before, speed up the economy, speed up everything, add trillions of dollars of value to the world. 
And I feel like right now everyone’s focusing on AI to do things we already know how to do. And I don’t think that’s the most interesting use case. Let’s instead turn our attention to, what could we not do before that we can do now?

HUIZINGA: Right.

SURI: And that’s where the really exciting stuff is. So those are the two points I’d like to leave you.

HUIZINGA: I love it. I hope you’re not saying that I could rewind my mind to when the internal combustion engine was developed …

SURI: No, no. Present company excluded! [LAUGHTER]

HUIZINGA: Oh my gosh. Sid Suri, David Holtz, this has been fantastic. I can’t get the phrase “AI whisperer” out of my head now, [LAUGHTER] and I think that’s what I want to be when I grow up. So thanks for coming on the show to share your insights on the topic and help to illuminate the path. This is awesome.

SURI: Thank you.

HOLTZ: Well, thank you.

SURI: That was fun.

[MUSIC FADES]

The post Collaborators: Prompt engineering with Siddharth Suri and David Holtz appeared first on Microsoft Research.

From static prediction to dynamic characterization: AI2BMD advances protein dynamics with ab initio accuracy

The essence of the biological world lies in the ever-changing nature of its molecules and their interactions. Understanding the dynamics and interactions of biomolecules is crucial for deciphering the mechanisms behind biological processes and for developing biomaterials and drugs. As Richard Feynman famously said, “Everything that living things do can be understood in terms of the jigglings and wigglings of atoms.” Yet capturing these real-life movements is nearly impossible through experiments. 

In recent years, deep learning methods such as AlphaFold and RoseTTAFold have made it possible to predict static protein crystal structures with experimental accuracy (as recognized by the 2024 Nobel Prize in Chemistry). However, accurately characterizing dynamics at atomic resolution remains much more challenging, especially when proteins carry out their functions and interact with other biomolecules or drug molecules.

As one approach, Molecular Dynamics (MD) simulation combines the laws of physics with numerical simulations to tackle the challenge of understanding biomolecular dynamics. This method has been widely used for decades to explore the relationship between the movements of molecules and their biological functions. In fact, the significance of MD simulations was underscored when the classic version of this technique was recognized with a Nobel Prize in 2013, highlighting its crucial role in advancing our understanding of complex biological systems. Similarly, the quantum mechanical approach known as Density Functional Theory (DFT) received its own Nobel Prize in 1998, marking a pivotal moment in computational chemistry.

In MD simulations, the time-dependent motions of biomolecules are modeled at the atomic level by numerically solving the equations of motion that govern the system’s time evolution, from which kinetic and thermodynamic properties can be computed. If you think of proteins like intricate gears in a clock, AI2BMD doesn’t just capture them in place; it watches them spin, revealing how their movements drive the complex processes that keep life running.
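To make "numerically solving the equations of motion" concrete, here is a minimal sketch of one common integration scheme, velocity Verlet, applied to a toy particle on a spring. This is an illustration only and not AI2BMD's integrator: the function names, the spring constant `K`, the time step, and the one-dimensional toy system are all assumptions chosen for the example.

```python
import numpy as np

def velocity_verlet(positions, velocities, masses, force_fn, dt, n_steps):
    """Integrate Newton's equations of motion with the velocity Verlet scheme.

    `masses` must broadcast against `positions` (same shape in this toy setup).
    `force_fn` is a hypothetical callable mapping positions to forces.
    """
    x = np.asarray(positions, dtype=float).copy()
    v = np.asarray(velocities, dtype=float).copy()
    m = np.asarray(masses, dtype=float)
    f = force_fn(x)
    trajectory = [x.copy()]
    for _ in range(n_steps):
        v += 0.5 * dt * f / m   # first half-step velocity update
        x += dt * v             # full-step position update
        f = force_fn(x)         # forces at the new positions
        v += 0.5 * dt * f / m   # second half-step velocity update
        trajectory.append(x.copy())
    return np.array(trajectory), v

K = 1.0  # illustrative spring constant (assumption)

def harmonic_force(x):
    """Force of a harmonic spring centred at the origin."""
    return -K * x

def total_energy(x, v, m):
    """Kinetic plus potential energy of the toy system."""
    return float(np.sum(0.5 * m * v**2) + np.sum(0.5 * K * x**2))

mass = np.array([1.0])
traj, v_final = velocity_verlet([1.0], [0.0], mass, harmonic_force, dt=0.01, n_steps=1000)
e_start = total_energy(traj[0], np.array([0.0]), mass)
e_end = total_energy(traj[-1], v_final, mass)
```

In a real MD engine the force function would come from a force field or, as in the approach described here, from a learned potential. Comparing `e_start` and `e_end` is a quick sanity check that the integrator approximately conserves energy on this toy system.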

MD simulations can be roughly divided into two classes: classical MD and quantum mechanical approaches. Classical MD employs simplified representations of molecular systems, which makes it fast enough to simulate long-timescale conformational changes but less accurate. In contrast, quantum mechanical models, such as Density Functional Theory, provide ground-up calculations but are computationally prohibitive for large biomolecules.

Ab initio biomolecular dynamics simulation by AI 

Microsoft Research has been developing efficient methods that aim for ab initio accuracy in simulations of biomolecules. The resulting method, AI2BMD (AI-based ab initio biomolecular dynamics system), has been published in the journal Nature, representing the culmination of a four-year research endeavor.

AI2BMD efficiently simulates a wide range of proteins with more than 10,000 atoms in all-atom resolution at approximately ab initio (first-principles) accuracy. It thus strikes a tradeoff that was previously inaccessible to standard simulation techniques: it achieves higher accuracy than classical simulation, at a computational cost that is higher than that of classical simulation but orders of magnitude lower than that of DFT. This development could unlock new capabilities in biomolecular modeling, especially for processes where high accuracy is needed, such as protein-drug interactions.

Figure 1. The flowchart of AI2BMD. Proteins are divided into protein units by a fragmentation process. The AI2BMD potential is designed based on ViSNet, and the datasets are generated at the DFT level. The potential calculates the energy and atomic forces for the whole protein. The AI2BMD simulation system is built upon these components and provides a generalizable solution for simulating the molecular dynamics of proteins. It achieves ab initio accuracy in energy and force calculations. Through comprehensive analysis from both kinetics and thermodynamics perspectives, AI2BMD exhibits good alignment with wet-lab experimental data and detects different phenomena compared to molecular mechanics.

AI2BMD employs a newly designed, generalizable protein fragmentation approach that splits proteins into overlapping units, creating a dataset of 20 million snapshots, the largest ever at the DFT level. Based on our previously designed ViSNet, a universal molecular geometry modeling foundation model published in Nature Communications and incorporated into the PyTorch Geometric library, we trained AI2BMD’s potential energy function using machine learning. Simulations are then performed by the highly efficient AI2BMD simulation system, in which, at each step, the ViSNet-based AI2BMD potential calculates the energy and atomic forces for the protein with ab initio accuracy. Comprehensive kinetic and thermodynamic analyses show that AI2BMD aligns much better with wet-lab data, such as the folding free energy of proteins, than classical MD does, and that it exhibits phenomena different from those seen in classical MD.
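The per-step loop described above, in which a potential supplies energy and forces and an integrator advances the atoms, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the AI2BMD implementation: the velocity Verlet integrator is a common choice assumed here, `harmonic_potential` is a toy stand-in for the learned ViSNet-based potential, and all function names are hypothetical.

```python
def harmonic_potential(positions, k=1.0):
    """Toy stand-in for a learned potential: returns (energy, forces)."""
    energy = 0.5 * k * sum(x * x for x in positions)
    forces = [-k * x for x in positions]
    return energy, forces


def velocity_verlet(positions, velocities, masses, potential, dt, n_steps):
    """Advance the system n_steps, querying `potential` once per step.

    positions, velocities, and masses are flat lists with one entry per
    coordinate (an assumption made to keep the sketch dependency-free).
    """
    positions, velocities = list(positions), list(velocities)
    energy, forces = potential(positions)
    for _ in range(n_steps):
        # Half-step velocity update with the current forces.
        velocities = [v + 0.5 * dt * f / m
                      for v, f, m in zip(velocities, forces, masses)]
        # Full-step position update.
        positions = [x + dt * v for x, v in zip(positions, velocities)]
        # The potential supplies energy and forces at the new positions.
        energy, forces = potential(positions)
        # Second half-step velocity update with the new forces.
        velocities = [v + 0.5 * dt * f / m
                      for v, f, m in zip(velocities, forces, masses)]
    return positions, velocities, energy


def kinetic_energy(velocities, masses):
    """Sum of (1/2) m v^2 over all coordinates."""
    return 0.5 * sum(m * v * v for v, m in zip(velocities, masses))
```

Because the integrator only sees a callable that returns energy and forces, swapping the toy potential for a learned one would not change the loop. A quick sanity check for such a loop is that total energy (potential plus kinetic) stays nearly constant over many steps.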


Advancing biomolecular MD simulation

AI2BMD represents a significant advancement in the field of MD simulations from the following aspects: 

(1) Ab initio accuracy: AI2BMD introduces a generalizable “machine learning force field,” a machine-learned model of the interactions between atoms and molecules, for full-atom protein dynamics simulations with ab initio accuracy.

Figure 2. Evaluation of energy and force calculations by AI2BMD and molecular mechanics (MM) for different proteins. The upper panel shows the folded structures of the four evaluated proteins. The lower panel shows the mean absolute error (MAE) of potential energy.
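For readers unfamiliar with the metric in Figure 2, the mean absolute error is the average of the absolute differences between predicted and reference values. The sketch below uses a hypothetical helper name and made-up numbers purely for illustration; it does not reproduce any value from the paper.

```python
def mean_absolute_error(predicted, reference):
    """Average absolute difference between two equal-length sequences."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        raise ValueError("at least one value is required")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Made-up energies for three snapshots (illustrative only).
model_energies = [1.0, 2.0, 4.0]
reference_energies = [1.5, 2.0, 3.0]
print(mean_absolute_error(model_energies, reference_energies))  # 0.5
```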

(2) Addressing generalization: AI2BMD is the first to address the generalization challenge of a machine-learned force field for simulating protein dynamics, demonstrating robust ab initio MD simulations for a variety of proteins.

(3) General compatibility: AI2BMD expands quantum mechanics (QM) modeling from small, localized regions to entire proteins without requiring any prior knowledge about the protein. This eliminates the potential incompatibility between QM and MM calculations for proteins and accelerates QM region calculation by several orders of magnitude, making near-ab initio calculation for full-atom proteins a reality. Consequently, AI2BMD paves the way for numerous downstream applications and offers a fresh perspective on characterizing complex biomolecular dynamics.

(4) Speed advantage: AI2BMD is several orders of magnitude faster than DFT and other quantum mechanics methods. It supports ab initio calculations for proteins with more than 10,000 atoms, making it one of the fastest AI-driven MD simulation programs among multidisciplinary fields.

Figure 3. Comparison of time consumption between AI2BMD, DFT, and other AI-driven simulation software. The left panel shows the time consumption of AI2BMD and DFT. The right panel shows the time consumption of AI2BMD, DPMD, and Allegro.

(5) Diverse conformational space exploration: In the protein folding and unfolding processes simulated by both AI2BMD and MM, AI2BMD explores more of the possible conformational space, including regions that MM cannot detect. AI2BMD therefore opens more opportunities to study flexible protein motions in drug-target binding, enzyme catalysis, allosteric regulation, intrinsically disordered proteins, and so on, aligning better with wet-lab experiments and providing more comprehensive explanations and guidance for biomechanism detection and drug discovery.

Figure 4. Analysis of the simulation trajectories performed by AI2BMD. In the upper panel, AI2BMD folds the protein Chignolin starting from an unfolded structure and achieves a smaller energy error than MM. In the lower panel, it explores more conformational regions that MM cannot detect.

(6) Experimental agreement: AI2BMD outperforms the QM/MM hybrid approach and demonstrates high consistency with wet-lab experiments on different biological application scenarios, including J-coupling, enthalpy, heat capacity, folding free energy, melting temperature, and pKa calculations.

Looking ahead

Achieving ab initio accuracy in biomolecular simulations is challenging but holds great potential for understanding the mystery of biological systems and designing new biomaterials and drugs. This breakthrough is a testament to the vision of AI for Science—an initiative to channel the capabilities of artificial intelligence to revolutionize scientific inquiry. The proposed framework aims to address limitations regarding accuracy, robustness, and generalization in the application of machine learning force fields. AI2BMD provides generalizability, adaptability, and versatility in simulating various protein systems by considering the fundamental structure of proteins, namely stretches of amino acids. This approach enhances energy and force calculations as well as the estimation of kinetic and thermodynamic properties. 

One key application of AI2BMD is its ability to perform highly accurate virtual screening for drug discovery. In 2023, at the inaugural Global AI Drug Development competition, AI2BMD made a breakthrough by predicting a chemical compound that binds to the main protease of SARS-CoV-2. Its precise predictions surpassed those of all other competitors, securing first place and showcasing its immense potential to accelerate real-world drug discovery efforts.

Since 2022, Microsoft Research has also partnered with the Global Health Drug Discovery Institute (GHDDI), a nonprofit research institute founded and supported by the Gates Foundation, to apply AI technology to design drugs that treat diseases that disproportionately affect low- and middle-income countries (LMIC), such as tuberculosis and malaria. We are now collaborating closely with GHDDI to leverage AI2BMD and other AI capabilities to accelerate the drug discovery process.

AI2BMD can help advance solutions to scientific problems and enable new biomedical research in drug discovery, protein design, and enzyme engineering.  

The post From static prediction to dynamic characterization: AI2BMD advances protein dynamics with ab initio accuracy appeared first on Microsoft Research.

Abstracts: November 5, 2024

Outlined illustrations of Chris Hawblitzel and Jay Lorch for the Microsoft Research Podcast, Abstracts series.

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Microsoft senior principal researchers Chris Hawblitzel and Jay Lorch join host Amber Tingle to discuss “Verus: A Practical Foundation for Systems Verification,” which received the Distinguished Artifact Award at this year’s Symposium on Operating Systems Principles, or SOSP. In their research, Hawblitzel, Lorch, and their coauthors leverage advances in programming languages and formal verification with two aims. The first aim is to help make software verification more accessible for systems developers so they can demonstrate their code will behave as intended. The second aim is to provide the research community with sound groundwork to tackle the application of formal verification to large, complex systems. 

Transcript 

[MUSIC] 

AMBER TINGLE: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Amber Tingle. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers. 

[MUSIC FADES] 

Our guests today are Chris Hawblitzel and Jay Lorch. They are both senior principal researchers at Microsoft and two of the coauthors on a paper called “Verus: A Practical Foundation for Systems Verification.” This work received the Distinguished Artifact Award at the 30th Symposium on Operating Systems Principles, also known as SOSP, which is happening right now in Austin, Texas. Chris and Jay, thank you for joining us today for Abstracts and congratulations!

JAY LORCH: Thank you for having us. 

CHRIS HAWBLITZEL: Glad to be here. 

TINGLE: Chris, let’s start with an overview. What problem does this research address, and why is Verus something that the broader research community should know about? 


HAWBLITZEL: So what we’re trying to address is a very simple problem where we’re trying to help developers write software that doesn’t have bugs in it. And we’re trying to provide a tool with Verus that will help developers show that their code actually behaves the way it’s supposed to; it obeys some sort of specification for what the program is supposed to do. 

TINGLE: How does this publication build on or differ from other research in this field, including your previous Verus-related work? 

HAWBLITZEL: So formal verification is a process where you write down what it is that you want your program to do in mathematical terms. So if you’re writing an algorithm to sort a list, for example, you might say that the output of this algorithm should be a new list that is a rearrangement of the elements of the old list, but now this rearrangement should be in sorted order. So you can write that down using standard mathematics. And now given that mathematical specification, the challenge is to prove that your piece of software written in a particular language, like Java or C# or Rust, actually generates an output that meets that mathematical specification. So this idea of using verification to prove that your software obeys some sort of specification, this has been around for a long time, so, you know, even Alan Turing talked about ways of doing this many, many decades ago. The challenge has always been that it’s really hard to develop these proofs for any large piece of software. It simply takes a long time for a human being to write down a proof of correctness of their software. And so what we’re trying to do is to build on earlier work in verification and recent developments in programming languages to try to make this as easy as possible and to try to make it as accessible to ordinary software developers as possible. So we’ve been using existing tools. There are automated theorem provers—one of them from Microsoft Research called Z3—where you give it a mathematical formula and ask it to prove that the formula is valid. We’re building on that. And we’re also taking a lot of inspiration from tools developed at Microsoft Research and elsewhere, like Dafny and F* and so on, that we’ve used in the past for our previous verification projects. And we’re trying to take ideas from those and make them accessible to developers who are using common programming languages. In this case, the Rust programming language is what we’re focusing on. 
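The sorting specification Hawblitzel describes (the output is a rearrangement of the input, and it is in sorted order) can be written down as an executable check. The sketch below is plain Python rather than Verus syntax, and the name `meets_sort_spec` is made up for illustration. It checks one input/output pair at run time, whereas a verifier aims to prove the property for every possible input before the program runs.

```python
from collections import Counter


def meets_sort_spec(original, result):
    """Return True if `result` satisfies the sorting specification.

    The specification has two parts:
      1. `result` is a rearrangement of `original` (same elements, same counts).
      2. `result` is in non-decreasing order.
    """
    is_rearrangement = Counter(original) == Counter(result)
    is_sorted = all(a <= b for a, b in zip(result, result[1:]))
    return is_rearrangement and is_sorted


data = [3, 1, 2, 1]
print(meets_sort_spec(data, sorted(data)))   # True
print(meets_sort_spec(data, [1, 2, 3]))      # False: an element was dropped
print(meets_sort_spec(data, [3, 1, 2, 1]))   # False: not in sorted order
```

Both halves of the specification matter: an implementation that returns an empty list is trivially sorted, and one that returns its input unchanged is trivially a rearrangement.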

TINGLE: Jay, could you describe your methodology for us and maybe share a bit about how you and your coauthors tested the robustness of Verus.

LORCH: So the question we really want to answer is, is Verus suitable for systems programming? So that means a variety of things. Is it amenable to a variety of kinds of software that you want to build as part of a system? Is it usable by developers? Can they produce compact proofs? And can they get timely feedback about those proofs? Can the verifier tell you quickly that your proof is correct or, if it’s wrong, that it’s wrong and guide you to fix it? So the main two methodological techniques we used were millibenchmarks and full systems. So the millibenchmarks are small pieces of programs that have been verified by other tools in the past, and we built them in Verus and compared to what other tools would do to find whether we could improve usability. And we found generally that we could verify the same things but with more compact proofs and proofs that would give much snappier feedback. The difference between one second and 10 seconds might not seem a lot, but when you’re writing code and working with the verifier, it’s much nicer to get immediate feedback about what is wrong with your proof so you can say, oh, what about this? And it can say, oh, well, I still see a problem there. And you could say, OK, let me fix that. As opposed to waiting 10, 20 seconds between each such query to the verifier. So the millibenchmarks helped us evaluate that. And the macrobenchmarks, the building entire systems, we built a couple of distributed systems that had been verified before—a key value store and a node replication system—to show that you could do them more effectively and with less verification time. We also built some new systems, a verified OS page table, a memory allocator, and a persistent memory append-only log. 

TINGLE: Chris, the paper mentions that successfully verifying system software has required—you actually use the word heroic to describe the developer effort. Thinking of those heroes in the developer community and perhaps others, what real-world impact do you expect Verus to have? What kind of gains are we talking about here? 

HAWBLITZEL: Yeah, so I think, you know, traditionally verification or this formal software verification that we’re doing has been considered a little bit of a pie-in-the-sky research agenda. Something that people have applied to small research problems but has not necessarily had a real-world impact before. And so I think it’s just, you know, recently, in the last 10 or 15 years, that we started to see a change in this and started to see verified software actually deployed in practice. So on one of our previous projects, we worked on verifying the cryptographic primitives that people use when, say, they browse the web or something and their data is encrypted. So in these cryptographic primitives, there’s a very clear specification for exactly what bytes you’re supposed to produce when you encrypt some data. And the challenge is just writing software that actually performs those operations and does so efficiently. So in one of our previous projects that we worked on called HACL* and EverCrypt, we verified some of the most commonly used and efficient cryptographic primitives for things like encryption and hashing and so on. And these are things that are actually used on a day-to-day basis. So we, kind of, took from that experience that the tools that we’re building are getting ready for prime time here. We can actually verify software that is security critical, reliability critical, and is in use. So some of the things that Jay just mentioned, like verifying, you know, persistent memory storage systems and so on, those are the things that we’re looking at next for software that would really benefit from reliability and where we can formally prove that your data that’s written to disk is read correctly back from disk and not lost during a crash, for example. So that’s the kind of software that we’re looking to verify to try to have a real-world impact. 

LORCH: The way I see the real-world impact is it’s going to enable Microsoft to deal with a couple of challenges that are severe and increasing in scale. So the first challenge is attackers, and the second challenge is the vast scale at which we operate. There’s a lot of hackers out there with a lot of resources that are trying to get through our defenses, and every bug that we have offers them purchase, and techniques like this, that can get rid of bugs, allow us to deal with that increasing attacker capability. The other challenge we have is scale. We have billions of customers. We have vast amounts of data and compute power. And when you have a bug that you’ve thoroughly tested but then you run it on millions of computers over decades, those rare bugs eventually crop up. So they become a problem, and traditional testing has a lot of difficulty finding those. And this technology, which enables us to reason about the infinite possibilities in a finite amount of time and observe all possible ways that the system can go wrong and make sure that it can deal with them, that enables us to deal with the vast scale that Microsoft operates on today.

HAWBLITZEL: Yeah, and I think this is an important point that differentiates us from testing. Traditionally, you find a bug when you see that bug happen in running software. With formal verification, we’re catching the bugs before you run the software at all. We’re trying to prove that on all possible inputs, on all possible executions of the software, these bugs will not happen, and it’s much cheaper to fix bugs before you’ve deployed the software that has bugs, before attackers have tried to exploit those bugs. 

TINGLE: So, Jay, ideally, what would you like our listeners and your fellow SOSP conference attendees to tell their colleagues about Verus? What’s the key takeaway here? 

LORCH: I think the key takeaway is that it is possible now to build software without bugs, to build systems code that is going to obey its specification on all possible inputs always. We have that technology. And this is possible now because a lot of technology has advanced to the point where we can use it. So for one thing, there’s advances in programming languages. People are moving from C to Rust. They’ve discovered that you can get the high performance that you want for systems code without having to sacrifice the ability to reason about ownership and lifetimes, concurrency. The other thing that we build on is advances in computer-aided theorem proving. So we can really make compact and quick-to-verify mathematical descriptions of all possible behaviors of a program and get fast answers that allow us to rapidly turn around proof challenges from developers. 

TINGLE: Well, finally, Chris, what are some of the open questions or future opportunities for formal software verification research, and what might you and your collaborators tackle next? I heard a few of the things earlier. 

HAWBLITZEL: Yes, I think despite, you know, the effort that we and many other researchers have put into trying to make these tools more accessible, trying to make them easier to use, there still is a lot of work to prove a piece of software correct, even with advanced state-of-the-art tools. And so we’re still going to keep trying to push to make that easier. Trying to figure out how to automate the process better. There’s a lot of interest right now in artificial intelligence for trying to help with this, especially if you think about artificial intelligence actually writing software. You ask it to write a piece of software to do a particular task, and it generates some C code or some Rust code or some Java code, and then you hope that that’s correct because it could have generated any sort of code that performs the right thing or does total nonsense. So it would be really great going forward if when we ask AI to develop software, we also expect it to create a proof that the software is correct and does what the user asked for. We’ve started working on some projects, and we found that the AI is not quite there yet for realistic code. It can do small examples this way. But I think this is still a very large challenge going forward that could have a large payoff in the future if we can get AI to develop software and prove that the software is correct. 

LORCH: Yeah, I see there’s a lot of synergy between—potential synergy—between AI and verification. Artificial intelligence can solve one of the key challenges of verification, namely making it easy for developers to write that code. And verification can solve one of the key challenges of AI, which is hallucinations, synthesizing code that is not correct, and Verus can verify that that code actually is correct. 

TINGLE: Well, Chris Hawblitzel and Jay Lorch, thank you so much for joining us today on the Microsoft Research Podcast to discuss your work on Verus. 

[MUSIC] 

HAWBLITZEL: Thanks for having us. 

LORCH: Thank you. 

TINGLE: And to our listeners, we appreciate you, too. If you’d like to learn more about Verus, you’ll find a link to the paper at aka.ms/abstracts or you can read it on the SOSP website. Thanks for tuning in. I’m Amber Tingle, and we hope you’ll join us again for Abstracts.

[MUSIC FADES] 

The post Abstracts: November 5, 2024 appeared first on Microsoft Research.

Abstracts: November 4, 2024

Outlined illustrations of Shan Lu and Bogdan Stoica for the Microsoft Research Podcast.

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Senior Principal Research Manager Shan Lu and Bogdan Stoica, a PhD candidate at the University of Chicago, join host Gretchen Huizinga to discuss “If At First You Don’t Succeed, Try, Try, Again … ? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems.” In the paper, which was accepted at this year’s Symposium on Operating Systems Principles, or SOSP, Lu, Stoica, and their coauthors examine typical retry issues and present techniques that leverage traditional program analysis and large language models to help detect them.

Transcript

[MUSIC]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Today I’m talking to Dr. Shan Lu, a senior principal research manager at Microsoft Research, and Bogdan Stoica, also known as Bo, a doctoral candidate in computer science at the University of Chicago. Shan and Bogdan are coauthors of a paper called “If at First You Don’t Succeed, Try, Try, Again …? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems.” And this paper was presented at this year’s Symposium on Operating Systems Principles, or SOSP. Shan and Bo, thanks for joining us on Abstracts today!


SHAN LU: Thank you.

BOGDAN STOICA: Thanks for having us.

HUIZINGA: Shan, let’s kick things off with you. Give us a brief overview of your paper. What problem or issue does it address, and why should we care about it?

LU: Yeah, so basically from the title, we are looking at retry bugs in software systems. So what retry means is that people may not realize for big software like the ones that run in Microsoft, all kinds of unexpected failures—software failure, hardware failure—may happen. So just to make our software system robust, there’s often a retry mechanism built in. So if something unexpected happens, a task, a request, a job will be re-executed. And what this paper talks about is, it’s actually very difficult to implement this retry mechanism correctly. So in this paper, we do a study to understand what are typical retry problems and we offer a solution to detecting these problems.
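As a concrete picture of the retry mechanism Lu describes, where a task is re-executed when something unexpected happens, here is a minimal sketch. It is not taken from the paper: the `TransientError` class, the `run_with_retry` helper, the attempt cap, and the exponential backoff are all assumptions chosen to show a common shape for such code.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def run_with_retry(task, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call `task` until it succeeds or `max_attempts` is reached.

    Only TransientError triggers a retry; any other exception propagates
    immediately. The delay doubles after each failed attempt.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except TransientError as err:
            last_error = err
            if attempt < max_attempts:
                # Back off before the next attempt: base, 2x base, 4x base, ...
                sleep(base_delay * 2 ** (attempt - 1))
    # Every attempt failed with a transient error: surface the last one.
    raise last_error
```

Even in this small form there are several decisions to get right: which errors count as retryable, how many attempts to allow, how long to wait, and whether re-running the task is safe at all.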

HUIZINGA: Bo, this clearly isn’t a new problem. What research does your paper build on, and how does your research challenge or add to it?

STOICA: Right, so retry is a well-known mechanism and is widely used. And retry bugs, in particular, have been identified in other papers as root causes for all sorts of failures but never have been studied as a standalone class of bugs. And what I mean by that, nobody looked into, why is it so difficult to implement retry? What are the symptoms that occur when you don’t implement retry correctly? What are the causes of why developers struggle to implement retry correctly? We built on a few key bug-finding ideas that have been looked at by other papers but never in this context. We use fault injection. We repurpose existing unit tests to trigger this type of bugs as opposed to asking developers to write specialized tests to trigger retry bugs. So we’re, kind of, making the developer’s job easier in a sense. And in this pipeline, we also rely on large language models to augment the program and the code analysis that goes behind the fault injection and the reutilization of existing tests.

HUIZINGA: Have large language models not been utilized much in this arena?

LU: I want to say that, you know, actually this work was started about two years ago. And at that time, large language model was really in its infancy and people just started exploring what large language model can help us in terms of improving software reliability. And our group, and together with, you know, actually same set of authors from Microsoft Research, we actually did some of the first things in a workshop paper just to see what kind of things that we were able to do before like, you know, finding bugs can now be replicated by using large language model.

HUIZINGA: OK …

LU: But at that time, we were not very happy because, you know, just use large language model to do something people were able to do using traditional program analysis, I mean, it seems cool, right, but does not add new functionality. So I would say what is new, at least when we started this project, is we were really thinking, hey, are there anything, right, are there some program analysis, are there some bug finding that we were not able to do using traditional program analysis but actually can be enabled by large language model.

HUIZINGA: Gotcha …

LU: And so that was at, you know, what I feel like was novel at least, you know, when we worked on this. But of course, you know, large language model is a field that is moving so fast. People are, you know, finding new ways to using it every day. So yeah.

HUIZINGA: Right. Well, in your paper, you say that retry functionality is commonly undertested and thus prone to problems slipping into production. Why would it be undertested if it’s such a problem?

STOICA: So testing retry is difficult because what you need is to simulate the systemwide conditions that lead to retry. That often means simulating external transient errors that might happen on the system that runs your application. And to do this during testing and capture this in a small unit test is difficult.

LU: I think, actually, Bogdan said this very well. It’s like, why do we need a retry? It’s, like, when unexpected failure happen, right. And this is, like, something like Bogdan mentioned, like external transient error such as my network card suddenly does not work, right. And this may occur, you know, only for, say, one second and then it goes back on. But this one second may cause some job to fail and need retry. So during normal testing, these kind of unexpected things rarely, rarely happen, if at all, and it’s also difficult to simulate. That’s why it’s just not well tested.

HUIZINGA: Well, Shan, let’s talk about methodology. Talk a bit about how you tackled this work and why you chose the approach you did for this particular problem.

LU: Yeah, so I think this work includes two parts. One is a systematic study. We study several big open-source systems to see whether there are retry-related problems in this real system. Of course there are. And then we did a very systematic categorization to understand the common characteristics. And the second part is about, you know, detecting. And in terms of method, we have used, particularly in the detecting part, we actually used a hybrid of techniques of traditional static program analysis. We used this large language model-enabled program analysis. In this case, imagine we just asked a large language model saying, hey, tell us, are there any retry implemented in this code? If there is, where it is, right. And then we also use, as Bogdan mentioned, we repurposed unit test to help us to execute, you know, the part of code that large language model tell us there may be a retry. And in addition to that, we also used fault injection, which means we simulate those transient, external, environmental failures such as network failures that very rarely would occur by themselves.
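The combination Lu describes, repurposing an existing unit test and injecting a transient fault at a retry location, can be illustrated with a toy example. Everything here is hypothetical: the `Store` class, the `inject_transient_faults` wrapper, and the test are invented for illustration, and the injection point is picked by hand, whereas the tooling discussed in the interview relies on program analysis and a large language model to locate retry code.

```python
class TransientError(Exception):
    """Hypothetical marker for a transient, retryable failure."""


def inject_transient_faults(func, failures, after_effect=False):
    """Wrap `func` so that its first `failures` calls raise TransientError.

    With after_effect=True the wrapped call runs first and then fails,
    imitating an operation that took effect but whose reply was lost.
    """
    state = {"remaining": failures}

    def wrapper(*args, **kwargs):
        if state["remaining"] > 0:
            state["remaining"] -= 1
            if after_effect:
                func(*args, **kwargs)
            raise TransientError("injected fault")
        return func(*args, **kwargs)

    return wrapper


def plain_write(log, item):
    """The fault-free write used in normal test runs."""
    log.append(item)


class Store:
    """Toy system under test: retries a write that may fail transiently."""

    def __init__(self, write=plain_write):
        self.write = write
        self.log = []

    def append(self, item, max_attempts=3):
        for _ in range(max_attempts):
            try:
                self.write(self.log, item)
                return True
            except TransientError:
                continue  # retry without checking whether the write landed
        return False


def test_append(write=plain_write):
    """An ordinary unit test; passing a faulty `write` repurposes it."""
    store = Store(write)
    assert store.append("a")
    assert store.log == ["a"]
```

Run as written, `test_append()` passes and never touches the retry path. Calling it with `inject_transient_faults(plain_write, failures=2)` exercises the retry loop and still passes. Calling it with `failures=1, after_effect=True` makes the final assertion fail, because the retried write is applied twice: the kind of problem that rarely shows up unless the failure is simulated.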

HUIZINGA: Well, Bo, I love the part in every paper where the researchers say, “And what we found was …” So tell us, what did you find?

STOICA: Well, we found that implementing retry is difficult and complex! Not only find new bugs because, yes, that was kind of the end goal of the paper but also try to understand why these bugs are happening. As Shan mentioned, we started this project with a bug study. We looked at retry bugs across eight to 10 applications that are widely popular, widely used, and that the community is actively contributing to them. And the experiences of both users and developers, if we can condense that—what do you think about retries?—is that, yeah, they’re frustrated because it’s a simple mechanism, but there’s so many pitfalls that you have to be aware of. So I think that’s the biggest takeaway. Another takeaway is that when I was thinking about bug-finding tools, I was having this somewhat myopic view of, you know, you instrument at the program statement level, you figure out relationships between different lines of code and anti-patterns, and then you build your tools to find those anti-patterns. Well, with retry, this kind of gets thrown out the window because retry is a mechanism. It’s not just one line of code. It is multiple lines of code that span multiple functions, multiple methods, and multiple files. And you need to think about retry holistically to find these issues. And that’s one of the reasons we used large language models, because traditional static analysis or traditional program analysis cannot capture this. And, you know, large language models turns out to be actually great at this task, and we try to harness the, I would say, fuzzy code comprehension capabilities of large language models to help us find retry bugs.

HUIZINGA: Well, Shan, research findings are important, but real-world impact is the ultimate goal here. So who will this research help most and why?

LU: Yeah, that’s a great question. I would consider several groups of people. One is hopefully, you know, people who actually build, design real systems will find our study interesting. I hope it will resonate with them about those difficulties in implementing retry because we studied a set of systems and there was a little bit of comparison about how different retry mechanisms are actually used in different systems. And you can actually see that, you know, this different mechanism, you know, they have pros and cons, and we have a little bit of, you know, suggestion about what might be good practice. That’s the first group. The second group is, our tool actually did find, I would say, a relatively large number of retry problems in the latest version of every system we tried, and we find these problems, right, by repurposing existing unit tests. So I hope our tool will be used, you know, in the field by, you know, being maybe integrated with future unit testing so that our future system will become more robust. And I guess the third type of, you know, audience I feel like may benefit by reading our work, knowing our work: the people who are thinking about how to use large language model. And as I mentioned, I think a takeaway is large language model can repeat, can replace some of the things we were able to do using traditional program analysis and it can do more, right, for those fuzzy code comprehension–related things. Because for traditional program analysis, we need to precisely describe what I want. Like, oh, I need a loop. I need a WRITE statement, right. For large language model, it’s imprecise by nature, and that imprecision sometimes actually matches with the type of things we’re looking for.

HUIZINGA: Interesting. Well, both of you have just, sort of, addressed nuggets of this research. And so the question that I normally ask now is, if there’s one thing you want our listeners to take away from the work, what would it be? So let’s give it a try and say, OK, in a sentence or less, if I’m reading this paper and it matters to me, what’s my big takeaway? What is my big “aha” that this research helps me with?

STOICA: So the biggest takeaway of this paper is not to be afraid to integrate large language models in your bug-finding or testing pipelines. And I’m saying this knowing full well how imprecise large language models can be. But as long as you can trust but verify, as long as you have a way of checking what these models are outputting, you can effectively insert them into your testing framework. And I think this paper is showing one use case and bring us closer to, you know, having it integrated more ubiquitously.

HUIZINGA: Well, Shan, let’s finish up with ongoing research challenges and open questions in this field. I think you’ve both alluded to the difficulties that you face. Tell us what’s up next on your research agenda in this field.

LU: Yeah, so for me, personally, I mean, I learned a lot from this project and particularly this idea of leveraging large language model but also as a way to validate its result. I’m actually working on how to leverage large language model to verify the correctness of code, code that may be generated by large language model itself. So it’s not exactly, you know, a follow-up of this work, but I would say at idea, you know, philosophical level, it is something that is along this line of, you know, leverage large language model, leverage its creativity, leverage its … sometimes, you know … leverage its imprecision but has a way, you know, to control it, to verify it. That’s what I’m working on now.

HUIZINGA: Yeah … Bo, you’re finishing up your doctorate. What’s next on your agenda?

STOICA: So we’re thinking of, as Shan mentioned, exploring what large language models can do in this bug-finding/testing arena further and harvesting their imprecision. I think there are a lot of great problems that traditional code analysis has tried to tackle, but it was difficult. So in that regard, we’re looking at performance issues and how large language models can help identify and diagnose those issues because my PhD was mostly focused, up until this point, on correctness. And I think performance inefficiencies are such a wider field and with a lot of exciting problems. And they do have this inherent imprecision and fuzziness to them that also large language models have, so I hope that combining the two imprecisions maybe gives us something a little bit more precise.

HUIZINGA: Well, this is important research and very, very interesting.

[MUSIC]

Shan Lu, Bogdan Stoica, thanks for joining us today. And to our listeners, thanks for tuning in. If you’re interested in learning more about this paper, you can find a link at aka.ms/abstracts. And you can also find it on the SOSP website. See you next time on Abstracts!

[MUSIC FADES]

The post Abstracts: November 4, 2024 appeared first on Microsoft Research.

AI-powered microgrids facilitate energy resilience and equity in regional communities

The rise of affordable small-scale renewable energy, like rooftop solar panels, is reshaping energy systems around the world. This shift away from fossil fuel-powered grids creates new opportunities for energy distribution that prioritize decentralized energy ownership and community empowerment. Despite this progress, centralized energy systems still dominate, often failing to provide vulnerable communities with reliable, affordable renewable energy. In response, Microsoft researchers are collaborating with local communities to explore how AI can enable community-scale energy solutions focused on energy availability and equity as well as decarbonization.

AI-powered microgrids support resilient communities

Microgrids, small and localized energy systems, hold promise as a solution to the challenges of centralized energy systems. These microgrids can operate independently from the larger grid, providing participants with resilience and control. Figure 1 shows how these systems integrate renewable energy sources and storage to efficiently manage local energy needs.

Figure 1: The image shows a microgrid system with interconnected assets, including rooftop solar panels, battery storage locations, electric vehicle chargers, wind turbines, and large solar farms, all supporting a small community and tied to the central power grid.
Figure 1. An example of the decentralized nature of a microgrid power system

AI improves energy reliability by integrating data on energy consumption, market prices, and weather forecasts. Forecasts matter especially for wind and solar power, which depend on weather conditions. Advanced forecasting predicts renewable energy availability, while AI-driven analytics determine when to generate, store, or sell electricity. This increases efficiency and stabilizes the grid by balancing supply and demand.

When powered by AI, microgrids can also contribute to energy equity. In many rural parts of the US, flat-rate billing models are still common, often leading to unfair pricing. AI-enabled microgrids provide an alternative by allowing communities to pay only for the energy they use. By analyzing consumption patterns, AI can ensure optimized distribution that promotes equitable pricing and access. These systems also improve resilience during crises, enabling communities to manage energy distribution more effectively and reduce reliance on centralized utilities. AI allows microgrids to predict energy demands, identify system vulnerabilities, and recover quickly during outages.

Evaluating AI’s impact on microgrid efficiency and equity

To explore AI’s potential in improving efficiency and equity in energy management, a team of Microsoft researchers collaborated with community organizations on simulations and a case study. They built a tabletop simulator to test whether AI could effectively determine when to generate, store, or sell electricity based on real-time data. The AI model was optimized for resilience and efficiency, using reinforcement learning to control grid and battery processes, enabling microgrids to adapt to changing energy conditions and market dynamics.

This simulation used a theoretical model with external data to show how an AI-driven microgrid could autonomously buy and sell energy based on strategic design parameters. By controlling when the battery is charged and discharged based on energy production and consumption patterns, the model maximized efficiency and maintained local power availability. Figure 2 shows the AI-controlled grid’s optimal decisions using open-source data from the California Independent System Operator (CAISO), serving as a proof of concept (PoC) for AI-driven microgrids operating under real-world conditions.
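To make the charge, discharge, buy, and sell decisions concrete, here is a minimal sketch of one control step. It is not the reinforcement-learning controller described above: it substitutes a simple hand-written rule, and the names `Battery`, `dispatch_step`, `simulate`, and `price_threshold` are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Battery:
    capacity_kwh: float
    level_kwh: float = 0.0


def dispatch_step(battery, solar_kwh, load_kwh, price, price_threshold):
    """Run one time step of a rule-based stand-in controller.

    Returns net grid energy in kWh: positive means bought, negative means sold.
    """
    surplus = solar_kwh - load_kwh
    if surplus >= 0:
        if price >= price_threshold:
            # Assumed rule: when the price is high, sell all surplus solar.
            return -surplus
        # Otherwise store what fits in the battery and sell the remainder.
        room = battery.capacity_kwh - battery.level_kwh
        stored = min(surplus, room)
        battery.level_kwh += stored
        return stored - surplus
    # Solar does not cover the load: drain the battery first, then buy.
    deficit = -surplus
    discharged = min(deficit, battery.level_kwh)
    battery.level_kwh -= discharged
    return deficit - discharged


def simulate(battery, solar, load, prices, price_threshold):
    """Apply dispatch_step over aligned series and collect net grid energy."""
    return [
        dispatch_step(battery, s, l, p, price_threshold)
        for s, l, p in zip(solar, load, prices)
    ]
```

A learned controller would replace the fixed threshold rule with a policy trained on price, production, and consumption data, but the inputs and outputs of each step would look much the same.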

Figure 2 (A): Graph depicting peak and off-peak net power bought or sold over one week using simulations of the AI controller on historical CAISO data. The graph shows that when solar is available, more power is bought than sold, whereas at night the controller relies on energy stored in the battery to meet consumption, making fewer transactions.

Figure 2 (B): The graph shows battery levels on a simulated AI controller for the historical CAISO data. During peak hours, the battery discharges as reserves are sold, while solar power supplies the load. At night, the battery conserves power, minimizing purchases and optimizing reserves for daytime selling.
Figure 2. (A) Peak and off-peak net power bought or sold over one week using simulations of the AI controller on historical CAISO data. (B) Peak and off-peak battery levels over one week using simulations of the AI controller on historical CAISO data. 

Case study: AI-powered microgrid for community energy transition

Microsoft researchers, in partnership with community-based organizations Remix: The Soul of Innovation, Maverick IQ, and Ayika Solutions, are designing and implementing an AI-powered microgrid system in West Atlanta. Working closely with the Vicars Community Center (VCC) resilience hub, they aim to address challenges faced by the community due to rapid development. West Atlanta, like many Atlanta neighborhoods, faces rising housing prices and energy costs that disproportionately affect long-time residents. Communities relying on centralized grids are more vulnerable to outages, with slow recovery times, highlighting systemic inequalities in energy distribution.

The VCC resilience hub is tackling these issues by helping to establish a solar microgrid for the West Atlanta Watershed Alliance (WAWA) community farm and surrounding neighborhoods. Microsoft researchers and collaborators are integrating AI into the microgrid to achieve energy savings, improve resilience, and create local job opportunities. Figure 3 shows the VCC resilience hub and WAWA community farm powered by the microgrid, highlighting key infrastructure for installing distributed energy resources (DERs).

Figure 3 (A) and 3 (B) show pictures of the VCC resilience hub, with solar panels and batteries for energy storage.

Figure 3 (C) and 3 (D) show pictures of the community farm and volunteers at WAWA, a key center for the future of community agriculture to be supported by the microgrid.
Figure 3. A and B show the VCC resilience hub, with solar panels (left) and batteries for energy storage (right) – photographs by Erica Holloman-Hill. C and D show the WAWA community farm and community members holding freshly harvested crops. 

Project phases

Co-innovation design

Microsoft researchers, architects, and community partners held a participatory design session with state and utility representatives to define the project’s mission and key metrics. The CDC’s Social Vulnerability Index informed the site selection, supporting the project’s diversity, equity, and inclusion goals. 

Renewables and microgrid siting

A renewable siting survey conducted by community partners identified the VCC as a key resilience hub for solar panel and battery installation.

To deliver these benefits, the site first needed upgrades. Older homes required energy-efficiency improvements, such as electrical upgrades and better insulation, before they could be integrated into the microgrid. As a PoC, the team collaborated with community partners to modernize an older home with inefficient energy consumption. Sensors were installed to track energy usage and environmental conditions (Figure 4).

Figure 4: A graph showing the estimated cost of electricity per day for a legacy household in West Atlanta, based on kilowatt-hour usage between July 29 and August 13, 2023. The data validates the family’s experience of high energy bills, inefficient heating and cooling, and high humidity in the basement.
Figure 4. Estimated daily electricity costs based on a home’s kilowatt-hour usage between July 29 and August 13, 2023. The data confirms the residents’ experience of high energy bills, inefficient heating and cooling, and high humidity in the basement. Used by permission from Erica Holloman-Hill.

Students from Morehouse College used this data to create a digital twin of the home, which provided actionable insights (Figure 5). The analysis confirmed issues like high radon levels and energy drains from outdated appliances. Guided by these findings, the team upgraded the house into a “smart home” where AI monitors energy and environmental conditions, enabling it to join the microgrid and making it eligible for LEED certification.

Figure 5: 2 Figures showing snapshots of digital twin created for Dr. Erica Holloman-Hill’s home, provided by courtesy of Dr. Erica L Holloman-Hill, owner of Ayika Solutions Inc. The first figure shows the sensor readings of pollutants and weather in various parts of the home. The second figure shows the measurements in detail for the  basement. The detailed environmental data—including climatic conditions, appliance-level energy usage, and pollutant levels—provide actionable insights for identifying targeted areas for grid modernization.
Figure 5. Smart electrification: Snapshots of digital twin created for the PoC home. Panel A shows the digital twin for the entire home. Panel B shows detailed views for the first floor and basement, respectively. The detailed environmental data—including climatic conditions, appliance-level energy usage, and pollutant levels—provide actionable insights for identifying targeted areas for grid modernization. Used by permission from Erica Holloman-Hill.

Microgrid simulation phase

To prepare the AI-powered microgrid, Microsoft researchers built a simplified tabletop prototype simulating the setup using real data from the design and siting phases. This prototype demonstrated the control mechanism’s ability to manage DERs—solar panels, batteries, and appliances—and the interface between the microgrid and the larger grid. Figure 6 shows the tabletop model during prototyping.

Figure 7 illustrates the results of this simulation, showing power bought and sold and the battery charge-discharge profile. The AI controller made optimal buying and selling decisions, promoting efficiency and reliability.

Figure 7 (A): Graph depicting peak and off-peak net power bought or sold over one week using simulations of the AI controller on data generated during runs of the tabletop microgrid model. The graph shows that when solar is available, more power is bought than sold, whereas at night the controller relies on energy stored in the battery to meet consumption, making fewer transactions.

Figure 7 (B): The graph shows battery levels on a simulated microgrid controller powered by AI. During peak hours, the battery discharges as reserves are sold, while solar power supplies the load. At night, the battery conserves power, minimizing purchases and optimizing reserves for daytime selling.
Figure 7. (A) Peak and off-peak net power bought or sold over one week using AI-controller simulations. (B) Corresponding battery levels.

Erica Holloman-Hill, director of WAWA, CEO of Ayika Solutions and owner of the PoC home, reflected: “This study helped me understand how our home’s outdated condition affects our quality of life. Upgrading homes like mine could make a significant difference. Thanks to partnerships like this one, controlling and sharing the electricity the community generates is within reach, highlighting the potential of AI-supported technologies like microgrids for communities like ours.”

Building on the simulation’s success, the VCC resilience hub and local organizations are continuing to install solar panels to power the microgrid. AI will play a key role in siting and controlling the system as it expands. Efforts are also underway to establish sustainable financing models and assess homes for modernization to enable broader participation in the microgrid.

AI: A path to equity and resilience

The transition to decentralized microgrids offers new opportunities for energy efficiency, with AI playing a critical role in managing these systems. Yet additional efforts are needed for communities to fully realize these benefits. Residents of aging homes are burdened with outdated wiring, inefficient appliances, and poor insulation—factors that drive up energy costs. Their dependence on centralized grids offers little relief, underscoring the need for community-focused energy solutions. 

The West Atlanta project illustrates AI’s potential to create resilient, equitable, community-driven energy systems, paving the way for a more inclusive and sustainable future. Microsoft researchers are continuing to collaborate with local organizations to promote smarter energy management.

For additional details, please review the project report.

Acknowledgements

I would like to thank all the collaborators on these projects: West Atlanta microgrid: Erica L. Holloman-Hill, John Jordan Jr, Markese Bryant. I also want to thank Karin Strauss for reviewing and providing feedback on this blog post; Andalib Samandari, the intern who supported this project; Vaishnavi Ranganathan for helping to brainstorm throughout the project; AI & Society Fellows program for supporting projects in this domain; and Microsoft’s Datacenter Community Affairs team, Jon McKenley and Kelly Lanier Arnold for supporting the project in West Atlanta. 

The post AI-powered microgrids facilitate energy resilience and equity in regional communities appeared first on Microsoft Research.

Introducing DRIFT Search: Combining global and local search methods to improve quality and efficiency

GraphRAG is a technique that uses large language models (LLMs) to create knowledge graphs and summaries from unstructured text documents and leverages them to improve retrieval-augmented generation (RAG) operations on private datasets. It offers comprehensive global overviews of large, private troves of unstructured text documents while also enabling exploration of detailed, localized information. By using LLMs to create comprehensive knowledge graphs that connect and describe entities and relationships contained in those documents, GraphRAG leverages semantic structuring of the data to generate responses to a wide variety of complex user queries. Uncharted, one of Microsoft’s research collaborators, has recently been expanding the frontiers of this technology by developing a new approach to processing local queries: DRIFT search (Dynamic Reasoning and Inference with Flexible Traversal). This approach builds upon Microsoft’s GraphRAG technique, combining characteristics of both global and local search to generate detailed responses in a method that balances computational costs with quality outcomes.

How GraphRAG works

GraphRAG has two primary components, an indexing engine and a query engine.

The indexing engine breaks down documents into smaller chunks, converting them into a knowledge graph with entities and relationships. It then identifies communities within the graph and generates summaries—or “community reports”—that represent the global data structure. 

The query engine utilizes LLMs to build graph indexes over unstructured text and query them in two primary modes: 

  • Global search handles queries that span the entire dataset. This mode synthesizes information from diverse underlying sources to answer questions that require a broad understanding of the whole corpus. For example, in a dataset about tech company research efforts, a global query could be: “What trends in AI research have emerged over the past five years across multiple organizations?” While effective for connecting scattered information, global search can be resource intensive. 
  • Local search optimizes for targeted queries, drawing from a smaller subset of documents that closely match the user’s input. This mode works best when the answer lies within a small number of text units. For example, a local query could be: “What new features and integrations did Microsoft’s Cosmos DB team release on October 4th?”

The creation of these summaries often involves a human in the loop (HITL), as user input shapes how information is summarized (e.g., what kinds of entities and relationships are extracted). To index documents using GraphRAG, a clear description of the intended user persona (as defined in the indexing phase) is needed, as it influences how nodes, edges, and community reports are structured.

Introducing DRIFT Search

DRIFT Search introduces a new approach to local search queries by including community information in the search process. This greatly expands the breadth of the query’s starting point and leads to retrieval and usage of a far higher variety of facts in the final answer. This addition expands the GraphRAG query engine by providing a more comprehensive option for local search, which uses community insights to refine a query into detailed follow-up questions. These follow-ups allow DRIFT to handle queries that may not fully align with the original extraction templates defined by the user at index time.

Answer details | DRIFT (DS_Default) | Local (LS)
Supply Chain | Traced back to cinnamon in Ecuador and Sri Lanka; [Redacted Brand] and [Redacted Brand] Brands Impacted; Products sold at [Redacted Brand] and [Redacted Brand] | Plants in Ecuador
Contamination Levels | 2000 times higher than FDA max | Blood lead levels ranging from 4 to 29 micrograms per deciliter
Actions | Recalls and health advisories; Investigating plant in Ecuador; Issued warnings to retailers | Recalls and health advisories
Table 1: An example of summarized responses from two search techniques (DRIFT and Local Search) on a dataset of AP News articles to the query: “Describe what actions are being taken by the U.S. Food and Drug Administration and the Centers for Disease Control and Prevention to address the lead contamination in apple cinnamon fruit puree and applesauce pouches in the United States during November 2023”. As shown in the table, DRIFT search was able to surface details not immediately available with Local Search.


DRIFT Search: A step-by-step process 

  1. Primer: When a user submits a query, DRIFT compares it to the top K most semantically relevant community reports. This generates an initial answer along with several follow-up questions, which act as a lighter version of global search. To do this, we first expand the query using Hypothetical Document Embeddings (HyDE) to increase sensitivity (recall). We then embed the query, look it up against all community reports, select the top K, and use those reports to try to answer the query. The aim is to leverage high-level abstractions to guide further exploration.
  2. Follow-Up: With the primer in place, DRIFT executes each follow-up using a local search variant. This yields additional intermediate answers and follow-up questions, creating a loop of refinement that continues until the search engine meets its termination criteria, which is currently configured for two iterations (further research will investigate reward functions to guide terminations). This phase represents a globally informed query refinement. Using global data structures, DRIFT navigates toward specific, relevant information within the knowledge graph even when the initial query diverges from the indexing persona. This follow-up process enables DRIFT to adjust its approach based on emerging information. 
  3. Output Hierarchy: The final output is a hierarchy of questions and answers ranked on their relevance to the original query. This hierarchical structure can be customized to fit specific user needs. During benchmark testing, a naive map-reduce approach aggregated all intermediate answers, with each answer weighted equally. 
An image that shows a hierarchical tree with each node represented as a pie chart of weighting.
Figure 1. An entire DRIFT search hierarchy highlighting the three core phases of the DRIFT search process. A (Primer): DRIFT compares the user’s query with the top K most semantically relevant community reports, generating a broad initial answer and follow-up questions to steer further exploration. B (Follow-Up): DRIFT uses local search to refine queries, producing additional intermediate answers and follow-up questions that enhance specificity, guiding the engine towards context-rich information. A glyph on each node in the diagram shows the confidence the algorithm has to continue the query expansion step.  C (Output Hierarchy): The final output is a hierarchical structure of questions and answers ranked by relevance, reflecting a balanced mix of global insights and local refinements, making the results adaptable and comprehensive.
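The three phases above can be sketched in a few dozen lines. This is an illustrative outline only, not the GraphRAG implementation: the callables `embed`, `answer_fn`, and `local_search` are hypothetical stand-ins for the embedding model and LLM calls, each assumed to return an answer, a relevance score, and a list of follow-up questions. The HyDE expansion step is omitted, and the final ranking simply sorts by score.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    question: str
    answer: str
    score: float
    children: list = field(default_factory=list)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k_reports(query_vec, reports, k):
    """reports is a list of (text, vector) pairs; return the k closest texts."""
    ranked = sorted(reports, key=lambda r: cosine(query_vec, r[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def drift_search(query, embed, reports, answer_fn, local_search, k=2, max_depth=2):
    # Primer: answer from the top-K community reports and collect follow-ups.
    context = top_k_reports(embed(query), reports, k)
    answer, score, follow_ups = answer_fn(query, context)
    root = Node(query, answer, score)
    frontier = [(root, q) for q in follow_ups]
    # Follow-up: run each follow-up through local search for a fixed number
    # of iterations (two here, matching the termination setting in the text).
    for _ in range(max_depth):
        next_frontier = []
        for parent, question in frontier:
            a, s, more = local_search(question)
            child = Node(question, a, s)
            parent.children.append(child)
            next_frontier.extend((child, m) for m in more)
        frontier = next_frontier
    return root


def flatten_ranked(root):
    """Output hierarchy: all question/answer nodes, highest score first."""
    nodes, stack = [], [root]
    while stack:
        node = stack.pop()
        nodes.append(node)
        stack.extend(node.children)
    return sorted(nodes, key=lambda n: n.score, reverse=True)
```

Keeping the model calls as injected functions makes the control flow easy to test with stubs, and it leaves room to swap the fixed iteration count for a learned termination criterion later.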

Why DRIFT search is effective

DRIFT search excels by dynamically combining global insights with local refinement, enabling navigation from high-level summaries down to original text chunks within the knowledge graph. This layered approach ensures that detailed, context-rich information is preserved even when the initial query diverges from the persona used during indexing. By decomposing broad questions into fine-grained follow-ups, DRIFT captures granular details and adjusts based on the emerging context, making it adaptable to diverse query types. This makes it particularly effective when handling queries that require both breadth and depth without losing specific details.

Benchmarking DRIFT search

We tested the effectiveness of DRIFT search through a comparative analysis, across a variety of use cases, against GraphRAG local search and a highly tuned variant of semantic search methods. The analysis evaluated each method’s performance on key metrics such as:

  • Comprehensiveness: Does the response answer all aspects of the question?
  • Diversity of responses: Does the response provide different perspectives and insights on the question?

In our results, DRIFT search scored significantly better on both comprehensiveness and diversity. We set up an experiment in which we collected 5K+ news articles from the Associated Press and indexed them using GraphRAG. After indexing, we generated 50 “local” questions on this dataset and used both DRIFT and Local Search to generate answers for each of them. These “local” questions target specific details in the dataset that can be attributed to a small number of text units containing the answer. An LLM judge then scored the answers for comprehensiveness and diversity.

  • On comprehensiveness, DRIFT search outperformed Local Search 78% of the time.
  • On diversity, DRIFT search outperformed Local Search 81% of the time.
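A win rate of this kind is the share of pairwise comparisons in which the judge preferred one method. The sketch below assumes each comparison yields a single winner label with no ties; the function name `win_rate` and the sample verdicts are hypothetical and do not reproduce the experiment's data.

```python
def win_rate(verdicts, method):
    """Fraction of pairwise judge verdicts won by `method` (0.0 if empty)."""
    if not verdicts:
        return 0.0
    return sum(1 for winner in verdicts if winner == method) / len(verdicts)


# Hypothetical verdicts from an LLM judge, one per comparison.
sample = ["drift", "drift", "local", "drift"]
drift_share = win_rate(sample, "drift")
```

With these made-up verdicts, DRIFT wins three of four comparisons.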

Availability

DRIFT search is available now on the GraphRAG GitHub.

Future research directions

A future version of DRIFT will incorporate an improved version of Global Search that will allow it to more directly address questions currently serviced best by global search. The hope is to then move towards a single query interface that can service questions of both local and global varieties. This work will further evolve DRIFT’s termination logic, potentially through a reward model that balances novel information with redundancy. Additionally, executing follow-up queries using either global or local search modes could improve efficiency. Some queries require broader data access, which can be achieved by leveraging a query router and a lite-global search variant that uses fewer community reports, tokens, and overall resources.

DRIFT search is the first of several major optimizations to GraphRAG that are being explored.  It shows how a global index can even benefit local queries. In our future work, we plan to explore more approaches to bring greater efficiency to the system by leveraging the knowledge graph that GraphRAG creates.

The post Introducing DRIFT Search: Combining global and local search methods to improve quality and efficiency appeared first on Microsoft Research.
