GraphRAG: Improving global search via dynamic community selection



Retrieval-augmented generation (RAG) allows AI systems to provide additional information and context to a large language model (LLM) when generating a response to a user query. However, traditional RAG-based methods can struggle to retrieve information that requires high-level knowledge of the entire dataset, especially for abstract questions such as the keywordless query: "Catch me up on the last two weeks of updates." These are known as "global" queries, as they require a holistic understanding of the dataset to answer.

GraphRAG aims to tackle these questions in two main steps: indexing and query. The indexing engine first breaks a collection of text documents into segments, which are then clustered into hierarchical communities, with entities and relationships connecting each segment up through higher levels of abstraction. An LLM then generates a summary of each community, known as a community report. The indexing engine thus creates a hierarchical knowledge graph of the dataset, with each level in the hierarchy representing a different level of abstraction and summarization of the original material.

In the query step, GraphRAG uses this structured knowledge to provide additional context to the LLM to help answer the question. In this blog post, we present a new method for conducting "global" queries that efficiently utilizes the knowledge graph representation and optimizes the performance of global search in GraphRAG.

Static vs. dynamic global search

The global search algorithm in GraphRAG aims to answer abstract questions that require knowledge of the entire dataset. It generates answers by searching over communities at a predetermined level in the knowledge graph: the LLM combines and summarizes all the community reports at this level of abstraction, and the summary is then used as additional context for the LLM to generate the response to the user question. This map-reduce process allows the LLM to select relevant text from all the community reports when generating its final answer. However, this static approach is expensive and inefficient because it includes many lower-level reports that are not informative to the user query. Since it is unlikely that all community reports, especially at a high level, are relevant to the query, an approach that first considers a report's relevancy before the resource-intensive map-reduce operation is highly desirable.
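The static map-reduce flow described above can be sketched in a few lines of Python (the function names here are illustrative placeholders, not GraphRAG's actual API):

```python
# Minimal sketch of static global search. Every community report at a fixed
# level goes through the map step, regardless of its relevance to the query.

def static_global_search(query, reports, llm_map, llm_reduce):
    # Map: ask the LLM to extract query-relevant points from each report.
    partial_answers = [llm_map(query, report) for report in reports]
    # Reduce: combine all partial answers into one final response.
    return llm_reduce(query, partial_answers)
```

Because every report at the chosen level is mapped, the token cost grows with the number of reports at that level, relevant or not.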

Here, we introduce dynamic community selection to the global search algorithm, which leverages the knowledge graph structure of the indexed dataset. Starting from the root of the knowledge graph, we use an LLM to rate how relevant a community report is in answering the user question. If the report is deemed irrelevant, we simply remove it and its nodes (or sub-communities) from the search process. On the other hand, if the report is deemed relevant, we then traverse down its child nodes and repeat the operation. Finally, only relevant reports are passed to the map-reduce operation to generate the response to the user. Figure 1 illustrates the dynamic community selection process in action. 
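The pruning traversal described above can be sketched as follows (a minimal illustration; `rate` and `children` are stand-ins for the LLM rater and the graph's community hierarchy):

```python
# Illustrative sketch of dynamic community selection. Starting from the root
# communities, keep a report only if the rater deems it relevant, then recurse
# into the children of relevant communities; irrelevant subtrees are pruned.

def select_communities(query, roots, rate, children, max_depth=3):
    selected = []

    def visit(community, depth):
        if depth > max_depth or not rate(query, community):
            return  # prune this community and its entire subtree
        selected.append(community)
        for child in children(community):
            visit(child, depth + 1)

    for root in roots:
        visit(root, 0)
    return selected
```

Irrelevant subtrees are never visited, so their reports never reach the map-reduce step.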

An image that shows the workflow of dynamic community selection in global search. Each node illustrates a community report, and the arrow indicates the rate operation.
Figure 1: Dynamic community selection workflow

The dynamic global search approach has two main benefits. First, it prunes irrelevant reports early on, reducing the total number of community reports considered in the map-reduce operation. Second, it enables users to search the entire knowledge graph instead of predefining a static community level, collecting information at various levels of abstraction, which can lead to more detailed answers. Moreover, the rating operation is a classification problem, which is considerably easier than summarization and text generation, so a less complex model can be used. In our experiments leveraging OpenAI's models, a GPT-4o-mini rater achieved a retrieval rate very similar to that of a GPT-4o rater, while operating at a fraction of the cost and time. Overall, we use the smaller and more cost-effective model, GPT-4o-mini, in the rate operation to prune irrelevant community reports, and then use GPT-4o to perform the map-reduce operation that generates the final response.
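A minimal sketch of this division of labor, with the rating phrased as a yes/no classification (the `ask` client interface is a stand-in, not an actual SDK call):

```python
# The cheap model only answers a binary relevance question; the stronger
# model is reserved for the map-reduce generation over surviving reports.

RATER_MODEL = "gpt-4o-mini"   # classification: is this report relevant?
SEARCH_MODEL = "gpt-4o"       # generation: map-reduce over relevant reports

def rate_relevance(query, report, ask):
    """Return True if the rater model judges the report relevant."""
    prompt = (f"Is this community report relevant to the question "
              f"'{query}'? Answer yes or no.\n\n{report}")
    return ask(RATER_MODEL, prompt).strip().lower().startswith("yes")
```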

Dynamic community selection on the AP News dataset

To demonstrate the cost savings that dynamic global search brings while maintaining similar response quality, we evaluated the two methods side by side on a dataset from AP News. We tested static and dynamic search on 50 global questions and assessed the final response quality using an LLM evaluator. Moreover, we compared the total token cost of the two methods. To compare the two methods directly, we constrained the maximum search depth on dynamic global search so that both methods used the same underlying information.

We used an LLM evaluator to select the best response (i.e., the win rate) on three key metrics:

  • Comprehensiveness: How much detail does the answer provide to cover all the aspects and details of the question?
  • Diversity: How varied and rich is the answer in providing different perspectives and insights on the question?
  • Empowerment: How well does the answer help the reader understand and make informed judgements about the topic?
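A pairwise win-rate evaluation of this kind can be sketched as follows (the `judge` interface is a stand-in for the LLM evaluator):

```python
# For each question, the judge picks the better of two responses on a given
# metric; the win rate is the fraction of questions where response A wins.

def win_rate(questions, answers_a, answers_b, judge, metric):
    wins = sum(
        1 for q in questions
        if judge(q, answers_a[q], answers_b[q], metric) == "a"
    )
    return wins / len(questions)
```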

Microsoft Research Blog

Microsoft Research Forum Episode 3: Globally inclusive and equitable AI, new use cases for AI, and more

In the latest episode of Microsoft Research Forum, researchers explored the importance of globally inclusive and equitable AI, shared updates on AutoGen and MatterGen, presented novel use cases for AI, including industrial applications and the potential of multimodal models to improve assistive technologies.


Significant cost reduction while maintaining output quality

The quality of responses generated with dynamic community selection is comparable to that of its static counterpart, while the total token cost is reduced. Our LLM evaluation shows that the output quality of the two methods is similar on the three key metrics across the 50 global questions on the AP News dataset, with no statistically significant difference between them. More importantly, we observed a significant reduction in total token cost when using the new method, with an average cost reduction of 77% over the existing static global search at community level 1. This is because a large number of community reports are eliminated via the rating process, requiring fewer prompt and output tokens in the map-reduce operation. For instance, the existing static global search method processes about 1,500 level-1 community reports in the map-reduce operation, while dynamic search selects only 470 community reports on average to generate the final answer.

Moreover, if we allow dynamic search to continue the rating process down to deeper-level community reports, we observe an improvement in its final responses. Here, we conducted the same experiment but allowed dynamic search to continue until community level 3. Out of the 50 global questions, 29 ended up including more community reports than our static search baseline, suggesting that some community reports at deeper levels are relevant to the user question. Indeed, we observed a moderate and statistically significant improvement in both comprehensiveness and empowerment. Using an LLM evaluator to score pairs of responses, we observe that dynamic global search achieves win rates of 58% and 60%, respectively, against static search at level 1. Nevertheless, while the rating operation is performed by a smaller model and hence adds negligible cost, it can still lead to a higher overall cost due to the increased number of community reports processed by the map-reduce operation. In this experiment, the total cost with dynamic search at level 3 is 34% higher on average. Table 1 summarizes the results of static search at level 1 against dynamic search at levels 1 and 3.

| Dynamic search | Num. queries | Win rate: Comprehensiveness | Win rate: Diversity | Win rate: Empowerment | Relative cost |
|---|---|---|---|---|---|
| Level 1 | 50 | 49.5% | 46.0% | 48.0% | -77% |
| Level 3 | 29* | 58.8% | 48.0% | 60.0% | +34% |

Table 1. LLM evaluation of dynamic search vs. static search at different community levels on global questions. We show the win rates of dynamic search at level 1 and level 3 against static search at level 1, that is, the number of times the LLM evaluator prefers the generated responses from dynamic search over static search. The win rates of dynamic search at level 3 on comprehensiveness and empowerment are statistically significant; the other win rates are not. The right-most column shows the average relative token cost against static search at level 1. *Note that we only evaluate the 29 questions for which dynamic search at community level 3 included more community reports than static search at level 1.
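The statistical-significance checks referenced in the caption can be illustrated with an exact two-sided binomial test on win counts (a stdlib-only sketch of one common approach; the post does not specify the exact test it used):

```python
from math import comb

def binom_two_sided(wins, n, p=0.5):
    """P-value: probability of an outcome at least as extreme as `wins`
    under the null hypothesis that either response wins with probability p."""
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[wins]
    # Sum the probability of every outcome no more likely than the observed one.
    return sum(prob for prob in pmf if prob <= observed + 1e-12)
```

A win rate near 50% over 50 questions yields a p-value near 1, while a much more lopsided split falls below the usual 0.05 threshold.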

Overall, dynamic community selection achieved a cost reduction of 77% while maintaining output quality similar to the baseline static approach. These are promising numbers, but what about the quality of the results? Let's examine a direct side-by-side comparison with an example:

Comparable response quality and detail

Query: Across the dataset, describe the common trends in vaccination rates for major diseases.
Static global search at level 1:

### Common Trends in Vaccination Rates for Major Diseases

#### Decline in Vaccination Rates

A significant trend observed across various regions is the decline in vaccination rates for major diseases. For instance, Minnesota has experienced a decrease in immunization rates, leading to recent measles outbreaks [Data: Reports (335)]. Similarly, the COVID-19 pandemic has exacerbated this issue, causing a drop in childhood immunization rates in Michigan and other areas [Data: Reports (1231, 915)]. This decline is often attributed to misinformation about vaccines, particularly the false claim linking the MMR vaccine to autism [Data: Reports (912, 1335)].

#### Impact of COVID-19

The COVID-19 pandemic has had a profound impact on vaccination rates. Efforts to increase COVID-19 vaccination rates have been a major focus of public health policies, but the pandemic has also led to a decline in routine vaccinations among children [Data: Reports (920, 194)]. The pandemic has influenced public attitudes towards health measures and vaccinations, contributing to broader public health challenges [Data: Reports (920)].

#### Regional Variations

Vaccination rates vary significantly by region. West Virginia, for example, has one of the strictest school vaccination policies in the U.S., resulting in high child immunization rates [Data: Reports (1199)]. In contrast, Idaho has the highest overall childhood vaccination exemption rate in the U.S., posing significant public health risks [Data: Reports (998)]. Illinois faces challenges with low vaccination rates in schools, leading to measles outbreaks [Data: Reports (108, 82)].

#### Public Health Initiatives

Various public health initiatives have been implemented to address vaccination rates. The CDC plays a crucial role in monitoring and reporting on vaccination rates for diseases such as measles, mumps, and polio, and provides vital information on preventable diseases [Data: Reports (914)]. Programs like the ‘Do It For Babydog’ sweepstakes in West Virginia encourage vaccination among residents [Data: Reports (1199)]. Additionally, the Bill & Melinda Gates Foundation is heavily involved in funding and supporting vaccination programs for major diseases such as polio, malaria, HIV, and cholera [Data: Reports (1091)].

### Conclusion

The dataset reveals a complex landscape of vaccination rates influenced by regional policies, public health initiatives, and the impact of the COVID-19 pandemic. While some regions have managed to maintain high vaccination rates through strict policies and public health campaigns, others face significant challenges due to misinformation and exemption rates. Global efforts and collaborations remain crucial in addressing these challenges and improving vaccination coverage worldwide.

Global search with dynamic community selection at level 3:

### Common Trends in Vaccination Rates for Major Diseases

#### Decline in Vaccination Rates

A significant trend observed across the dataset is the decline in vaccination rates for various diseases, including measles, mumps, rubella (MMR), and polio. This decline is particularly evident among U.S. kindergartners, where vaccination coverage has dropped from 95.2% during the 2019-2020 school year to 93.1% in the 2022-2023 school year. This reduction falls below the 95% threshold needed to achieve herd immunity, putting approximately 250,000 kindergartners at risk each year [Data: Reports (5668, 15469, 15703, 15481, 15471, +more)].

#### Impact of the COVID-19 Pandemic

The COVID-19 pandemic has had a profound impact on vaccination rates. Efforts to promote COVID-19 vaccinations have been a major focus, with initiatives in regions like Michigan and Illinois offering free meals to students and promoting updated COVID-19 shots [Data: Reports (19125, 15466)]. However, the pandemic has also disrupted routine vaccination programs, leading to lower coverage for diseases such as measles and contributing to recent outbreaks [Data: Reports (15464, 15674, 15482, 15481, +more)].

#### Regional Variations and Legislative Measures

There are notable regional variations in vaccination rates and exemption rates across the United States. For example, West Virginia has implemented strict immunization requirements for children entering school, resulting in high child immunization rates [Data: Reports (5674, 18874, 18899)]. In contrast, states like Minnesota and Illinois have seen increases in non-medical exemptions, contributing to lower vaccination rates and recent outbreaks [Data: Reports (15483, 15481, 108, 2705, +more)].

#### Efforts to Improve Vaccination Rates

Various initiatives and legislative measures have been introduced to address declining vaccination rates. For instance, the Government of Sindh introduced a polio vaccination bill that includes provisions for imprisonment for parents who do not vaccinate their children [Data: Reports (15398)]. In the U.S., the CDC has recommended new COVID-19 shots for everyone aged 6 months and older and has launched initiatives to ensure equitable access to vaccines, especially in developing countries [Data: Reports (15847, 15571, 15691, 15694, +more)].

### Conclusion

The dataset reveals a complex landscape of vaccination rates influenced by the COVID-19 pandemic, vaccine hesitancy, misinformation, and regional variations. While efforts to improve vaccination rates are ongoing, the decline in immunization coverage poses significant public health risks, highlighting the need for continued vigilance and proactive measures to ensure high vaccination rates and prevent outbreaks of vaccine-preventable diseases.

Table 2. Generated response from static search (level 1) and dynamic search (level 3) to the same global question on the AP News dataset. 

Table 2 shows example outputs from static search at level 1 and dynamic search at level 3 for the same question. While the two outputs cover similar high-level topics, the response from dynamic search provides specific data, such as the reduction of vaccination rates in certain demographics. We also notice that the response from dynamic search makes significantly more references to the source material, indicated by "[Data: Reports]" in the text. By selectively providing information that is relevant to the question, dynamic selection relieves the map-reduce operation of having to filter and process all the community reports at once, allowing it to generate a response that is more comprehensive and specific to the user question.

Overall, dynamic community selection offers an alternative way to perform global search in GraphRAG by leveraging the indexed knowledge graph and using a cheaper LLM in the relevancy-rating operation. These changes lead to a lower total token cost and potential improvements in response detail and quality.

Availability

You can experiment with dynamic global search on the GraphRAG GitHub repository. 

Dynamic global search is the second of several major optimizations to GraphRAG that are being explored. If you are interested in optimizations for local questions, please check out our recent blog post on DRIFT search. Stay tuned for our upcoming work, where we explore a radically different approach to graph-enabled RAG that is significantly more cost-efficient while improving answer quality for both local and global questions. 

The post GraphRAG: Improving global search via dynamic community selection appeared first on Microsoft Research.


Orca-AgentInstruct: Agentic flows can be effective synthetic-data generators



Our work on Orca and Orca 2 demonstrated the power of using synthetic data for the post-training of small language models and getting them to levels of performance previously found only in much larger language models. Orca-AgentInstruct is another step in this direction, where we explore using agentic flows to generate diverse and high-quality data at scale. Orca-AgentInstruct is an agentic solution for synthetic-data generation. By leveraging an agentic framework, AgentInstruct can generate tailored datasets, comprising both prompts and responses, from raw data sources, paving the way to building a synthetic data factory for model fine-tuning.  

The efficacy of this approach is exemplified by the substantial improvement observed by fine-tuning a base Mistral 7-billion-parameter model and using AgentInstruct to generate a 25-million-pair dataset. The fine-tuned model (which we refer to as Orca-3-Mistral) showcases a notable performance gain across multiple benchmarks. For example, it shows 40% improvement on AGIEval, 19% improvement on MMLU, 54% improvement on GSM8K, 38% improvement on BBH, 45% improvement on AlpacaEval, and a 31.34% reduction of inaccurate or unreliable results across multiple summarization benchmarks.

We are making a 1-million-pair subset (orca-agentinstruct-1M) of this dataset publicly available, along with a report describing the data generation procedure, to encourage research on synthetic data generation and finetuning of language models. 

Bar graph comparing scores of the Mistral-Instruct-7B model and Mistral-7B post-trained on AgentInstruct data (Orca-3). The benchmarks are AGIEval, MMLU, BBH, GSM8K, AlpacaEval, FOFO, and Mirage-RAG. The graph shows substantial improvement across the benchmarks for the model fine-tuned with AgentInstruct data.
Figure 1: Effect of using AgentInstruct for post-training Mistral-7B. 
The figure shows the three flows used in AgentInstruct: 1) Content Transformation Flow converts the raw seed into an intermediate representation that simplifies the creation of instructions tailored to specific objectives. 2) Seed Instruction Generation Flow, comprising multiple agents, takes as input the transformed seed from the Content Transformation Flow and generates a set of diverse instructions. 3) Instruction Refinement Flow takes as input the instructions from the Seed Instruction Flow and iteratively enhances their complexity and quality.
Figure 2. This figure provides a thematic overview of the roles played by different groups of agents. Content Transformation Flow converts the seed into an intermediate representation that makes it easier to create high-quality and diverse data. Seed Instruction Generation Flow creates instances of the target tasks following a taxonomy. Instruction Refinement Flow explores the space further by starting from these initial data points and exploring the neighborhood. The expectation is that by picking a random seed we will be able to cover the entire region of data points. 
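The three flows in Figure 2 compose into a simple pipeline; here is a sketch with placeholder functions standing in for the agent groups (not the actual implementation):

```python
# 1) transform: raw seed -> intermediate representation
# 2) generate:  intermediate representation -> diverse seed instructions
# 3) refine:    iteratively raise instruction complexity and quality

def agentinstruct_pipeline(raw_seed, transform, generate, refine, rounds=2):
    intermediate = transform(raw_seed)      # Content Transformation Flow
    instructions = generate(intermediate)   # Seed Instruction Generation Flow
    for _ in range(rounds):                 # Instruction Refinement Flow
        instructions = [refine(i) for i in instructions]
    return instructions
```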

Synthetic Data Accelerated LLM Development: Over the past year, using synthetic data has greatly advanced the training of large language models (LLMs). It sped up model training at all stages, from pre-training (e.g., Phi-3) to instruction-tuning (e.g., Orca and WizardLM) and reinforcement learning from human feedback (e.g., Direct Nash Optimization). 

Generating high-quality synthetic data is hard: On the other hand, research indicates that pre-training models on synthetic data produced by other models can result in model collapse, causing models to progressively degrade. Similar concerns have been raised regarding the use of synthetic data for post-training, suggesting that it might lead to an imitation process where the trained model learns only stylistic features rather than actual capabilities. 

This discrepancy may be attributed to the challenge of generating high-quality and diverse synthetic data.  Successful use of synthetic data involves significant human effort in curating and filtering the data to ensure high quality. 

Synthetic data meets agents: Another major development we witnessed during the past year is the rise of agentic (especially multi-agent) workflows, such as with AutoGen. Agentic workflows can generate high-quality data, which surpasses the capabilities of the underlying LLMs, by using flows with reflection and iteration that enable agents to look back at solutions, generate critiques, and improve solutions. They can also use tools like search APIs, calculators, and code interpreters to address LLM limitations. 
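The reflect-and-iterate pattern described here can be sketched as a generate-critique-improve loop (the agent interfaces below are stand-ins, not AutoGen's actual API):

```python
# The critic reviews each solution; as long as it returns a critique, the
# generator revises the solution with that feedback, up to a round limit.

def refine_with_reflection(task, generator, critic, max_rounds=3):
    solution = generator(task, feedback=None)
    for _ in range(max_rounds):
        critique = critic(task, solution)
        if critique is None:  # critic is satisfied
            break
        solution = generator(task, feedback=critique)
    return solution
```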

Multi-agent workflows bring additional benefits as well, such as simulating scenarios in which we can generate both new prompts and the corresponding responses. They also enable automation of data-generation workflows, reducing or eliminating the need for human intervention on some tasks.

AgentInstruct: Generating synthetic data for post-training or finetuning often relies on an existing prompt set that is either used as is or used as seeds for generating more instructions. In this work, we generalize the problem setting to a broader objective: generating an abundant amount of diverse, challenging, and high-quality data to teach a particular skill to an AI model. We refer to this setting as generative teaching.

AgentInstruct is an agentic solution for generative teaching. It uses raw documents as input to create demonstration and feedback data. When generic data is used as seeds, AgentInstruct can be used to teach an LLM a general capability, such as writing, reasoning, or retrieval-augmented generation (RAG). Domain-specific data, such as retail or finance data, can also be used as seeds to improve the model in a certain specialization. AgentInstruct can create:

  1. High-quality data: AgentInstruct uses GPT-4, coupled with tools like search and code interpreters, to create high-quality data.  
  2. Diverse data: AgentInstruct creates prompts and responses using a set of specialized agents (with powerful LLMs, tools, and reflection flows) and a taxonomy (of more than 100 subcategories), ensuring diversity and quality.
  3. Large quantities of data: AgentInstruct can run autonomously and applies flows for verification and data filtering. It does not require seed prompts and uses raw documents for seeding.

Using raw data as seeds offers two advantages: it is plentiful, allowing AgentInstruct to generate large-scale and diverse datasets, and it encourages learning general skills instead of benchmark-specific ones by avoiding using existing prompts.


We anticipate agentic flows becoming increasingly important throughout the model-training lifecycle, including pre-training, post-training, and specialization, and ultimately enabling the creation of a synthetic data factory for model customization and continuous improvement. This has the potential to drive AI advances across multiple industries by making high-quality model training more efficient and accessible. 

Contributors:

Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei-ge Chen, Olga Vrousgou, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, and Ahmed Awadallah

The post Orca-AgentInstruct: Agentic flows can be effective synthetic-data generators appeared first on Microsoft Research.


Abstracts: November 14, 2024



Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Microsoft Senior Researcher Tong Wang joins guest host Bonnie Kruft, partner and deputy director of Microsoft Research AI for Science, to discuss “Ab initio characterization of protein molecular dynamics with AI2BMD.” In the paper, which was published by the scientific journal Nature, Wang and his coauthors detail a system that leverages AI to advance the state of the art in simulating the behavior of large biomolecules. AI2BMD, which is generalizable across a wide range of proteins, has the potential to advance solutions to scientific problems and enhance biomedical research in drug discovery, protein design, and enzyme engineering.

Transcript

[MUSIC]

BONNIE KRUFT: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES] 

I’m Bonnie Kruft, partner and deputy director of Microsoft Research AI for Science and your host for today. Joining me is Tong Wang, a senior researcher at Microsoft. Tong is the lead author of a paper called “Ab initio characterization of protein molecular dynamics with AI2BMD,” which has just been published by the top scientific journal Nature. Tong, thanks so much for joining us today on Abstracts!


TONG WANG: Thank you, Bonnie.

KRUFT: Microsoft Research is one of the earliest institutions to apply AI in biomolecular simulation research. Why did the AI for Science team choose this direction, and—with this work specifically, AI2BMD—what problem are you and your coauthors addressing, and why should people know about it?

WANG: So as Richard Feynman famously said, "Everything that living things do can be understood in terms of the jigglings and the wigglings of atoms." Studying the mechanisms behind biological processes and developing biomaterials and drugs requires a computational approach that can accurately characterize the dynamic motions of biomolecules. When we review computational research on biomolecular structure, we get two key messages. First, in recent years, predicting crystal, or static, protein structures with AI-powered methods has achieved great success and just last month won the Nobel Prize in Chemistry. However, characterizing the dynamic structures of proteins is more meaningful for the fields of biology, drugs, and medicine, but is much more challenging. Second, molecular dynamics simulation, or MD, is one of the most widely used approaches to study protein dynamics; it can be roughly divided into classical molecular dynamics simulation and quantum molecular dynamics simulation. Both approaches have been developed for more than half a century and have won Nobel Prizes. Classical MD is fast but less accurate, while quantum MD is very accurate but computationally prohibitive for protein studies. However, we need both accuracy and efficiency to uncover biological mechanisms. Thus, applying AI in biomolecular simulation can become a third way, achieving both ab initio, or first-principles, accuracy and high efficiency. In the winter of 2020, we foresaw that AI could make a difference in biomolecular simulations, so we chose this direction.

KRUFT: It took four years from the idea to the launch of AI2BMD, and there were many important milestones along the way. First, talk about how your work builds on and/or differs from what’s been done previously in this field, and then give our audience a sense of the key moments and challenges along the AI2BMD research journey.

WANG: First, I'd like to say that applying AI in biomolecular simulation is a novel research field. For AI-powered MD simulation of large biomolecules, there was no existing dataset, no well-designed machine learning model for the interactions between atoms and molecules, no clear technical roadmap, and no mature AI-based simulation system. So we faced new challenges every day. Second, some other works were exploring this area at the same time. I think a significant difference between AI2BMD and other works is that other works require generating new data and training deep learning models for each new protein; they take a protein-specific solution. In contrast, AI2BMD proposes a generalizable solution for a wide range of proteins. To achieve it, as you mentioned, there were some key milestones during the four-year journey. The first is that we proposed a generalizable protein fragmentation approach that divides proteins into the 20 commonly used kinds of dipeptides. Thus, we don't need to generate data for various proteins; instead, we only need to sample the conformational space of these dipeptides. From this we built the protein unit dataset, which contains about 20 million samples with ab initio accuracy. Then we proposed ViSNet, a graph neural network for molecular geometry modeling, as the machine learning potential for AI2BMD. Furthermore, we designed the AI2BMD simulation system to efficiently leverage CPUs and GPUs at the same time, achieving a simulation speedup of hundreds of times over a year earlier and bringing the AI-driven simulation to only ten to a hundred milliseconds per simulation step. Finally, we evaluated AI2BMD on energy, force, free energy, J-coupling, and many other kinds of property calculations for tens of proteins, and also applied AI2BMD in a drug development competition. All of this was done by a great team with science and engineering expertise, and with great leadership and support from the AI for Science lab.

KRUFT: Tell us about how you conducted this research. What was your methodology?

WANG: Because we were exploring an interdisciplinary research topic, our team consists of experts and students with backgrounds in biology, chemistry, physics, math, computer science, and engineering. Teamwork across different areas of expertise was key to the AI2BMD research. Furthermore, we collaborated and consulted with many senior experts in the molecular dynamics simulation field, and they provided very insightful and constructive suggestions for our research. Another aspect of the methodology I'd like to emphasize is learning from negative results. Negative results happened most of the time during the study. What we did was constantly analyze the negative results and adjust our algorithm and model accordingly. There's no perfect solution for a research topic, and we are always on the way.

KRUFT: AI2BMD got some upgrades this year, and as we mentioned at the top of the episode, the work around the latest system was published in the scientific journal Nature. So tell us, Tong—what is new about the latest AI2BMD system? 

WANG: Good question. We posted a preliminary version of the AI2BMD manuscript on bioRxiv last summer. I’d like to share three important upgrades from the past one and a half years. The first is a hundreds-of-times simulation speed acceleration, which makes AI2BMD one of the fastest AI-driven MD simulation systems and allows it to perform much longer simulations than before. The second is that AI2BMD was applied to many protein property calculations, such as enthalpy, heat capacity, folding free energy, pKa, and so on. Furthermore, we have been closely collaborating with the Global Health Drug Discovery Institute, GHDDI, a nonprofit research institute founded and supported by the Gates Foundation, to leverage AI2BMD and other AI capabilities to accelerate drug discovery processes.

KRUFT: What significance does AI2BMD hold for research in both biology and AI? And also, what impact does it have outside of the lab, in terms of societal and individual benefits?

WANG: Good question. For biology, AI2BMD provides a much more accurate approach than those used in the past several decades to simulate protein dynamic motions and study bioactivity. For AI, AI2BMD proves that AI can make a big difference in the study of dynamic protein structures, beyond AI for static protein structure prediction. Based on AI2BMD and other works, I can foresee a coming age of AI-driven biomolecular simulation: providing binding free-energy calculations with quantum simulation accuracy for drug–target protein complexes in drug discovery, detecting flexible biomolecular conformational changes that molecular mechanics cannot capture, and opening more opportunities for enzyme engineering and vaccine and antibody design.

KRUFT: AI is having a profound influence on the speed and breadth of scientific discovery, and we’re excited to see more and more talented people joining us in this space. What do you want our audience to take away from this work, particularly those already working in the AI for Science space or looking to enter it?

WANG: Good question. I’d like to share three points from my research experience. The first is to aim high. Exploring a disruptive research topic is better than doing 10 incremental works. Over the years of research, our organization has always encouraged us to do big things. The second is persistence. I remember a computer scientist once said that about 90% of the time during research is failure and frustration. The rate is even higher when exploring a new research direction. In the AI2BMD study, when we suffered from research bottlenecks that could not be tackled for several months, when we received critical comments from reviewers, when some team members wanted to give up and leave, I always encouraged everyone to persist, and we would make it. More importantly, the foundation of persistence is to ensure your research direction is meaningful and to constantly adjust your methodology based on failures and critical feedback. The third is real-world applications. Our aim is to leverage AI for advancing science. Proposing scientific problems is the first step; then comes developing AI tools and evaluating them on benchmarks; and, more importantly, examining their usefulness in real-world applications and further developing your AI algorithms. In this way, you can close the loop of AI for Science research.

KRUFT: And, finally, Tong, what unanswered questions or unsolved problems remain in this area, and what’s next on the agenda for the AI2BMD team?

WANG: Well, I think AI2BMD is a starting point for the coming age of AI-driven MD for biomolecules. There are lots of new scientific questions and challenges emerging in this new field. For example, how to expand the simulated molecules from proteins to other kinds of biomolecules; how to describe the biochemical reactions during the simulations; how to further improve the simulation efficiency and robustness; and how to apply it to more real-world scenarios. We warmly welcome people from both academia and industry to work together with us and make joint efforts to push the frontier of this new field forward.

[MUSIC]

KRUFT: Well, Tong, thank you for joining us today, and to our listeners, thanks for tuning in. If you want to read the full paper on AI2BMD, you can find a link at aka.ms/abstracts, or you can read it on the Nature website. See you next time on Abstracts!

[MUSIC FADES]

The post Abstracts: November 14, 2024 appeared first on Microsoft Research.


Toward modular models: Collaborative AI development enables model accountability and continuous learning



Today, development of generalizable AI models requires access to sufficient data and compute resources, which may create challenges for some researchers. Democratizing access to technology across the research community can advance the development of generalizable AI models. By applying the core software development concept of modularity to AI, we can build models that are powerful, efficient, adaptable, and transparent. 

Until recently, AI models were primarily built using a monolithic architecture. Though powerful, these models can be challenging to customize and edit compared to modular models with easily interpretable functional components. Today, developers employ modularity to make services more reliable, faster to refine, and easier for multiple users to contribute to simultaneously. One promising research direction that supports this involves shifting AI development toward a modular approach, which could enhance flexibility and improve scalability. 

One such approach is to use numerous fine-tuned models designed for specific tasks, known as expert models, and coordinate them to solve broader tasks (see Towards Modular LLMs by Building and Reusing a Library of LoRAs and Learning to Route Among Specialized Experts for Zero-Shot Generalization). These expert models can be developed in a decentralized way. Similar to the benefits of using a microservice architecture, this modular AI approach can be more flexible, cheaper to develop, and more compliant with relevant privacy and legal policies. However, while substantial research has been done on training optimization, coordination methods remain largely unexplored.

Our team is exploring the potential of modular models by focusing on two themes: i) optimizing the training of expert models and ii) refining how expert models coordinate to form a collaborative model. One method for coordinating expert models is to adaptively select the most relevant independently developed expert models for specific tasks or queries. This approach, called MoErging, is similar to Mixture-of-Experts (MoE) approaches but differs in that the routing mechanism is learned after the individual experts are trained. As an initial step, we contributed to creating a taxonomy for organizing recent MoErging methods with the goal of helping establish a shared language for the research community and facilitating easier and fairer comparisons between different methods. 

Assessing existing MoErging methods

Most MoErging methods were developed within the past year, so they don’t reference each other and are difficult to compare. To enable comparison of MoErging methods, we recently collaborated on a survey that establishes a taxonomy for comparing methods and organizes MoErging design choices into three steps: 

  • Expert design: Identifies and uses expert models trained asynchronously by distributed contributors. 
  • Routing design: Routes tasks to the appropriate expert models. 
  • Application design: Applies the merged models to specific tasks or domains. 

Each step is broken down into more detailed choices. For example, in expert design, expert training can be custom or standard, and training data can be private or shared. Custom training requires MoErging to follow specific training procedures, while standard training does not. Similarly, shared data means that the training data must be accessible for routing; otherwise, the training data is considered private. 

The benefits of modular models discussed below assume that training data doesn’t need to be shared. However, a review of current MoErging methods finds that some approaches do require sharing training data, making certain benefits no longer applicable. 



The survey evaluates 29 different MoErging methods using its taxonomy, which categorizes the design choices into two expert design choices, five routing design choices, and two application design choices, shown in Figure 1.

Figure 1: Taxonomy of model MoErging design choices. References in the leaf nodes link to sections of specific papers that implement each choice. We omit references to methods where a particular choice is not applicable. 

One takeaway from the survey is that most MoErging methods can be grouped into four categories based on their routing design choices:

  1. Classifier-based routing: Methods that train the router as a classifier using expert datasets or unseen data. 
  2. Embedding-based routing: Methods that compute embeddings of expert training sets and compare them to a query embedding for routing. 
  3. Nonrouter methods: Methods that do not explicitly train a router but instead initialize the router in an unsupervised manner.  
  4. Task-specific routing: Methods that learn a task-specific routing distribution over the target dataset to improve performance on a specific task. 
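As a rough illustration of the embedding-based category above, the toy sketch below routes a query to the expert whose training-set centroid is most similar to the query embedding. The hashed bigram "embedding" and the expert names are invented for the example; a real system would use a learned encoder and centroids computed from the actual expert training sets.

```python
import numpy as np

def embed(text: str, dim: int = 16) -> np.ndarray:
    """Toy stand-in for a sentence encoder: hash character bigrams
    into a fixed-size unit vector. A real system would use a learned model."""
    v = np.zeros(dim)
    for a, b in zip(text, text[1:]):
        v[(ord(a) * 31 + ord(b)) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def route(query_vec: np.ndarray, centroids: dict) -> str:
    """Return the expert whose centroid has the highest
    similarity (dot product) with the query embedding."""
    return max(centroids, key=lambda name: float(query_vec @ centroids[name]))

# Each expert is summarized by the mean embedding of its training texts,
# so routing needs only these centroids, not the (possibly private) data.
centroids = {
    "legal":   np.mean([embed(t) for t in ["contract clause", "statute text"]], axis=0),
    "medical": np.mean([embed(t) for t in ["patient symptoms", "drug dosage"]], axis=0),
}
expert = route(embed("interpret this contract clause"), centroids)
```

Because only centroids are exchanged, this style of routing keeps the expert training data private, which is the property several of the surveyed methods depend on.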

While the differences within each category are minor, the differences across categories are significant because they determine the level of data access required for implementation. As a result, data access is a primary factor in determining which methods are applicable and feasible in various settings. 

Our taxonomy also covers recent approaches to building agentic systems, which could be viewed as specific types of MoErging methods where experts are full language models and routing decisions are made on a step-by-step or example-by-example basis. The optimal level for MoErging may vary depending on the task and the computational resources available to each stakeholder. 

Potential benefits and use cases of modular models 

Modular models can unlock new benefits and use cases for AI, offering a promising approach to addressing challenges in current AI development. Moving forward, further substantial research is needed to validate this potential and assess feasibility.  

Modular AI may: 

  • Allow privacy-conscious contributions.  Teams with sensitive or proprietary data, such as personally identifiable information (PII) and copyrighted content, can contribute expert models and benefit from larger projects without sharing their data. This capacity can make it easier to comply with data privacy and legal standards, which could be valuable for healthcare teams that would benefit from general model capabilities without combining their sensitive data with other training data. 
  • Drive model transparency and accountability.  Modular models allow specific expert models to be identified and, if necessary, removed or retrained. For example, if a module trained on PII, copyrighted, or biased data is identified, it can be removed more easily, eliminating the need for retraining and helping ensure compliance with privacy and ethical standards. 
  • Facilitate model extensibility and continual improvement. Modularity supports continual improvements, allowing new capabilities from expert models to be integrated as they are available. This approach is akin to making localized edits, allowing for continuous, cost-effective improvement. 
  • Lower the barrier to AI development for those with limited compute and data resources. Modular AI can reduce the need for extensive data and compute by creating a system where pretrained experts can be reused, benefiting academics, startups, and teams focused on niche use cases. For example, an AI agent tasked with booking flights on a specific website with limited training data could leverage general navigation and booking skills from other trained AI experts, enabling generalizable and broadly applicable skills without requiring domain-specific training data. We explore this process of transferring skills across tasks in our paper “Multi-Head Routing For Cross-Task Generalization.” 
  • Support personalization.  Modular models make it possible to equip AI agents with experts tailored to individual users or systems. For instance, AI designed to emulate five-time World Chess Champion Magnus Carlsen could enhance a player’s preparation to play a match against him. Experiments suggest that storing knowledge or user profiles in on-demand modules can match or surpass the performance of retrieval-augmented generation (RAG), potentially reducing latency and improving the user’s experience in custom AI applications. 

Current limitations and looking forward 

In this blog, we focused on a type of modular approach that involves training foundation models, which requires substantial compute power and large amounts of data. Despite the advantages of modularity, such as increased flexibility, efficiency, and adaptability, the development of foundation models remains resource-intensive, necessitating high-performance computing and robust datasets to support fine-tuning.  

Recent work has begun to address these challenges by distributing the pretraining process of foundation models. Looking ahead, a promising research direction focuses on exploring how to create a minimal dataset for training “empty foundation models” while shifting most of their capabilities to external pluggable modules.

Modular methods are evolving rapidly, and we’re excited by their potential. Modularity has the capacity to democratize AI development, improve model accountability, and support efficient continuous learning. With the MoErging taxonomy, we aim to establish a shared language that fosters engagement within the research community. This research is in the early stages, and we welcome community collaboration. If you’re interested in working with us, please reach out to ModularModels@microsoft.com.

Acknowledgements

We would like to thank paper collaborators: Prateek Yadav, Colin Raffel, Mohammed Muqeeth, Haokun Liu, Tianlong Chen, Mohit Bansal, Leshem Choshen, Edoardo Ponti, Zhan Su, Matheus Pereira, Nicolas Le Roux, Nabil Omi, Siddhartha Sen, Anurag Sarkar, Jordan T. Ash, Oleksiy Ostapenko, and Laurent Charlin.

The post Toward modular models: Collaborative AI development enables model accountability and continuous learning appeared first on Microsoft Research.


Research Focus: Week of November 11, 2024


Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.


Look Ma, no markers: holistic performance capture without the hassle

Motion-capture technologies used in film and game production typically focus solely on face, body, or hand capture, requiring complex and expensive hardware and lots of manual intervention from skilled operators. While machine-learning-based approaches can overcome these challenges, they usually only support a single camera, often operate on a single part of the body, do not produce precise world-space results, and rarely generalize outside specific contexts.

In a recent paper: Look Ma, no markers: holistic performance capture without the hassle, researchers from Microsoft introduce a technique for marker-free, high-quality reconstruction of the complete human body, including eyes and tongue, without requiring any calibration, manual intervention or custom hardware. This approach produces stable world-space results from arbitrary camera rigs while also supporting varied capture environments and clothing. The researchers achieve this through a hybrid approach that leverages machine learning models trained exclusively on synthetic data and powerful parametric models of human shape and motion. They evaluate their method on a number of body, face, and hand reconstruction benchmarks and demonstrate state-of-the-art results that generalize on diverse datasets. 


Building AI Agents for Autonomous Clouds: Challenges and Design Principles

Using AI agents for operational resilience of cloud services, which currently requires significant human effort and domain knowledge, is a high-impact application. Interest is growing in AI for IT Operations (AIOps), which aims to automate complex operational tasks like fault localization and root cause analysis, thereby reducing human intervention and customer impact. However, achieving the vision of autonomous and self-healing clouds through AIOps is hampered by the lack of standardized frameworks for building, evaluating, and improving AIOps agents.  

In a recent paper: Building AI Agents for Autonomous Clouds: Challenges and Design Principles, researchers from Microsoft lay the groundwork for such a framework by first framing the requirements and then discussing design decisions that satisfy them. The researchers also propose AIOpsLab, a prototype implementation leveraging agent-cloud-interface that orchestrates an application, injects real-time faults using chaos engineering, and interfaces with an agent to localize and resolve the faults. The paper sets the stage for building a modular and robust framework for building, evaluating, and improving agents for autonomous clouds. 



Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming

AI-assisted programming offers great promise, but also raises concerns around the trustworthiness of AI-generated code. Proof-oriented languages like F* enable authoring programs backed by machine-checked proofs of correctness. Using AI to generate code and proofs in proof-oriented languages helps mitigate these concerns, while also making proof-oriented programming more accessible to people. 

In a recent preprint: Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming, researchers from Microsoft and external colleagues explore using AI to automate the construction of proof-oriented programs. The researchers curate a dataset of 940,000 lines of open-source F* programs and proofs, including software used in production systems ranging from Windows and Linux to Python and Firefox. The dataset includes around 54,000 top-level F* definitions, each representing a type-directed program and proof synthesis problem. A program fragment checker queries F* to check the correctness of candidate solutions. With this dataset, the researchers explore using AI to synthesize programs and their proofs in F*, finding the performance of fine-tuned smaller language models to compare favorably with LLMs, at much lower computational cost.


One-to-many testing for code generation from (just) natural language

The mostly basic Python programs (MBPP) dataset is commonly used for evaluating natural language models on the task of code generation. Despite its popularity, the original MBPP has two major problems: it relies on providing test cases to pin down the right function signature, and there is poor alignment between “what is asked” and “what is evaluated” by the test cases. 

To address these challenges, in their recent “One-to-many testing for code generation from (just) natural language” paper, researchers from Microsoft introduce the “mostly basic underspecified Python programs,” or MBUPP, dataset. This dataset adapts MBPP to emphasize the natural language aspect by allowing for some syntactic ambiguity (like not specifying the return type of a function) and evaluating generated code on multiple sets of assertions (like each set covering a different return type). Besides iteratively inspecting LLM results to extend the assertion sets, the researchers carefully remove poor alignment from the instructions (like a specific algorithm to use) and perform a majority vote over slightly paraphrased instructions to improve the quality of the dataset. The researchers compare popular open- and closed-weight models on the original MBPP and adapted MBUPP datasets to highlight the effect of paraphrasing and new test cases on code generation evaluation. The MBUPP dataset is publicly available to encourage its use in evaluating code generation models.
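The one-to-many evaluation idea can be sketched as follows: a candidate program counts as correct if it fully satisfies any one of several assertion sets, each capturing one valid interpretation of the underspecified prompt. The prompt, interpretations, and helper names below are illustrative, not taken from the MBUPP dataset itself.

```python
def passes_any(fn, assertion_sets):
    """One-to-many evaluation: a candidate is correct if it satisfies
    *any one* assertion set in full, where each set pins down a single
    valid interpretation of an underspecified natural-language prompt."""
    for assertions in assertion_sets:
        if all(check(fn) for check in assertions):
            return True
    return False

# Hypothetical underspecified prompt: "return the unique elements of a list".
# Returning a list or a set are both reasonable readings.
assertion_sets = [
    [lambda f: f([1, 2, 2]) == [1, 2]],  # interpretation A: list output
    [lambda f: f([1, 2, 2]) == {1, 2}],  # interpretation B: set output
]

candidate = lambda xs: set(xs)             # a model that chose to return a set
ok = passes_any(candidate, assertion_sets) # True: matches interpretation B
```

Under original MBPP-style scoring, the same candidate would fail simply for choosing the "wrong" return type, even though the prompt never specified one.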


The post Research Focus: Week of November 11, 2024 appeared first on Microsoft Research.


Preventing side-channels in the cloud



Cloud computing delivers scalable and cost-effective compute resources to a wide range of customers. The ability for cloud providers to share components of the hardware stack across customers, or tenants, is essential for running efficient cloud systems. For example, modern central processing units (CPUs) pack hundreds of physical hardware threads sharing terabytes of dynamic random-access memory (DRAM), which can be flexibly assigned to many independent virtual machines (VMs).

Preventing tenants from snooping on others who share the same hardware requires security mechanisms. Microsoft Azure provides strong protection via comprehensive architectural isolation through access control mechanisms implemented across the cloud platform, including the hardware and the hypervisor. Confidential computing powered by trusted execution environments further hardens architectural isolation via hardware memory encryption to protect tenants even against privileged attackers. 

A changing threat landscape

Even with perfect architectural isolation, sharing microarchitectural resources, such as CPU caches and DRAM row buffers, can leak small amounts of information, because interference (due to sharing) leads to variations in the latency of memory accesses. This gives rise to so-called microarchitectural side-channel attacks where a malicious tenant can learn information about another tenant, in the worst case: their cryptographic keys.

Microsoft Azure protects tenants and critical infrastructure against currently practical side-channel attacks. For example, side-channels in on-core resources (e.g., buffers, predictors, private caches) are comprehensively mitigated by Hyper-V HyperClear via core scheduling, microarchitectural flushing and scrubbing, and virtual-processor address space isolation; and our cryptographic libraries are carefully hardened to prevent any secrets from being leaked via microarchitectural side-channels. 

However, the threat landscape is changing. First, side-channel attacks are becoming increasingly sophisticated: For example, recent academic research has shown that even cache-coherence directories can be exploited to leak information across cores. Second, future CPUs are likely to employ increasingly sophisticated microarchitectural optimizations, which are prone to new kinds of attacks: For example, the recently introduced data-dependent prefetchers have already been found to leak information.

In Azure Research’s Project Venice, we are investigating principled defenses, to be prepared in case such emerging attacks start posing a risk to Azure customers.

Preventing microarchitectural side-channels with resource-exclusive domains

In a research paper, which received a distinguished paper award at the ACM Conference on Computer and Communications Security (ACM CCS’24), we present a system design that can prevent cross-VM microarchitectural side-channels in the cloud. Our design provides what we call resource-exclusive domains, which extend the architectural abstraction of private physical threads and private memory to the microarchitectural level. That is, resource-exclusive domains guarantee isolation even against powerful attackers that try to mount side-channel attacks on shared microarchitectural resources.

Our approach builds on isolation schemes, a novel abstraction of the way a CPU shares microarchitectural structures between its physical threads. Isolation schemes can be used by the hypervisor and host operating system to assign physical threads and physical memory pages, eliminating the risk of information leakage across resource-exclusive domains. Technically, for a given assignment of physical threads to resource-exclusive domains, the isolation scheme partitions each microarchitectural resource that is shared between domains (as such sharing would leak information), but without partitioning resources that are private to a domain (as partitioning those would hurt performance). We achieve this using hardware mechanisms, if available, and multi-resource memory coloring, if not.
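The actual isolation schemes are derived per CPU by the hypervisor; as a loose illustration of the multi-resource coloring idea only, the toy sketch below invents a page-frame-to-resource mapping (the "LLC" and "bank" bit positions are made up) and assigns frames to domains so that no color in any shared resource is ever used by two domains.

```python
def colors(pfn: int) -> tuple:
    """Multi-resource color of a physical page frame: its index in each
    shared microarchitectural resource (toy LLC-set and DRAM-bank bits)."""
    llc_color = pfn & 0b11          # low PFN bits select the cache partition
    bank_color = (pfn >> 2) & 0b1   # next bit selects the DRAM bank group
    return (llc_color, bank_color)

def partition(pfns, num_domains):
    """Give domain d only the frames with LLC color d *and* bank color d,
    so the domains are disjoint in every shared resource; frames whose
    colors don't line up are simply left unassigned in this sketch."""
    domains = {d: [] for d in range(num_domains)}
    for pfn in pfns:
        llc, bank = colors(pfn)
        if llc == bank and llc < num_domains:
            domains[llc].append(pfn)
    return domains

doms = partition(range(32), num_domains=2)
# The two domains now share no color in either resource, so neither the
# cache partitions nor the bank groups are contended across domains.
```

The price of coloring is visible even in the toy: many frames go unassigned, which is one reason real schemes combine coloring with hardware partitioning mechanisms where those exist.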

In a complementary research paper (opens in new tab) (appearing at ACM CCS’24 (opens in new tab)), we provide the theoretical foundations and practical algorithms for computing such multi-resource memory coloring schemes for existing microarchitectures, as well as design patterns for future microarchitectures to support a large number of resource-exclusive domains. 

We have implemented our approach in a research prototype based on Microsoft Hyper-V for a modern cloud chiplet-based CPU, AMD EPYC 7543P, that supports VM-level trusted execution environments. Using a collection of microbenchmarks and cloud benchmarks, we demonstrate that our approach eliminates all identified side-channels and incurs only small performance overheads. For example, when allocating resources at chiplet and channel granularity (i.e., coupling a chiplet with one of the local DRAM channels) we observe an overhead of less than 2%; and only up to 4% when allocating resources at chiplet granularity and coloring with 2MB pages.

Co-designing cloud platforms for future microarchitectural isolation

To validate the effectiveness and practicality of our approach, we inferred isolation schemes for a single CPU by reverse-engineering its microarchitecture. This approach is incomplete and does not scale to the diverse hardware fleet available in the cloud. We are working with CPU vendors to develop isolation schemes for future CPUs, which will then be exposed via the hardware interface for consumption by the hypervisor’s hardware abstraction layer. In this way, we will be able to reap the benefits of microarchitectural performance optimizations while continuing to provide strong security guarantees to cloud tenants. 

Additional Contributors

Cédric Fournet, Senior Principal Researcher
Jana Hofmann, Researcher
Oleksii Oleksenko, Senior Researcher

The post Preventing side-channels in the cloud appeared first on Microsoft Research.


Collaborators: Prompt engineering with Siddharth Suri and David Holtz



Transforming research ideas into meaningful impact is no small feat. It often requires the knowledge and experience of individuals from across disciplines and institutions. Collaborators, a Microsoft Research Podcast series, explores the relationships—both expected and unexpected—behind the projects, products, and services being pursued and delivered by researchers at Microsoft and the diverse range of people they’re teaming up with.

How significant will prompt engineering be as generative AI models continue to advance? After previous successful collaborations, Siddharth Suri, a Microsoft senior principal researcher, and David Holtz, an assistant professor at the University of California, Berkeley and a former intern of Suri’s, reunited to address the debate with data. In this episode, they discuss their study of how prompting approaches change as models advance. They share how the work required finding a variety of additional perspectives in what they describe as an Ocean’s Eleven-style recruitment effort; why mastering chain-of-thought prompting and other specialized methods might not be a prerequisite for getting what you want from a model; and, for aspiring researchers, what some butterflies can tell you about the types of challenges you’re pursuing. Suri and Holtz’s work is part of the Microsoft Research initiative AI, Cognition, and the Economy, or AICE, and is supported by the Microsoft Research initiative Accelerate Foundation Models Research, or AFMR.

Transcript

[TEASER] [MUSIC PLAYS UNDER DIALOGUE]

SIDDHARTH SURI: So, it’s, like, just before Thanksgiving 2020. My manager came to me, and she was like, Sid, we need somebody to understand, what are the effects of AI on society? And I was like, “Oh, yeah, small question! Yeah, I can do that by myself! Yeah. I’ll get you an answer by Tuesday,” OK? I felt like I was dropped in outer space, and I had to find Earth. And I didn’t even … I couldn’t even see the sun. Like, I … there was this entirely new system out there. No one knew how to use it. What are the right questions to ask? We were using the system to study how people use the system? Like, what the heck is going on?

DAVID HOLTZ: And I remember thinking, this seems like the most important thing that a person could be working on and studying right now. Like, anything else that I’m working on seems unimportant in comparison to the impact that this technology is poised to have on so many different facets of, you know, life and the economy and things like that.

[TEASER ENDS]

GRETCHEN HUIZINGA: You’re listening to Collaborators, a Microsoft Research Podcast showcasing the range of expertise that goes into transforming mind-blowing ideas into world-changing technologies. I’m Dr. Gretchen Huizinga.


[MUSIC FADES]

Today I’m talking to Dr. Siddharth Suri, also known as Sid, who’s a computational social scientist and a senior principal researcher at Microsoft Research. With him is Dr. David Holtz, an assistant professor in the Haas School of Business at the University of California, Berkeley. Sid and David are co-leading a team of researchers who are exploring the fascinating world of prompt engineering as part of the AI, Cognition, and the Economy, or AICE, initiative at Microsoft Research. I can’t wait to get into the meat of this research, but before we do, let’s meet our researchers. Sid, you first!

SIDDHARTH SURI: Hey, Gretchen, thanks for having me.

HUIZINGA: Tell us about yourself. At what intersection do your research interests lie, and what path led you to what you’re doing at Microsoft Research today?

SURI: So I got to where I am now through a very long and circuitous route, and I’ll give you the sort of CliffsNotes version of it, if you will. If you start back in grad school, my dream was to become a theoretical computer scientist. And what that basically means is writing algorithms. And what that basically means is pushing Greek symbols around a page. [LAUGHTER] And it turns out I’m good at that, but I’m not great at that. And towards the end of grad school, I was working with another professor, and he was doing these experiments that involved humans, and what we would do is we bring undergraduates into a lab. They were sitting in front of a computer using our software. We’d arrange them in different networks, so you’re trying to solve a problem with the people who are next to you in this network. And then we would change the structure of that network and have them solve the problem again. And we would try to understand, how does the structure of this network affect their ability to solve this problem? And I remember analyzing this data. I just was swimming around in this data and having a grand old time. I … nights, weekends … I remember riding the bus to school in Philadelphia, and I was trying to think about new analyses I could do. And it was just so … it was fun. I couldn’t get enough. And I remember my adviser talking to me one day, and he’s like, Sid, you’re really good at this. And I responded with, really good at what? I’m just doing the obvious thing that anybody would do. And he was like, bro, this is not obvious. Like, you know, you got a knack for this. And then that, sort of, set me on this path, and then, just to make a little long story short, I don’t have tons of self-awareness. So it took me like 10 full years to go from, like, deciding to hang up being a theoretical computer scientist and understanding humans, human behavior, and using technology to understand human behavior. 
And that’s, kind of, where I ended up as a computational social scientist. I’ve sort of gone all in in that space, as a computational social scientist. And that’s how David and I met. He’s a rising star in that space, as well. He became my intern. And that’s how we met. I’ll let him share his origin story with you.

HUIZINGA: Well, let’s do, David. I noticed you have a strong science background, but now you’re an assistant professor in a business school. So you got to do a little dot-connecting here. How did a guy with a degree in physics and astronomy—and should I also mention theater and dance? I’m so intrigued—um, how did that guy wind up working with MBAs and economists?

DAVID HOLTZ: Yeah, thanks for having me, Gretchen. Similar to Sid, my path to where I am today is also long and circuitous, and I will try to give you the CliffsNotes version. When I was young, I was always super interested in physics, and I think what drew me to physics was the way that it combined math, which I was very good at when I was younger, and the ability to answer big existential questions. Where does the universe come from? What’s the universe made out of? Is it growing? Is it shrinking? Things like that. And so when I went to college, I didn’t think too deeply about what I was going to study. I just, sort of, you know, always wanted to do physics. I’m going to do physics. And so I majored in physics. And then … I did my undergrad at Princeton, and there’s something about the physics department at Princeton where it’s almost just assumed everyone’s going to go get their PhD. And so there was a lot of “ambient pressure” to apply to graduate school. And so I actually started my physics PhD at Johns Hopkins. And as a PhD student, I was working on these large telescopes that look at remnant light from right after the Big Bang and try to characterize, you know, tiny fluctuations in this field of light that fills the night sky in a wavelength range that is not visible to the human eye. And by, sort of, characterizing those fluctuations in the light field, you can learn things about what the universe is made out of and how it’s evolving and all these types of things. It all sounds very cool. But the teams that conduct this research at this point are really big. It’s like you’re in a company, essentially. So there’s a hundred people working on building this telescope, analyzing these telescopes, so on and so forth. And so the actual day-to-day of my life as a physics PhD student was really far removed from the big existential questions that I was actually really interested in.
My PhD dissertation probably would have been developing a system that moved a mirror in exactly this way so that light polarization appears, you know, in the experimental apparatus. You’re basically doing an engineering degree. And on top of all that, like Sid, I was good at physics, but I think I realized I was not great at physics. And I saw a lot of people around me in my classes and in my labs that were great at physics and moreover were having a really hard time finding a job as a physics professor after they graduated despite being great at physics. And so I started having these realizations during graduate school and had never done anything really except physics and so took a leave of absence and actually came out to the Bay Area and started working out here in advertising, which is not something that I’m necessarily super excited about—and as a product manager, which is not what I do. But it was kind of the hop that I needed to try something different. And after some amount of time, moved from doing product management to doing data science. This was right when the data science boom was starting. I think the year that I came to the Bay Area, DJ Patil, who used to be the chief data scientist for the US, had written this very famous HBR article about, you know, how data science was the sexiest job of the 21st century …

HUIZINGA: Right!

HOLTZ: … so I, kind of, took my physics credentials and became a data scientist and eventually also moved out of advertising and went and worked at Airbnb, which at the time was growing really quickly and, you know, was sort of a young company where a lot of exciting things were happening. You know, I loved working at Airbnb. I learned a lot. I met a lot of interesting people. I learned a lot working in ad tech, as well, and eventually just found myself feeling pulled back to academia. Like, I really liked the questions that I was working on, the types of work that I was doing. Similar to Sid, I found that I was really good at analyzing data. I didn’t feel like I was doing anything particularly crazy, but people around me were saying, no man, you’re really good at this! And so I started looking for PhD programs where I could do the type of work that I was doing as a data scientist at Airbnb but in a more academic environment. And that, sort of, naturally led me to PhD programs in business schools. I didn’t know what a PhD in a business school entailed, but there were professors in those departments that were doing the research that I wanted to do. And so that’s how I ended up there. And so my research when I started out as a PhD student was, I think, relative to a lot of people, I didn’t start from, like, first principles. I don’t know that I necessarily had this one little thing that I was super interested in. I was really interested in solving applied problems and, in particular, I think some of the applied problems that I had seen out in the world working in tech. And over time, I think I found that I’m just really interested in new technologies and how those technologies affect, you know, the flow of information, how people collaborate, what happens to the economy, so on and so forth. 
And so I sort of started by just trying to answer a few problems that were in front of me and discovered this was kind of, you know, sort of the unifying theory of the things that I was interested in studying. And I think … you know, in hindsight, I think one thing that is true that has kind of guided, you know, my path—and this connects back to the theater and dance, you know, minor that you had alluded to earlier—is I’ve always been a really social person. I’ve always been really interested in humans and how they interact. I think that type of storytelling is really at the crux of, you know, theater and music and things like that. And when I was younger, for sure, I spent a lot of time writing music, playing music, doing improv comedy, performing on stage. And as a physicist, that itch wasn’t necessarily getting scratched, both because I was just studying, you know, extremely small particles and was doing it in a pretty lonely lab. And a nice thing about being a computational social scientist is that I’m studying humans, which is really interesting. I think it plugs into something that I’m really passionate about. And a cool thing about getting to do that in particular in a business-school setting, I think, is that, you know, I’m talking often to people at companies and, you know, lecturing to MBA students, who are really outgoing, gregarious people. And so it presents a really nice opportunity to, kind of, fuse, you know, my interest in science and information and technology with that other interest in humans and connection and, you know, the opportunity to, sort of, interact with people.

HUIZINGA: Yeah, yeah. Well, escaping from middle management in physics is probably a good thing … Well, before we get into the details of your collaboration on prompt engineering, let’s make sure everyone knows what we’re talking about. Sid, when we talked before, I told you, to be honest, when I first heard the phrase “prompt engineer” a couple years ago, I laughed because I thought it was a joke, like sanitation engineer. Then when I heard it was a real job, I laughed a little bit less. And then when I heard it was not only a real job but one that, if you were good at it, could pay six figures, I stopped laughing altogether and started paying attention. So I’d like you, Sid, to give us a brief history of prompt engineering. What is it, when and how did it become a thing, and why is it different from anything I’d do in garden-variety internet search?

SURI: So generative AI wants to do just that. It wants to generate something for you. But how do you express what you want? What do you want the system to give you? And the answer is a prompt. So I’ll give you an example. Whenever there’s a new model out there, especially one that generates images, a prompt I use—you might laugh at this—is, “Show me a picture of Bruno Mars on the surface of Mars eating a Mars bar.” [LAUGHTER] And the reason why I use that prompt is because Mars bars aren’t in the training data. There’s not a lot of pictures of Mars in the training data. And everybody knows who Bruno Mars is. So that’s me describing to the model what I want. That is a prompt. Show me a picture with these elements in it, OK? But this is where the hard part starts. It sends you something. Oh. I didn’t want Mars to be that color of red. Could you change it to a deeper red or more of an orange? OK. Now, could you put a little dust in the atmosphere? OK. Well, I want a moon in the background. I didn’t know I wanted a moon in the background, but now I do. Where’s the sun in this image? I don’t know. And then the whole thing, kind of, becomes much more rich and a much bigger exploration compared to, say, putting keywords into a search engine. It’s a really much more rich space to explore. Now you asked me … a part of your question was, why is prompt engineering difficult? It’s difficult for a number of reasons. Number one, you don’t always know what you want.

HUIZINGA: Yeah …

SURI: And so it’s that conversation with the system to figure that out. Number two, you might not be expressing what you want as clearly as you think you are.

HUIZINGA: Right …

SURI: Number three, the problem could be on the receiver end. These models are new. You might be expressing it clearly, but they might not be understanding what you’re saying as clearly as you would hope. And then the fourth reason is the one I just said, which is, like, what you’re asking for is not just like, “Give me a document relevant to these keywords,” or “Give me some information relative to these keywords,” as you would do in traditional search. You’re asking for something much more rich. And to get that richness that you were hoping for requires this prompt. And that requires an exploration of the idea in your head and an expression of that idea in the real world. So that’s what prompt engineering is, and that’s why it’s hard.

HUIZINGA: OK, and when would you say it became a thing? I mean, prompt engineer is an actual job, but it was a thing first, right? It didn’t start out to be a job; it started out to be something you did, so …

SURI: So these models came out, what was it, around late 2020, I think, when they first started becoming popular. So prompting had been around in academia a few years prior to that, but it first hit the mainstream when these models, sort of, first came out around 2020. And why this job? Why this six-figure salary? What’s all the hoopla about? And like I said before, these systems are new. No one knew how to use them. No one knew how to express what they want. That’s A. And B, there are a lot of arcane ways to prompt that aren’t obvious at the beginning. Like, I’ll give you a few examples. One way to prompt is to give the system examples of what you’re looking for. Say you want something to classify an email as spam or not spam. You might give it a few emails that are spam and a few emails that are not spam and say, hey, if it’s more like this, call it spam; if it looks more like that, call it not spam. And so that’s one example. Another example would be like, OK, I’m a small-business owner. I need some advice. This is the problem I’m facing. Give me some advice to solve this problem as if you were Bill Gates.

HUIZINGA: Oh …

SURI: That’s, like, adopting a persona. That’s another example. A third example would be, like, OK, you have a math problem. You’re trying to solve this math problem, and to get it done correctly, some of these systems need what’s known as chain-of-thought prompting, which is: tell me all the steps you’re going through to solve this problem. Don’t just give me the answer 17. Give me all the steps you needed to get to 17. And that helps guide the system, more likely, towards a correct answer. And so these are all arcane, esoteric methodologies for getting one of these models to give you the right answer, the answer you want. And being a prompt engineer means you’re an expert in these things and you’re more likely to get these correct answers than maybe someone off the street who isn’t familiar with these techniques.
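[The three patterns Suri describes, few-shot examples, personas, and chain-of-thought, can be sketched as plain prompt strings. The email snippets and wording below are illustrative stand-ins, not taken from the episode:]

```python
# Sketches of the three prompting patterns mentioned above. All of
# the concrete text here is made up for illustration.

# 1. Few-shot: show the model labeled examples before the new input.
few_shot_prompt = (
    "Classify each email as SPAM or NOT SPAM.\n"
    "Email: 'You won a free cruise, click here!' -> SPAM\n"
    "Email: 'Meeting moved to 3pm tomorrow.' -> NOT SPAM\n"
    "Email: 'Claim your prize now!!!' -> "
)

# 2. Persona: ask the model to answer in a particular role.
persona_prompt = (
    "You are Bill Gates advising a small-business owner. "
    "My problem: foot traffic dropped 30% this quarter. "
    "What would you do?"
)

# 3. Chain-of-thought: ask for the reasoning steps, not just the answer.
cot_prompt = (
    "A shirt costs $12 and is discounted 25%, then taxed 10%. "
    "What is the final price? Think step by step and show each "
    "calculation before giving the final answer."
)

for name, p in [("few-shot", few_shot_prompt),
                ("persona", persona_prompt),
                ("chain-of-thought", cot_prompt)]:
    print(f"--- {name} ---\n{p}\n")
```

[Any of these strings could be sent as the user message to a chat model; the pattern, not any particular API, is the point.]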

HUIZINGA: Right, right, right. Well, we’re going to talk a lot more about technique and the research that you did. And you’ve alluded to, at the beginning here, a visual, like describing … I heard graphic designers hearing the client when you were talking about it: “I didn’t want that red. Maybe put the moon in …” [LAUGHS]

SURI: Yeah, exactly!

HUIZINGA: Can you just tell me what you want to begin with? No, apparently not. But you’re also talking about verbal prompts and writing and so on. So we’ll get into that in a bit. But I want to go over and talk a little bit more about this research and why it’s where it is. This episode is the latest in our “series within a series” on AI, Cognition, and the Economy at Microsoft Research. And so far, we’ve talked about the impacts of AI on both cognition with Abi Sellen and the economy with Mert [Demirer] and Brendan [Lucier]. You can look up those episodes, fantastic episodes. This topic is a little less obvious, at least to me. So, David, maybe you could shed some light on how research for prompt engineering became part of AICE and why it’s an important line of research right now.

HOLTZ: So I think this project relates to both cognition and the economy. And let me lay out for you the argument for both. So first, you know, I’m not a cognitive scientist, but I think there are some interesting questions around how people, and in particular common people who are not computer scientists, conceive of and interact with these models, right. So how do they learn how to prompt? Do they think about different generative models as all being the same, or are they sort of developing different prompting strategies for different models? What are the types of tricks that they discover or use when they’re prompting models? And at the time that we started working on this project, there wasn’t a lot of research on this and there wasn’t a lot of data on this. You know, the data that existed typically is on the servers of big companies like Microsoft. It’s not really available to the public or to many researchers. And then the research is all, you know, sort of disproportionately focused on these esoteric prompting strategies that Sid mentioned, like chain-of-thought prompting, which are useful but are not things that, you know, my family members that are not scientists are going to be using when they’re trying to interact with, you know, the latest large language model that has been launched. So that was one draw of the project. The other thing that I think is interesting—and the reason that this project was well-suited to the AICE program—is that around the time that we were starting to work on this project, a bunch of research was coming out, and I’ve contributed to some of this research on a different project, on the impacts that generative AI can have on different economic outcomes that we care about. So things like productivity and job performance. And one interesting pattern that has emerged across numerous different studies trying to answer those types of questions is that the benefits of generative AI are often not uniform. 
Usually, generative AI really helps some workers, and there are other workers that it doesn’t help as much. And so there’s some interesting questions around why is it that some people are able to unlock big productivity gains using generative AI and others can’t. And one potential reason for this is the ways that people prompt the models, right. So I think understanding how people are actually interacting with these models when they’re trying to do work is a big part of understanding the potential impact that these models can have on the economy.

HUIZINGA: OK, it’s “how I met your mother” time. Let’s talk for a minute about how you two came to be working, along with what you’ve referred to as a “crack team” of researchers, on this study. So, Sid, why don’t you tell us, as you remember it, who called who, how the conversation went down, and who’s all involved. And then David can confirm, deny, or add color from his perspective.

SURI: OK, I need you to mentally rewind back to, like, November 2020. So it’s, like, just before Thanksgiving 2020. My manager came to me, and she was like, Sid, we need somebody to understand, what are the effects of AI on society? And I was like, “Oh, yeah, small question! Yeah, I can do that by myself! Yeah. I’ll get you an answer by Tuesday,” OK? Like, what the heck, man? That was like one of the biggest questions of all time. The first thing I did was assemble a team. We wrote an agenda, and we started going forward from there. You know, Scott Counts is a colleague of mine; he was on that team. Not long after that … as I had mentioned before, David was my intern, and he and I started brainstorming. I don’t remember who called who. Maybe David does. I don’t remember that. But what I do remember is having several fun, productive brainstorming conversations with him. I remember vividly, I was, like, sort of walking around my house, you know, upstairs, kind of, trying to bounce ideas off of him and get the creative juices flowing. And one of the things we were talking about was, I just felt like, again, this is early on, but prompting is the thing. Like, everybody’s talking about it; nobody knows how to do it; people are arguing. So David and I were brainstorming, and then we came up with this idea of studying prompting and how prompting changes as the models get better and better, which they are, at a torrid rate. And so that was our, sort of, key question. And then David actually was primarily involved in assembling the crack team, and he’s going to talk more about that. But as a side note, it’s really cool for me to see David, kind of, grow from being, you know, just a great, sort of, individual scientist to, like, the leader of this team, so that was, kind of, a cool thing for me to see.

HUIZINGA: Hmm. You know, you tell that story … Peter Lee, who’s the president of Microsoft Research, tells a similar story where a certain CEO from a certain company came and dropped him in the middle of the AI and healthcare ocean and said find land. So did it have that same sort of “overwhelmed-ness” to it when you got asked to do this?

SURI: Overwhelmed would be an understatement! [LAUGHTER] It was overwhelming to the point where I was borderline afraid.

HUIZINGA: Oh, dear!

SURI: Like, you know, Peter has this analogy you mentioned, you know, “dropped in the ocean, find land.” I felt like I was dropped in outer space and I had to find Earth. And I didn’t even … I couldn’t even see the sun. Like, I … there was this entirely new system out there. No one knew how to use it. What are the right questions to ask? We were using the system to study how people use the system? Like, what the heck is going on? This was, like, stress levels were on 12. It was a sort of wild, white-knuckle, anxiety-inducing, fun, intense ride. All of those emotions wrapped up together. And I’m happy it’s over [LAUGHS] because, you know, I don’t think it was sustainable, but it was an intensely productive, intensely … again, just in case there’s any budding scientists out there, whenever you’re like swimming around in a problem and your gut is a little scared, like, I don’t know how to do this. I don’t know if I’m doing this right. You’re probably working on the right problem. Because if you know how to do it and you know how to do it right, it’s probably too easy.

HUIZINGA: Yeah!

SURI: And in this moment, boy, my gut was telling me that nobody knows how to do this and we got to figure this out.

HUIZINGA: Right. David, from your theater background, did you have some of these same emotions?

HOLTZ: Yeah, I think so. I think Sid and I, it’s interesting, we have different perspectives on this kind of interesting generative AI moment. And to use the theater analogy, I think being, you know, like, a researcher at Microsoft, Sid has kind of been able, the whole time, to see behind the curtain and see everything that’s going on. And then as someone that is, you know, a researcher in academia, I’ve sort of been in the audience to some extent. Like, I can see what’s coming out onto the stage but haven’t seen all the craziness that was happening behind the curtain. And so I think for me, the way that I would tell the story of how this project came together is, after I had finished my internship and Sid and I—and a number of coauthors—had this very successful remote work paper, we just kept in touch, and every few weeks we’d say, hey, you know, want to chat, see what we’re both working on, swap research ideas?

HUIZINGA: Yeah …

HOLTZ: And for me, I was always looking for a way to work together with Sid. And if you look around at, you know, the history of science, there are these duos, Kahneman and Tversky, Watson and Crick. Like, there are these teams that stay together over long periods of time and they’re able to produce really amazing research, and so I realized that one thing that I should prioritize is trying to find people that I really like working with, that I really click with, and just trying to keep on working with those people. Because that’s one of the keys to having a really successful career. At the same time, all this generative AI stuff was happening, and I went to a few talks. One of them was on the Berkeley campus, and it was a talk by someone at Microsoft Research, and it was about, sort of, early signs of how amazing, you know, GPT-4 was. And I remember thinking, this seems like the most important thing that a person could be working on and studying right now. Like, anything else that I’m working on seems unimportant in comparison to the impact that this technology …

HUIZINGA: Wow …

HOLTZ: … is poised to have on so many different facets of, you know, life and the economy and things like that. And so I think things kind of came together nicely in that there was this opportunity for Sid and I to work together again and to work together again on something that we both agreed was just so incredibly important. And I think we realized this is really important. We really want to work on this problem. But we’re also both super busy people, and we don’t necessarily have all the skills that we need to do this project. And given how important this question is and how quickly things are moving, we can’t afford to have this be a project where it’s like, every now and then … we come back to it … maybe we’ll have a paper in, like, three years. You know, like, things needed to happen really quickly. And so that’s where we got to thinking, OK, we need to put together a team. And that’s kind of where this, like, almost, like, Ocean’s Eleven, sort of, scene emerged [LAUGHTER] where we’re like, we’re putting together a team. We need a set of people that all have very particular skills, you know, and I’m very lucky that I did my PhD at MIT in this sort of community that is, I would say, one of the highest concentrations of really skilled computational social scientists in the world, basically.

HUIZINGA: Wow.

HOLTZ: And so I, sort of, went to, you know, to that community and looked for people. I reached out to people that I had met during the PhD admissions program that were really promising, you know, young PhD students that might want to work on the project and, sort of, put the team together. And so this project is not just Sid and I. It’s six other people: Eaman Jahani, Ben Manning, Hong-Yi TuYe, Joe Zhang, Mohammed Alsobay, and Christos Nicolaides. And everyone has brought something unique and important to the project. And it’s really kind of crazy when you think about it because on the one hand, you know, sometimes, when we’re talking, it’s like, wow, eight people. It’s really a lot of people to have on a paper. But at the same time, you, kind of, look at the contributions that every single person made to the project and you, kind of, realize, oh, this project actually could not have happened if any one of these people were not involved. So it’s been a really interesting and fun project in that way.

SURI: One thing I just wanted to add, Gretchen, is, I’m a little bit older than David, and when I look back at my career and my favorite projects, they all have that property that David was alluding to. If you knocked one of the coauthors off that project, it wouldn’t have been as good. To this day, I can’t figure out why that is so important, but it is. It’s just this notion that everyone contributed something, and that something was unique, something no one else would have figured out.

HUIZINGA: Well, and the allusion to Ocean’s Eleven is exactly that. Like, they have to get someone who can crack a safe, and they have to get someone who’s a contortionist and can fit into a box that no one can see, and blah, blah, blah. And I don’t know if you’ve argued about which one of you is George Clooney and which one of you is Brad Pitt, but we’ll leave that for a separate podcast.

SURI: Well, actually … [LAUGHTER] it’s not even a question because Eaman Jahani is by far the most handsome one of us, so he’s Brad Pitt. It’s not even close. [LAUGHS]

HUIZINGA: David’s giggling!

HOLTZ: Yeah, I think Sid … I’d agree with that. I think Sid is probably George Clooney.

SURI: I’ll take it. I’ll take it!

HUIZINGA: Anytime! Well, we’ll talk about some more movies in a minute, but let’s get into the details of this research. And, Sid, I was looking at some of the research that you’re building on in your literature review, and I found some interesting papers that suggest there’s some debate on the topic. You’ve just alluded to that. But let’s talk about the titles: “AI’s hottest job: Prompt engineer,” and, like, “Tech’s hottest new job: AI whisperer. No coding required.” But then there’s this Harvard Business Review article titled “AI prompt engineering isn’t the future.” And that left me wondering who’s right. So I suspect this was part of the “prompting” for this research. Tell us exactly what you did and how you did it.

SURI: Sure, so we came to this question from a couple of directions. One is what you just said. There’s this conversation going on in the public sphere, which is, on the one hand, there’s these jobs; there’s this notion that prompting, prompt engineering, is a super important thing; it’s paying six figures. On the other hand, there’s also this notion that these models are getting better and better. They’re more able to figure out what you need and guess what you want, and so maybe we’re not going to need prompting going forward.

HUIZINGA: Right.

SURI: And David and I were like, this is perfect. One of my mentors, Duncan Watts, I always joke with him that every introduction of our paper is the same. It’s “There’s this group of people that say x, and there’s this group of people that say the opposite of x. So we did an experiment to figure it out.” And the reason why every introduction of one of my papers is the same is because you can never say at the end it was obvious. If it was so obvious, then how come there’s two groups of people disagreeing on what the outcome’s going to be? So what we did in the experiment—it’s very simple to explain—is we gave people a target image, and then they randomly either got DALL-E 2 or DALL-E 3. And we said, “OK, write a prompt to generate this target image that we’ve given you,” and we give them 10 tries. “And you can iterate; you can improve; you can experiment. Do whatever you want.” And the notion was, as models progress, what is the relationship between people’s ability to prompt them to get to the target?
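[The design Suri outlines, random assignment to a model and a fixed budget of 10 attempts with free iteration, can be sketched as a simple loop. Everything concrete here (the scoring function, the toy participant strategy) is a hypothetical stand-in, since the conversation doesn't specify how image similarity was measured:]

```python
import random

# Hypothetical sketch of the experimental design described above:
# each participant is randomly assigned one model and gets 10 tries
# to write a prompt that reproduces a target image.

NUM_TRIES = 10

def assign_model(models, rng):
    """Between-subjects random assignment, e.g. DALL-E 2 vs. DALL-E 3."""
    return rng.choice(models)

def run_session(write_prompt, generate, score):
    """One participant: NUM_TRIES attempts, free to iterate each time.
    write_prompt sees the history of (prompt, score) pairs so the
    participant can adjust; returns the best score achieved."""
    history = []
    for _ in range(NUM_TRIES):
        prompt = write_prompt(history)
        image = generate(prompt)
        history.append((prompt, score(image)))
    return max(s for _, s in history)

# Toy stand-ins: an "image" is just the prompt string, and the score
# is the fraction of target words the prompt hits.
target_words = {"red", "planet", "moon", "dust"}

def toy_generate(prompt):
    return prompt

def toy_score(image):
    return len(target_words & set(image.split())) / len(target_words)

def toy_participant(history):
    # Naive iteration: add one more target word on each attempt.
    known = ["red", "planet", "moon", "dust"]
    return " ".join(known[: len(history) + 1])

rng = random.Random(0)
model = assign_model(["DALL-E 2", "DALL-E 3"], rng)
best = run_session(toy_participant, toy_generate, toy_score)
print(model, best)  # the toy participant reaches a perfect score (1.0)
```

[The study's actual question is then the relationship between the assigned model and the score trajectory across participants, not any single session.]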

HUIZINGA: That’s the end of it. [LAUGHS]

SURI: Yeah. [LAUGHS]

HUIZINGA: That’s the most succinct explanation of a research study that I’ve ever heard. Congratulations, Sid Suri! So I have a question, and this is like … you’ve talked a bit already about how you iterate to get to the target image. My experience is that it can’t remember what I told it last time. [LAUGHTER] So if I put something in and then I say, well, I want you to change that, it starts over, and it doesn’t remember what color red it put in the first image. Is that part of the process, or are these models better than what I’ve done before?

SURI: The models are changing, and the history, the context, the personalization is what you’re referring to. That is coming online in these models already and in the near future. Maybe at the time we did the study, it wasn’t so common. And so they were suffering from the same issue that you just alluded to. But going forward, I do expect that to, sort of, fade away a little.

HUIZINGA: OK. Well, David, Sid’s just given us the most beautifully succinct description of people trying to get the model to give them the target image and how many tries they got. What did you find? What were the big takeaways of this research?

HOLTZ: So let me start out with the most obvious finding that, you know, like, Sid was saying, ideally, you know, you’re, kind of, answering a question where it makes sense that people are on both sides of this argument. One thing that we looked at that you’d be surprised if there was someone on the other side of the argument is, OK, do people do a better job when we give them the better model? If we give them DALL-E 3 instead of DALL-E 2, do they do a better job of re-creating the target image? And the answer is unsurprisingly, yes. People do a better job when we give them the better model. The next thing that we looked at—and this is where I think the results start to get interesting—is why do they do better with the better model? And there’s a couple of different reasons why this can be the case. The first could be that they’re writing the exact same prompts. They interact with the model exactly the same, whether it’s DALL-E 2 or DALL-E 3, and it’s just the case that DALL-E 3 is way better at taking that input and translating it into an image that is the image that you had in mind with that prompt. So, you know, sort of, imagine there’s two different artists. One is like a boardwalk caricature artist; the other one is Vincent van Gogh. Like, one of them is probably going to be better at taking your input and producing a really high-quality image that’s what you had in mind. The other possibility is that people, sort of, pick up on the fact that one of these models is different than the other. Maybe it’s more expressive. Maybe it responds to different types of input differently. And as you start to figure that out, you’re going to actually prompt the model, kind of, differently. And so I think the analogy I would draw here is, you know, imagine that you’re driving a couple of different cars maybe, like, one has really nice power steering and four-wheel drive and things like that. The other one doesn’t have all these cool features. 
You know, you’re probably going to actually handle that car a little bit differently when you take it out on the road relative to a really simple car. And what we find when we actually analyze the data is that both of these factors contribute to people doing better with the higher-quality model. And they actually both contribute equally, right. So insofar as people do better with DALL-E 3, half of that is because DALL-E 3 is just a better model at, like, taking the same input and giving you, like, an image that’s closer to what you had in mind. But the other half is due to the fact that people, sort of, figure out on their own, oh, this model is different. This model is better. It can maybe respond to my inputs a little bit more expressively. And they start prompting differently. And one thing that’s really neat and interesting about the study is we didn’t tell people whether they were given DALL-E 2 or DALL-E 3. So it’s not even like they said, oh, you gave me the good model. OK, let me start prompting differently. They kind of just figured this out by interacting with the tool and kind of, you know, realizing what it can do and what it can’t do. And specifically when we look at what people are doing differently, they’re, kind of, writing longer prompts; they’re writing more descriptive prompts. They have way more nouns and verbs. They’re kind of doing less feeling around in the dark and kind of finding, like, a way of interacting with the model that seems to work well. And they’re kind of doubling down on that way of interacting with the model. And so that’s what we saw. And so when it connects back to your question of, you know, OK, prompt engineering, like, is it here to stay, …
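[The 50/50 split Holtz describes rests on a counterfactual comparison: how well would DALL-E 2 users' prompts have scored if run through DALL-E 3? With made-up numbers, the arithmetic looks like this (the scores are purely illustrative, not from the study):]

```python
# Decomposing the DALL-E 3 gain into a model effect and a
# prompt-adaptation effect. None of these scores are real data.

score_d2_prompts_on_d2 = 0.60  # baseline: DALL-E 2 users on DALL-E 2
score_d2_prompts_on_d3 = 0.70  # counterfactual: same prompts, better model
score_d3_prompts_on_d3 = 0.80  # observed: DALL-E 3 users on DALL-E 3

total_gain = score_d3_prompts_on_d3 - score_d2_prompts_on_d2
model_effect = score_d2_prompts_on_d3 - score_d2_prompts_on_d2       # better model, same prompts
adaptation_effect = score_d3_prompts_on_d3 - score_d2_prompts_on_d3  # same model, adapted prompts

# With these illustrative numbers, each effect accounts for half the
# gain, matching the roughly 50/50 split described in the episode.
print(round(model_effect / total_gain, 2), round(adaptation_effect / total_gain, 2))
```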

HUIZINGA: Yeah.

HOLTZ: … or is prompt engineering going away? I think one way that we think about interpreting these results is that the prompts do matter, right. Like, if you didn’t think about how to prompt different models and you just wrote the same prompts and left that prompt “as is” for, you know, months or years, you’d be missing out on tons of the gains that we stand to experience from these new, more powerful models because you need to update the prompts so that they take advantage of the new model capabilities. But on the flip side, it’s not like these people needed to, you know, go read the literature on all these complicated, esoteric prompting strategies. They kind of figured it out on their own. And so it seems like prompting is important, but is it necessarily prompt engineering, where it’s this really, you know, heavy-duty, like, thing that you need to do or you maybe need to go take, like, a class or get a master’s degree? Maybe not. Maybe it’s just a matter of people interacting with the models and, kind of, learning how to engage with them.

HUIZINGA: Well, David, I want to ask you another question on that same line, because AI is moving so fast on so many levels. And it’s still a relatively new field. But now that you’ve had some time to reflect on the work you just did, is there anything that’s already changed in the conversation around prompt engineering? And if so, what are you thinking about now?

HOLTZ: Yeah. Thanks for the question. Definitely things are changing. I mean, as Sid mentioned, you know, more and more the way that people interact with these models, the models have some notion of history. They have some notion of context. You know, I think that informs how people are going to write prompts. And also, the types of things that people are trying to do with these models is constantly changing, right. And so I think as a result, the way that we think about prompting and, sort of, how to construct prompts is also evolving. So I think the way that we think about this study is that it’s by no means, you know, the definitive study on prompt engineering and how people learn to prompt. I think everyone on our team would agree there’s so much more to do. But I think the thing that struck us was that this debate that we mentioned earlier, you know, is prompting important? Will prompt engineering stay? Maybe it doesn’t matter? It was really a debate that was pretty light on evidence. And so I think the thing that we were excited to do is to sort of, you know, start to chip away at this big question with data and with, you know, an experiment and just try to start developing some understanding of how prompting works. And I think there’s tons more to do.

HUIZINGA: Right, right, right.

SURI: Just to add to that …

HUIZINGA: Yeah, please.

SURI: Again, if there’s any sort of young scientists out there, one of the things I hate doing with other scientists is arguing about what’s the answer to this question. So what I always do when there’s an argument is I just shift the argument to instead of arguing about is this question going to be yes or no, is what’s the data we need to answer the question? And that’s where David and I, sort of, came in. There was this argument going on. Instead of just arguing between the two of us about what we think it’s going to be, we just shifted the conversation to, OK dude, what data do we need to gather to figure out the answer to this question? And then boom, this project was off and running.

HUIZINGA: You know, that could solve so many arguments, you know, in real life, just like, you don’t know and I don’t know, why are we arguing? Let’s go find out.

SURI: Yeah, so instead of arguing about who knows what, let’s argue about what’s the data we need so that we’ll be convinced!

HUIZINGA: Well, on that line, Sid, another paper in the literature that you looked at was called The prompt report: A systematic survey of prompting techniques. And we’ve talked a little bit about what those techniques involve. But what has your research added to the conversation? Specifically, I’m interested to know, I mean, we did talk about tricks, but is there coaching involved or is this just sort of feel-your-way-in-the-dark kind of thing? And how fine is the line between what you referred to as alchemy and chemistry in this field?

SURI: The alchemy and chemistry analogy was David’s brilliant analogy, and what he was saying was, way back when, there was alchemy, and then out of that grew chemistry. And at the moment, there’s these, sort of, niche, esoteric ways of prompting—chain-of-thought, embody a persona, this kind of thing. And how are those going to get propagated out into the mainstream? That’s how we go from alchemy to, sort of, chemistry. That was his brilliant analogy. And there’s several punchlines of our work, but one of the punchlines is, people can figure out how to take advantage of the new capabilities of these models on their own, even when they don’t know the model changed. So that’s a great democratization argument.

HUIZINGA: Hmm …

SURI: That, OK, you don’t need to be the six-figure Silicon Valley hotshot to figure this out. That maybe, maybe everyone in the world who has access—who has internet access, electricity, and access to one of these models—they can sort of pick themselves up by their own bootstraps, learn how to use these things on their own. And I want to go back to an analogy you said a while ago, which was the analogy to traditional internet search, …

HUIZINGA: Yeah.

SURI: OK? People forgot this, but we’ve learned how to search over the course of about 30 years. I’m 45 years old, so I remember the early search engines like AltaVista, Lycos, things like that. And basically, getting anything useful out of them was pretty much impossible. I really wanted to swear right there, but I didn’t. [LAUGHTER] And what people forgot, people forgot that they didn’t know how to ride a bike, OK? And they forgot that we didn’t actually know … these systems didn’t work that well; we didn’t know how to query them that well; we didn’t know how to get anything useful out of them. And then 30 years later, no one thinks about searching the internet as a thing we do. It’s like turning on the faucet. You just do it. It’s taken for granted. It’s part of our workflows. It’s part of our daily life. We do it without thinking about it. Right now, we’re back in those AltaVista/Lycos days, like, where, you know, it’s still esoteric. It’s still niche. We’re still not getting what we need out of these models. The models are going to change. People are going to get better at it. And part of what we’re arguing in our paper is that people can get better at it on their own. All they need is access and a few tries and they figure it out.

HUIZINGA: Right. You know what’s really funny is, I was trying to find some information about a paper on Sparks. That’s the Sparks paper. And I was doing some internet search, and I wasn’t getting what I wanted. And then I moved over to ChatGPT and put basically the same question, but it was a little more question-oriented instead of keywords, and it gave me everything I was looking for. And I thought, wow, that’s a huge leap from even … that I could use ChatGPT like a search engine only better. So … well, listen, anyone who’s ever listened to my podcast knows I’m borderline obsessed with thinking about unintended consequences of technical innovation, so I always ask what could possibly go wrong if you got everything right. But as I’ve said on this series before, one of the main mandates of AICE research is to identify unintended consequences and try to get ahead of them. So, David, rather than talking about the potential pitfalls of prompt engineering, instead talk about what we need to do to keep up with or keep ahead of the speeding train of generative AI. And by we, I mean you.

HOLTZ: Yeah, I mean, I think the thing to keep in mind—and I think this has come up a couple of times in this conversation already—is at least right now, and presumably for the foreseeable future, you know, generative AI is moving so fast and is also not a monolith, right. Like, I think we tend to talk about generative AI, but there’s different types of models, even within a particular class of models. There’s so many different models that are floating around out there. And so I think it’s important to just keep on sort of revisiting things that we think we already know, seeing if those things remain true. You know, I think from a research perspective, like, kind of, answering the same questions over and over with different models over time and seeing if the results stay the same. And I think that’s one of the big takeaways from, like, sort of, a policy or applications perspective from our research, as well, is that just generative AI is moving really quickly. These models are evolving, and the way that we interact with them, the way that we prompt them, needs to change. So if you think about it, you know, there are many tech companies, many startups, that are building products or building entire, you know, companies on, basically, on top of API calls to OpenAI or to Anthropic or something like that. And behind the scenes, those models are changing all the time, whether it’s, you know, sort of a publicly announced shift from GPT-3.5 to GPT-4 or whether it’s the fact that maybe, you know, GPT-4 is kind of being tweaked and adjusted, you know, every couple of weeks based on things that are happening internally at the company. And one of the takeaways from our research is that, you know, all those tweaks are actually pretty meaningful. The prompts that you wrote two weeks ago might not be as effective, you know, today if they aren’t as well suited to the newest, latest, greatest model.
And so I think just being really cognizant of that moving target, of the fact that we are living through, sort of, like, very exciting, unprecedented, crazy times and kind of just staying alert and staying on our toes is I think probably the most important thing.

HUIZINGA: Yeah. You know, when I was thinking about that question, I, my mind went to the Wallace & Gromit … I don’t know if you’re familiar with those animations, but there’s a scene where they’re on a toy train track chasing a criminal penguin, and they run out of track and then Gromit miraculously finds spare track. He starts laying it as the train is going. And it sort of feels like there’s a little bit of that in your research! [LAUGHS] I usually ask my guests on Collaborators where their research is on the spectrum from lab to life. But you’ve actually completed this particular study, and it leans more toward policy than product. And again, we’ve talked about a lot of this. Sometimes there seems to be a Venn diagram overlap with my questions. But, Sid, I want to know from your perspective, what would be a good outcome for this particular study, in your mind?

SURI: So AI systems are more and more being embedded in the workflows of companies and institutions. It used to just be all software, but now it’s specifically custom-built software, AI systems, and their prompts. I see it all the time here at Microsoft. It’s part of our workflows. It’s part of our products. It’s part of our day-to-day life. And as the models are getting better and better and these prompts are sort of embedded in our systems, someone’s got to pay attention to those prompts to make sure they’re still behaving the way we thought they were because they were written for an older version, the model changed, and now is that new model interpreting that prompt in the same way? That’s one question. The second question is, well, the new model has new capabilities, so now can you boost these prompts to take advantage of those new capabilities, to get the full economic gain, the full productivity gain of these new models? So you want to get your value for your money, so you need to adjust your prompts in response to those new models to get the full value. And part of the point of this paper is that that’s actually not that big a deal. That, as the models get better and better, even when people don’t know about it, they can still take advantage of the new affordances, the new capabilities, even when they aren’t made aware that, hey, it does a different thing right now.

HUIZINGA: Interesting.

SURI: But the point we’re making with this paper is, you have to pay attention to that.

HUIZINGA: OK, it’s last word time and I want to go a little off script with you two for this show. NVIDIA’s co-founder and CEO Jensen Huang recently said, and I paraphrase Willie Nelson here, “Mamas don’t let your babies grow up to be coders.” In essence, he’s predicting that AI is going to do that for us in the future and people would be better served pursuing different educational priorities. So that’s a bold claim. Do you guys want to make a bold claim? Here’s your chance to make a pithy prediction from your perch in research. What’s something you think will be true some years out? You don’t have to say how many years, but that you might have been reluctant to say out loud for fear that it wouldn’t age well. Remember, this is a podcast, not a paper, so no one’s going to hold you to your word, but you might end up being prophetic. Who knows? David, you go first, and then Sid can close the show. Tell us what’s going to happen in the future.

HOLTZ: I’m not sure how bold of a prediction this is, but I think there’s a lot of concern right now about the impact that AI will have in various creative domains, right. As generative AI gets better and AI can produce images and music and videos, you know, what will happen to all of the people that have been making a living creating this type of content? And my belief is that, if anything, as we just get flooded with more and more AI-generated content, people are going to place a really heavy premium on content that is produced by humans. Like, I think so much of what people value about art and creative output is the sort of human connection and the idea that something sort of emerged from someone’s lived experiences and hardships. I mean, this is why people really like reading, you know, the curator’s notes when they go to a museum, so that they can kind of understand what’s behind, you know, behind the image. And so I think generative AI is going to be really amazing in a lot of ways, and I think it will have really big impacts that we’ll need to deal with as a society in terms of how it affects work and things like that. But I don’t think that we’re moving towards a future where, you know, we’re all just consuming AI-generated, you know, art all the time and we don’t care at all about things being made by people.

HUIZINGA: You know, there’s a podcast called Acquired, and they talked about the brand Hermès, which is the French luxury leather company, saying that to get a particular kind of bag that’s completely handmade—it’s an artifact from a human—that’s why you pay tens of thousands of dollars for those instead of a bag that comes off a factory line. So I like that. Sid, what do you think?

SURI: So I’m going to make two points. David made the argument about AI affecting the creative space. I want to zoom in on the knowledge workspace.

HUIZINGA: Hmm …

SURI: And one of the big issues in knowledge work today is it’s incredibly difficult still to get insights out of data. To give you an example, in the remote work study that David and I did, it took a handful of PhDs, tons of data, two years, sophisticated statistical techniques to make sense of what is the effect of remote work on information workers, OK? And I feel, where I see knowledge work going is there’s going to be this great democratization on how to get insights out of data. These models are very good at classifying things, summarizing things, categorizing things. Massive amounts of data. In the old days, you had to like basically be an advanced statistician, be an advanced machine learning person, train one of these models. They’re very esoteric. They’re very arcane. They’re very hard to use. And then unleash it on your data. Now if you just know how to prompt a little bit, you can get these same insights as a professional statistician would a few years ago in a much, much shorter time, you know, one-tenth of the time. So I feel like there’s going to be this great democratization of getting insights out of data in the knowledge workspace. That’s prediction number one. And then the second point I wanted to make, and I want to give a little credit to some of the academics who’ve inspired this notion, which is Erik Brynjolfsson and David Autor, and that is this: I think a lot of people are looking for the impact of AI in kind of the wrong way. Rewind in your mind back to the time when, like, the internal combustion engine was invented. OK, so we used to get around with horses; now we have cars. OK, horses went 20 miles an hour; cars go 40 miles an hour. OK, big deal. What no one foresaw was there’s going to be an entire aviation industry that’s going to make it possible to do things we couldn’t do before, speed up the economy, speed up everything, add trillions of dollars of value to the world. 
And I feel like right now everyone’s focusing on AI to do things we already know how to do. And I don’t think that’s the most interesting use case. Let’s instead turn our attention to, what could we not do before that we can do now?

HUIZINGA: Right.

SURI: And that’s where the really exciting stuff is. So those are the two points I’d like to leave you.

HUIZINGA: I love it. I hope you’re not saying that I could rewind my mind to when the internal combustion engine was developed …

SURI: No, no. Present company excluded! [LAUGHTER]

HUIZINGA: Oh my gosh. Sid Suri, David Holtz, this has been fantastic. I can’t get the phrase “AI whisperer” out of my head now, [LAUGHTER] and I think that’s what I want to be when I grow up. So thanks for coming on the show to share your insights on the topic and help to illuminate the path. This is awesome.

SURI: Thank you.

HOLTZ: Well, thank you.

SURI: That was fun.

[MUSIC FADES]

The post Collaborators: Prompt engineering with Siddharth Suri and David Holtz appeared first on Microsoft Research.


From static prediction to dynamic characterization: AI2BMD advances protein dynamics with ab initio accuracy


AI2BMD blog hero - illustration of a chip with network nodes extending from all sides

The essence of the biological world lies in the ever-changing nature of its molecules and their interactions. Understanding the dynamics and interactions of biomolecules is crucial for deciphering the mechanisms behind biological processes and for developing biomaterials and drugs. As Richard Feynman famously said, “Everything that living things do can be understood in terms of the jigglings and wigglings of atoms.” Yet capturing these real-life movements is nearly impossible through experiments. 

In recent years, with the development of deep learning methods represented by AlphaFold and RoseTTAFold, predicting the static crystal protein structures has been achieved with experimental accuracy (as recognized by the 2024 Nobel Prize in Chemistry). However, accurately characterizing dynamics at an atomic resolution remains much more challenging, especially when the proteins play their roles and interact with other biomolecules or drug molecules.

As one approach, Molecular Dynamics (MD) simulation combines the laws of physics with numerical simulations to tackle the challenge of understanding biomolecular dynamics. This method has been widely used for decades to explore the relationship between the movements of molecules and their biological functions. In fact, the significance of MD simulations was underscored when the classic version of this technique was recognized with a Nobel Prize in 2013, highlighting its crucial role in advancing our understanding of complex biological systems. Similarly, the quantum mechanical approach—known as Density Functional Theory (DFT)—received its own Nobel Prize in 1998, marking a pivotal moment in computational chemistry.

In MD simulations, molecules are modeled at the atomic level by numerically solving equations of motion that account for the system’s time evolution, through which kinetic and thermodynamic properties can be computed. MD simulations are used to model the time-dependent motions of biomolecules. If you think of proteins like intricate gears in a clock, AI2BMD doesn’t just capture them in place—it watches them spin, revealing how their movements drive the complex processes that keep life running.
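Numerically solving equations of motion can be illustrated with a toy integrator. The sketch below is illustrative only, not AI2BMD code: it advances a single particle in a one-dimensional harmonic well using the velocity Verlet scheme commonly used in MD.

```python
# Illustrative velocity Verlet integration for one particle in a
# harmonic well U(x) = 0.5 * k * x^2 (toy example, not AI2BMD code).
def force(x, k=1.0):
    return -k * x  # F = -dU/dx

def velocity_verlet(x, v, dt=0.01, steps=1000, mass=1.0):
    """Advance position and velocity step by step; return the trajectory."""
    traj = [x]
    f = force(x)
    for _ in range(steps):
        # Update position using current velocity and force.
        x = x + v * dt + 0.5 * (f / mass) * dt * dt
        # Recompute the force at the new position, then update velocity
        # using the average of the old and new forces.
        f_new = force(x)
        v = v + 0.5 * (f + f_new) / mass * dt
        f = f_new
        traj.append(x)
    return traj

traj = velocity_verlet(x=1.0, v=0.0)
```

Real MD codes do the same bookkeeping for every atom in three dimensions, with far more elaborate force models; the point here is only that positions and velocities are propagated numerically, step by step.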

MD simulations can be roughly divided into two classes: classical MD and quantum mechanics. Classical MD employs simplified representations of molecular systems, achieving fast simulation speeds for long-timescale conformational changes, but with limited accuracy. In contrast, quantum mechanical models, such as Density Functional Theory, provide first-principles calculations but are computationally prohibitive for large biomolecules.

Ab initio biomolecular dynamics simulation by AI 

Microsoft Research has been working on the development of efficient methods aiming for ab initio-accuracy simulations of biomolecules. This method, AI2BMD (AI-based ab initio biomolecular dynamics system), has been published in the journal Nature, representing the culmination of a four-year research endeavor.

AI2BMD efficiently simulates a wide range of proteins in all-atom resolution with more than 10,000 atoms at an approximate ab initio—or first-principles—accuracy. It thus strikes a tradeoff previously inaccessible to standard simulation techniques: it achieves higher accuracy than classical simulation, at a computational cost that, while higher than classical simulation, is orders of magnitude lower than DFT. This development could unlock new capabilities in biomolecular modeling, especially for processes where high accuracy is needed, such as protein-drug interactions.

Fig.1 The overall pipeline of AI2BMD. Proteins are divided into protein units by a fragmentation process. The AI2BMD potential is designed based on ViSNet, and the datasets are generated at the DFT level. It calculates the energy and atomic forces for the whole protein. The AI2BMD simulation system is built upon these components and provides a generalizable solution for simulating the molecular dynamics of proteins. It achieves ab initio accuracy in energy and force calculations. Through comprehensive analysis from both kinetics and thermodynamics perspectives, AI2BMD exhibits good alignment with wet-lab experimental data and detects different phenomena compared to molecular mechanics.
Figure 1. The flowchart of AI2BMD

AI2BMD employs a newly designed, generalizable protein fragmentation approach that splits proteins into overlapping units, creating a dataset of 20 million snapshots—the largest ever at the DFT level. Based on our previously designed ViSNet, a universal molecular geometry modeling foundation model published in Nature Communications and incorporated into the PyTorch Geometric library, we trained AI2BMD’s potential energy function using machine learning. Simulations are then performed by the highly efficient AI2BMD simulation system, where at each step the ViSNet-based AI2BMD potential calculates the energy and atomic forces for the protein with ab initio accuracy. Through comprehensive analysis from both kinetics and thermodynamics perspectives, AI2BMD exhibits much better alignment with wet-lab data, such as the folding free energy of proteins, and detects different phenomena than classical MD.
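The overlapping-fragmentation idea can be sketched in a few lines. The unit size and overlap below are arbitrary illustrative values, not the ones AI2BMD uses:

```python
def fragment(residues, unit_size=5, overlap=2):
    """Split a sequence of residues into overlapping units.

    Consecutive units share `overlap` residues, so interactions near a
    fragment boundary are still represented inside some unit. The
    parameter values here are illustrative, not AI2BMD's.
    """
    step = unit_size - overlap
    units = []
    for start in range(0, len(residues), step):
        units.append(residues[start:start + unit_size])
        if start + unit_size >= len(residues):
            break  # the final unit already reaches the chain's end
    return units

protein = list(range(12))  # stand-in for a 12-residue chain
units = fragment(protein)
# Every residue lands in at least one unit, and adjacent units overlap.
```

Fragmenting this way is what makes DFT-level labeling tractable: each small unit can be computed independently and the results reassembled for the whole protein.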



Advancing biomolecular MD simulation

AI2BMD represents a significant advancement in the field of MD simulations in the following respects:

(1) Ab initio accuracy: AI2BMD introduces a generalizable “machine learning force field,” a machine-learned model of the interactions between atoms and molecules, for full-atom protein dynamics simulations with ab initio accuracy.
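In general, a machine learning force field maps atomic coordinates to a potential energy, and forces follow as the negative gradient of that energy. Below is a minimal sketch of that interface, with a hand-written toy spring energy standing in for a trained model such as ViSNet (the toy energy and all parameter values are assumptions for illustration):

```python
import math

def model_energy(coords, k=1.0, r0=1.0):
    """Toy potential: harmonic springs between consecutive atoms.
    In a real ML force field, a trained network replaces this function."""
    e = 0.0
    for a, b in zip(coords, coords[1:]):
        r = math.dist(a, b)
        e += 0.5 * k * (r - r0) ** 2
    return e

def forces(coords, eps=1e-6):
    """Forces as the negative gradient of the energy, here estimated by
    central finite differences (a trained model would use autograd)."""
    out = []
    for i in range(len(coords)):
        f_atom = []
        for d in range(3):  # x, y, z
            plus = [list(a) for a in coords]
            minus = [list(a) for a in coords]
            plus[i][d] += eps
            minus[i][d] -= eps
            de = model_energy(plus) - model_energy(minus)
            f_atom.append(-de / (2 * eps))
        out.append(tuple(f_atom))
    return out

coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
f = forces(coords)  # each atom is pulled toward the rest length r0 = 1.0
```

The energy-in, forces-out contract is what lets such a model drop into an MD loop in place of a classical force field or a DFT calculation.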

Fig.2 Evaluation of energy and force calculations by AI2BMD and molecular mechanics (MM). The upper panel exhibits the folded structures of four evaluated proteins. The lower panel exhibits the mean absolute error (MAE) of potential energy.
Figure 2. Evaluation on the energy calculation error between AI2BMD and Molecular Mechanics (MM) for different proteins. 

(2) Addressing generalization: It is the first to address the generalization challenge of a machine learned force field for simulating protein dynamics, demonstrating robust ab initio MD simulations for a variety of proteins. 

(3) General compatibility: AI2BMD expands Quantum Mechanics (QM) modeling from small, localized regions to entire proteins without requiring any prior knowledge of the protein. This eliminates the potential incompatibility between QM and MM calculations for proteins and accelerates QM region calculation by several orders of magnitude, bringing near ab initio calculation for full-atom proteins to reality. Consequently, AI2BMD paves the way for numerous downstream applications and allows for a fresh perspective on characterizing complex biomolecular dynamics.

(4) Speed advantage: AI2BMD is several orders of magnitude faster than DFT and other quantum mechanics methods. It supports ab initio calculations for proteins with more than 10,000 atoms, making it one of the fastest AI-driven MD simulation programs across disciplines.

Fig.3 Comparison of time consumption between AI2BMD, DFT and other AI driven simulation software. The left panel shows the time consumption of AI2BMD and DFT. The right panel shows the time consumption of AI2BMD, DPMD and Allegro.
Figure 3. Comparison of time consumption between AI2BMD, DFT and other AI driven simulation software. 

(5) Diverse conformational space exploration: In protein folding and unfolding simulations, AI2BMD explores conformational space that MM cannot detect. AI2BMD therefore opens more opportunities to study flexible protein motions during drug-target binding, enzyme catalysis, allosteric regulation, intrinsically disordered proteins, and so on, better aligning with wet-lab experiments and providing more comprehensive explanations and guidance for biomechanism detection and drug discovery.

Fig.4 Analysis of the simulation trajectories performed by AI2BMD. In the upper panel, AI2BMD folds protein of Chignolin starting from an unfolded structure and achieves smaller energy error than MM. In the lower panel, it explores more conformational regions that MM cannot detect.
Figure 4. AI2BMD folds protein of Chignolin starting from an unfolded structure, achieves smaller energy error than MM and explores more conformational regions that MM cannot detect. 

(6) Experimental agreement: AI2BMD outperforms the QM/MM hybrid approach and demonstrates high consistency with wet-lab experiments on different biological application scenarios, including J-coupling, enthalpy, heat capacity, folding free energy, melting temperature, and pKa calculations.

Looking ahead

Achieving ab initio accuracy in biomolecular simulations is challenging but holds great potential for understanding the mystery of biological systems and designing new biomaterials and drugs. This breakthrough is a testament to the vision of AI for Science—an initiative to channel the capabilities of artificial intelligence to revolutionize scientific inquiry. The proposed framework aims to address limitations regarding accuracy, robustness, and generalization in the application of machine learning force fields. AI2BMD provides generalizability, adaptability, and versatility in simulating various protein systems by considering the fundamental structure of proteins, namely stretches of amino acids. This approach enhances energy and force calculations as well as the estimation of kinetic and thermodynamic properties. 

One key application of AI2BMD is its ability to perform highly accurate virtual screening for drug discovery. In 2023, at the inaugural Global AI Drug Development competition, AI2BMD made a breakthrough by predicting a chemical compound that binds to the main protease of SARS-CoV-2. Its precise predictions surpassed those of all other competitors, securing first place and showcasing its immense potential to accelerate real-world drug discovery efforts.

Since 2022, Microsoft Research has also partnered with the Global Health Drug Discovery Institute (GHDDI), a nonprofit research institute founded and supported by the Gates Foundation, to apply AI technology to the design of drugs that treat diseases that disproportionately affect low- and middle-income countries (LMICs), such as tuberculosis and malaria. We are now closely collaborating with GHDDI to leverage AI2BMD and other AI capabilities to accelerate the drug discovery process.

AI2BMD can help advance solutions to scientific problems and enable new biomedical research in drug discovery, protein design, and enzyme engineering.  

The post From static prediction to dynamic characterization: AI2BMD advances protein dynamics with ab initio accuracy appeared first on Microsoft Research.


Abstracts: November 5, 2024


Outlined illustrations of Chris Hawblitzel and Jay Lorch for the Microsoft Research Podcast, Abstracts series.

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Microsoft senior principal researchers Chris Hawblitzel and Jay Lorch join host Amber Tingle to discuss “Verus: A Practical Foundation for Systems Verification,” which received the Distinguished Artifact Award at this year’s Symposium on Operating Systems Principles, or SOSP. In their research, Hawblitzel, Lorch, and their coauthors leverage advances in programming languages and formal verification with two aims. The first aim is to help make software verification more accessible for systems developers so they can demonstrate their code will behave as intended. The second aim is to provide the research community with sound groundwork to tackle the application of formal verification to large, complex systems. 

Transcript 

[MUSIC] 

AMBER TINGLE: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Amber Tingle. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers. 

[MUSIC FADES] 

Our guests today are Chris Hawblitzel and Jay Lorch. They are both senior principal researchers at Microsoft and two of the coauthors on a paper called “Verus: A Practical Foundation for Systems Verification.” This work received the Distinguished Artifact Award at the 30th Symposium on Operating Systems Principles, also known as SOSP, which is happening right now in Austin, Texas. Chris and Jay, thank you for joining us today for Abstracts and congratulations!

JAY LORCH: Thank you for having us. 

CHRIS HAWBLITZEL: Glad to be here. 

TINGLE: Chris, let’s start with an overview. What problem does this research address, and why is Verus something that the broader research community should know about? 


HAWBLITZEL: So what we’re trying to address is a very simple problem where we’re trying to help developers write software that doesn’t have bugs in it. And we’re trying to provide a tool with Verus that will help developers show that their code actually behaves the way it’s supposed to; it obeys some sort of specification for what the program is supposed to do. 

TINGLE: How does this publication build on or differ from other research in this field, including your previous Verus-related work? 

HAWBLITZEL: So formal verification is a process where you write down what it is that you want your program to do in mathematical terms. So if you’re writing an algorithm to sort a list, for example, you might say that the output of this algorithm should be a new list that is a rearrangement of the elements of the old list, but now this rearrangement should be in sorted order. So you can write that down using standard mathematics. And now given that mathematical specification, the challenge is to prove that your piece of software written in a particular language, like Java or C# or Rust, actually generates an output that meets that mathematical specification. So this idea of using verification to prove that your software obeys some sort of specification, this has been around for a long time, so, you know, even Alan Turing talked about ways of doing this many, many decades ago. The challenge has always been that it’s really hard to develop these proofs for any large piece of software. It simply takes a long time for a human being to write down a proof of correctness of their software. And so what we’re trying to do is to build on earlier work in verification and recent developments in programming languages to try to make this as easy as possible and to try to make it as accessible to ordinary software developers as possible. So we’ve been using existing tools. There are automated theorem provers—one of them from Microsoft Research called Z3—where you give it a mathematical formula and ask it to prove that the formula is valid. We’re building on that. And we’re also taking a lot of inspiration from tools developed at Microsoft Research and elsewhere, like Dafny and F* and so on, that we’ve used in the past for our previous verification projects. And we’re trying to take ideas from those and make them accessible to developers who are using common programming languages. In this case, the Rust programming language is what we’re focusing on. 
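To make the sorting example concrete, here is a minimal plain-Rust sketch of the specification Chris describes. This is ordinary runtime checking for illustration only, not actual Verus syntax, and the function names are invented for the example; Verus proves properties like these statically, for all possible inputs, rather than checking one input at runtime.

```rust
// The specification for a sorting algorithm, as described above:
// the output must be (1) in sorted order and (2) a rearrangement
// (permutation) of the elements of the input.

fn is_sorted(v: &[i64]) -> bool {
    // Every adjacent pair must be in non-decreasing order.
    v.windows(2).all(|w| w[0] <= w[1])
}

fn is_permutation_of(a: &[i64], b: &[i64]) -> bool {
    // Compare the two slices as multisets by sorting copies of both.
    let mut a2 = a.to_vec();
    let mut b2 = b.to_vec();
    a2.sort();
    b2.sort();
    a2 == b2
}

fn meets_sort_spec(input: &[i64], output: &[i64]) -> bool {
    is_sorted(output) && is_permutation_of(input, output)
}
```

A verified sort comes with a proof that this specification holds for every possible input, which is exactly what runtime checks like these cannot guarantee.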

TINGLE: Jay, could you describe your methodology for us and maybe share a bit about how you and your coauthors tested the robustness of Verus.

LORCH: So the question we really want to answer is, is Verus suitable for systems programming? That means a variety of things. Is it amenable to a variety of kinds of software that you want to build as part of a system? Is it usable by developers? Can they produce compact proofs? And can they get timely feedback about those proofs? Can the verifier tell you quickly that your proof is correct or, if it’s wrong, that it’s wrong and guide you to fix it? So the two main methodological techniques we used were millibenchmarks and full systems. The millibenchmarks are small pieces of programs that have been verified by other tools in the past, and we built them in Verus and compared against what other tools would do to find whether we could improve usability. And we found generally that we could verify the same things but with more compact proofs and proofs that would give much snappier feedback. The difference between one second and 10 seconds might not seem like a lot, but when you’re writing code and working with the verifier, it’s much nicer to get immediate feedback about what is wrong with your proof so you can say, oh, what about this? And it can say, oh, well, I still see a problem there. And you can say, OK, let me fix that. As opposed to waiting 10, 20 seconds between each such query to the verifier. So the millibenchmarks helped us evaluate that. And for the macrobenchmarks, building entire systems, we built a couple of distributed systems that had been verified before—a key-value store and a node replication system—to show that you could do them more effectively and with less verification time. We also built some new systems: a verified OS page table, a memory allocator, and a persistent memory append-only log.

TINGLE: Chris, the paper mentions that successfully verifying system software has required—you actually use the word heroic to describe the developer effort. Thinking of those heroes in the developer community and perhaps others, what real-world impact do you expect Verus to have? What kind of gains are we talking about here? 

HAWBLITZEL: Yeah, so I think, you know, traditionally verification or this formal software verification that we’re doing has been considered a little bit of a pie-in-the-sky research agenda. Something that people have applied to small research problems but has not necessarily had a real-world impact before. And so I think it’s just, you know, recently, in the last 10 or 15 years, that we started to see a change in this and started to see verified software actually deployed in practice. So on one of our previous projects, we worked on verifying the cryptographic primitives that people use when, say, they browse the web or something and their data is encrypted. So in these cryptographic primitives, there’s a very clear specification for exactly what bytes you’re supposed to produce when you encrypt some data. And the challenge is just writing software that actually performs those operations and does so efficiently. So in one of our previous projects that we worked on called HACL* and EverCrypt, we verified some of the most commonly used and efficient cryptographic primitives for things like encryption and hashing and so on. And these are things that are actually used on a day-to-day basis. So we, kind of, took from that experience that the tools that we’re building are getting ready for prime time here. We can actually verify software that is security critical, reliability critical, and is in use. So some of the things that Jay just mentioned, like verifying, you know, persistent memory storage systems and so on, those are the things that we’re looking at next for software that would really benefit from reliability and where we can formally prove that your data that’s written to disk is read correctly back from disk and not lost during a crash, for example. So that’s the kind of software that we’re looking to verify to try to have a real-world impact. 

LORCH: The way I see the real-world impact is that it’s going to enable Microsoft to deal with a couple of challenges that are severe and increasing in scale. The first challenge is attackers, and the second challenge is the vast scale at which we operate. There are a lot of hackers out there with a lot of resources who are trying to get through our defenses, and every bug that we have offers them purchase, and techniques like this, which can get rid of bugs, allow us to deal with that increasing attacker capability. The other challenge we have is scale. We have billions of customers. We have vast amounts of data and compute power. And when you have a bug in code that you’ve thoroughly tested but then run on millions of computers over decades, those rare bugs eventually crop up. So they become a problem, and traditional testing has a lot of difficulty finding them. And this technology, which enables us to reason about the infinite possibilities in a finite amount of time and observe all possible ways that the system can go wrong and make sure that it can deal with them, enables us to deal with the vast scale that Microsoft operates at today.

HAWBLITZEL: Yeah, and I think this is an important point that differentiates us from testing. Traditionally, you find a bug when you see that bug happen in running software. With formal verification, we’re catching the bugs before you run the software at all. We’re trying to prove that on all possible inputs, on all possible executions of the software, these bugs will not happen, and it’s much cheaper to fix bugs before you’ve deployed the software that has bugs, before attackers have tried to exploit those bugs. 

TINGLE: So, Jay, ideally, what would you like our listeners and your fellow SOSP conference attendees to tell their colleagues about Verus? What’s the key takeaway here? 

LORCH: I think the key takeaway is that it is possible now to build software without bugs, to build systems code that is going to obey its specification on all possible inputs, always. We have that technology. And this is possible now because a lot of technology has advanced to the point where we can use it. For one thing, there are advances in programming languages. People are moving from C to Rust. They’ve discovered that you can get the high performance that you want for systems code without having to sacrifice the ability to reason about ownership, lifetimes, and concurrency. The other thing that we build on is advances in computer-aided theorem proving. So we can really make compact and quick-to-verify mathematical descriptions of all possible behaviors of a program and get fast answers that allow us to rapidly turn around proof challenges from developers.

TINGLE: Well, finally, Chris, what are some of the open questions or future opportunities for formal software verification research, and what might you and your collaborators tackle next? I heard a few of the things earlier. 

HAWBLITZEL: Yes, I think despite, you know, the effort that we and many other researchers have put into trying to make these tools more accessible, trying to make them easier to use, there still is a lot of work to prove a piece of software correct, even with advanced state-of-the-art tools. And so we’re still going to keep trying to push to make that easier, trying to figure out how to automate the process better. There’s a lot of interest right now in artificial intelligence for trying to help with this, especially if you think about artificial intelligence actually writing software. You ask it to write a piece of software to do a particular task, and it generates some C code or some Rust code or some Java code, and then you hope that that’s correct, because it could have generated any sort of code, something that does the right thing or total nonsense. So it would be really great going forward if, when we ask AI to develop software, we also expect it to create a proof that the software is correct and does what the user asked for. We’ve started working on some projects, and we found that the AI is not quite there yet for realistic code. It can do small examples this way. But I think this is still a very large challenge going forward that could have a large payoff in the future if we can get AI to develop software and prove that the software is correct.

LORCH: Yeah, I see there’s a lot of synergy between—potential synergy—between AI and verification. Artificial intelligence can solve one of the key challenges of verification, namely making it easy for developers to write that code. And verification can solve one of the key challenges of AI, which is hallucinations, synthesizing code that is not correct, and Verus can verify that that code actually is correct. 

TINGLE: Well, Chris Hawblitzel and Jay Lorch, thank you so much for joining us today on the Microsoft Research Podcast to discuss your work on Verus. 

[MUSIC] 

HAWBLITZEL: Thanks for having us. 

LORCH: Thank you. 

TINGLE: And to our listeners, we appreciate you, too. If you’d like to learn more about Verus, you’ll find a link to the paper at aka.ms/abstracts or you can read it on the SOSP website. Thanks for tuning in. I’m Amber Tingle, and we hope you’ll join us again for Abstracts.

[MUSIC FADES] 

The post Abstracts: November 5, 2024 appeared first on Microsoft Research.


Abstracts: November 4, 2024



Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Senior Principal Research Manager Shan Lu and Bogdan Stoica, a PhD candidate at the University of Chicago, join host Gretchen Huizinga to discuss “If At First You Don’t Succeed, Try, Try, Again … ? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems.” In the paper, which was accepted at this year’s Symposium on Operating Systems Principles, or SOSP, Lu, Stoica, and their coauthors examine typical retry issues and present techniques that leverage traditional program analysis and large language models to help detect them.

Transcript

[MUSIC]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Today I’m talking to Dr. Shan Lu, a senior principal research manager at Microsoft Research, and Bogdan Stoica, also known as Bo, a doctoral candidate in computer science at the University of Chicago. Shan and Bogdan are coauthors of a paper called “If at First You Don’t Succeed, Try, Try, Again …? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems.” And this paper was presented at this year’s Symposium on Operating Systems Principles, or SOSP. Shan and Bo, thanks for joining us on Abstracts today!


SHAN LU: Thank you.

BOGDAN STOICA: Thanks for having us.

HUIZINGA: Shan, let’s kick things off with you. Give us a brief overview of your paper. What problem or issue does it address, and why should we care about it?

LU: Yeah, so basically from the title, we are looking at retry bugs in software systems. People may not realize that for big software like the systems that run at Microsoft, all kinds of unexpected failures—software failures, hardware failures—may happen. So just to make our software systems robust, there’s often a retry mechanism built in: if something unexpected happens, a task, a request, or a job will be re-executed. And what this paper shows is that it’s actually very difficult to implement this retry mechanism correctly. So in this paper, we do a study to understand what typical retry problems look like, and we offer a solution for detecting these problems.

HUIZINGA: Bo, this clearly isn’t a new problem. What research does your paper build on, and how does your research challenge or add to it?

STOICA: Right, so retry is a well-known mechanism and is widely used. And retry bugs, in particular, have been identified in other papers as root causes for all sorts of failures but have never been studied as a standalone class of bugs. What I mean by that is, nobody looked into why it is so difficult to implement retry. What are the symptoms that occur when you don’t implement retry correctly? What are the reasons developers struggle to implement retry correctly? We built on a few key bug-finding ideas that have been explored by other papers but never in this context. We use fault injection. We repurpose existing unit tests to trigger this type of bug, as opposed to asking developers to write specialized tests to trigger retry bugs. So we’re, kind of, making the developer’s job easier in a sense. And in this pipeline, we also rely on large language models to augment the program and code analysis that goes behind the fault injection and the reuse of existing tests.
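The fault-injection idea Bo describes can be sketched as follows. This is a hypothetical illustration, not the paper's actual tool, and the names (`FaultInjector`, `run`, the fault schedule) are invented for the example: a schedule decides which calls to an I/O-like operation fail with a simulated transient error, so an existing unit test can exercise error-handling and retry paths without a real outage.

```rust
// A schedule-driven fault injector: calls whose index appears in
// `fail_on` return a simulated transient error instead of running
// the real operation.
struct FaultInjector {
    fail_on: Vec<usize>, // 0-based call indices that should fail
    calls: usize,        // how many calls have been made so far
}

impl FaultInjector {
    fn new(fail_on: Vec<usize>) -> Self {
        FaultInjector { fail_on, calls: 0 }
    }

    // Wraps an operation: either inject a simulated transient error
    // or let the real operation run and return its result.
    fn run<T>(&mut self, op: impl FnOnce() -> T) -> Result<T, &'static str> {
        let idx = self.calls;
        self.calls += 1;
        if self.fail_on.contains(&idx) {
            Err("injected transient failure")
        } else {
            Ok(op())
        }
    }
}
```

Replaying the same unit test under different `fail_on` schedules then exercises the error paths that normal testing rarely reaches.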

HUIZINGA: Have large language models not been utilized much in this arena?

LU: I want to say that, you know, this work was actually started about two years ago. And at that time, large language models were really in their infancy, and people had just started exploring how large language models could help us in terms of improving software reliability. And our group, together with, you know, actually the same set of authors from Microsoft Research, did some of the first things in a workshop paper just to see whether the kinds of things we were able to do before, like, you know, finding bugs, could now be replicated using a large language model.

HUIZINGA: OK …

LU: But at that time, we were not very happy because, you know, just using a large language model to do something people were already able to do using traditional program analysis, I mean, it seems cool, right, but it does not add new functionality. So I would say what is new, at least from when we started this project, is we were really thinking, hey, is there anything, right, is there some program analysis, some bug finding, that we were not able to do using traditional program analysis but that can actually be enabled by a large language model?

HUIZINGA: Gotcha …

LU: And so that was, you know, what I feel was novel, at least, you know, when we worked on this. But of course, you know, large language models are a field that is moving so fast. People are, you know, finding new ways to use them every day. So yeah.

HUIZINGA: Right. Well, in your paper, you say that retry functionality is commonly undertested and thus prone to problems slipping into production. Why would it be undertested if it’s such a problem?

STOICA: So testing retry is difficult because what you need is to simulate the systemwide conditions that lead to retry. That often means simulating external transient errors that might happen on the system that runs your application. And to do this during testing and capture this in a small unit test is difficult.

LU: I think, actually, Bogdan said this very well. It’s like, why do we need a retry? It’s, like, when unexpected failures happen, right. And this is, like, something like Bogdan mentioned, an external transient error, such as my network card suddenly not working, right. And this may occur, you know, only for, say, one second, and then it goes back on. But this one second may cause some job to fail and need a retry. So during normal testing, these kinds of unexpected things rarely, rarely happen, if at all, and they’re also difficult to simulate. That’s why retry is just not well tested.
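The retry mechanism the guests describe can be sketched as a small wrapper. This is a minimal, hypothetical illustration, not code from any of the studied systems, and it omits backoff, which real implementations need; the pitfalls the paper studies, such as unbounded retries or re-executing operations that are not safe to repeat, all live in loops like this one.

```rust
// Re-execute an operation that may hit transient failures, giving up
// after a bounded number of attempts. Real implementations would also
// back off between attempts and check that the error is actually
// transient before retrying.
fn retry<T, E>(max_attempts: u32, mut op: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        attempt += 1;
        match op() {
            Ok(v) => return Ok(v),
            // Out of attempts: propagate the last error instead of
            // retrying forever.
            Err(e) if attempt >= max_attempts => return Err(e),
            // Assume the failure was transient and try again.
            Err(_) => {}
        }
    }
}
```

For example, an operation that fails twice with a transient error and then succeeds will return the successful result when wrapped in `retry(5, ...)`, but fail if the attempt budget is too small.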

HUIZINGA: Well, Shan, let’s talk about methodology. Talk a bit about how you tackled this work and why you chose the approach you did for this particular problem.

LU: Yeah, so I think this work includes two parts. One is a systematic study. We studied several big open-source systems to see whether there are retry-related problems in these real systems. Of course there are. And then we did a very systematic categorization to understand the common characteristics. And the second part is about, you know, detection. And in terms of method, particularly in the detection part, we actually used a hybrid of techniques: traditional static program analysis and large language model–enabled program analysis. In this case, imagine we just asked a large language model, hey, tell us, is there any retry implemented in this code? If there is, where is it, right? And then we also, as Bogdan mentioned, repurposed unit tests to help us execute, you know, the parts of the code where the large language model tells us there may be a retry. And in addition to that, we also used fault injection, which means we simulate those transient, external, environmental failures, such as network failures, that very rarely occur on their own.

HUIZINGA: Well, Bo, I love the part in every paper where the researchers say, “And what we found was …” So tell us, what did you find?

STOICA: Well, we found that implementing retry is difficult and complex! We wanted not only to find new bugs, because, yes, that was kind of the end goal of the paper, but also to understand why these bugs are happening. As Shan mentioned, we started this project with a bug study. We looked at retry bugs across eight to 10 applications that are widely popular, widely used, and that the community is actively contributing to. And the experience of both users and developers, if we can condense that—what do you think about retries?—is that, yeah, they’re frustrated, because it’s a simple mechanism, but there are so many pitfalls that you have to be aware of. So I think that’s the biggest takeaway. Another takeaway is that when I was thinking about bug-finding tools, I was having this somewhat myopic view of, you know, you instrument at the program statement level, you figure out relationships between different lines of code and anti-patterns, and then you build your tools to find those anti-patterns. Well, with retry, this kind of gets thrown out the window, because retry is a mechanism. It’s not just one line of code. It is multiple lines of code that span multiple functions, multiple methods, and multiple files. And you need to think about retry holistically to find these issues. And that’s one of the reasons we used large language models, because traditional static analysis or traditional program analysis cannot capture this. And, you know, large language models turn out to be actually great at this task, and we try to harness the, I would say, fuzzy code comprehension capabilities of large language models to help us find retry bugs.

HUIZINGA: Well, Shan, research findings are important, but real-world impact is the ultimate goal here. So who will this research help most and why?

LU: Yeah, that’s a great question. I would consider several groups of people. One is, hopefully, you know, people who actually build and design real systems will find our study interesting. I hope it will resonate with them about those difficulties in implementing retry, because we studied a set of systems, and there was a little bit of comparison about how different retry mechanisms are actually used in different systems. And you can actually see that, you know, these different mechanisms, you know, have pros and cons, and we have a little bit of, you know, suggestion about what might be good practice. That’s the first group. The second group is, our tool actually did find, I would say, a relatively large number of retry problems in the latest version of every system we tried, and we found these problems, right, by repurposing existing unit tests. So I hope our tool will be used, you know, in the field by, you know, maybe being integrated with future unit testing so that our future systems will become more robust. And I guess the third type of, you know, audience I feel may benefit from reading our work, knowing our work: the people who are thinking about how to use large language models. And as I mentioned, I think a takeaway is that a large language model can replace some of the things we were able to do using traditional program analysis, and it can do more, right, for those fuzzy code comprehension–related things. Because for traditional program analysis, we need to precisely describe what we want. Like, oh, I need a loop. I need a WRITE statement, right. A large language model is imprecise by nature, and that imprecision sometimes actually matches the type of things we’re looking for.

HUIZINGA: Interesting. Well, both of you have just, sort of, addressed nuggets of this research. And so the question that I normally ask now is, if there’s one thing you want our listeners to take away from the work, what would it be? So let’s give it a try and say, OK, in a sentence or less, if I’m reading this paper and it matters to me, what’s my big takeaway? What is my big “aha” that this research helps me with?

STOICA: So the biggest takeaway of this paper is not to be afraid to integrate large language models in your bug-finding or testing pipelines. And I’m saying this knowing full well how imprecise large language models can be. But as long as you can trust but verify, as long as you have a way of checking what these models are outputting, you can effectively insert them into your testing framework. And I think this paper is showing one use case and bringing us closer to, you know, having it integrated more ubiquitously.

HUIZINGA: Well, Shan, let’s finish up with ongoing research challenges and open questions in this field. I think you’ve both alluded to the difficulties that you face. Tell us what’s up next on your research agenda in this field.

LU: Yeah, so for me, personally, I mean, I learned a lot from this project, and particularly this idea of leveraging a large language model while also having a way to validate its results. I’m actually working on how to leverage large language models to verify the correctness of code, code that may be generated by a large language model itself. So it’s not exactly, you know, a follow-up of this work, but I would say at the idea, you know, philosophical level, it is something along this line of, you know, leveraging large language models, leveraging their creativity, leveraging their … sometimes, you know … leveraging their imprecision but having a way, you know, to control it, to verify it. That’s what I’m working on now.

HUIZINGA: Yeah … Bo, you’re finishing up your doctorate. What’s next on your agenda?

STOICA: So we’re thinking of, as Shan mentioned, exploring what large language models can do in this bug-finding/testing arena further and harvesting their imprecision. I think there are a lot of great problems that traditional code analysis has tried to tackle, but it was difficult. So in that regard, we’re looking at performance issues and how large language models can help identify and diagnose those issues because my PhD was mostly focused, up until this point, on correctness. And I think performance inefficiencies are such a wider field and with a lot of exciting problems. And they do have this inherent imprecision and fuzziness to them that also large language models have, so I hope that combining the two imprecisions maybe gives us something a little bit more precise.

HUIZINGA: Well, this is important research and very, very interesting.

[MUSIC]

Shan Lu, Bogdan Stoica, thanks for joining us today. And to our listeners, thanks for tuning in. If you’re interested in learning more about this paper, you can find a link at aka.ms/abstracts. And you can also find it on the SOSP website. See you next time on Abstracts!

[MUSIC FADES]

The post Abstracts: November 4, 2024 appeared first on Microsoft Research.
