Abstracts: January 25, 2024


Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Senior Researchers Jordan Ash and Dipendra Misra join host Gretchen Huizinga to discuss “The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction,” which was accepted to the 2024 International Conference on Learning Representations (ICLR). Layer-Selective Rank reduction, or LASER, is an intervention for targeted parameter reduction in transformer-based models. The work shows that the removal of certain parameters not only maintains model performance like some existing parameter-reduction methods but can actually improve it—no additional training necessary.
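At its core, the LASER intervention replaces one weight matrix in one transformer layer with a low-rank approximation of itself, obtained by keeping only the top singular values. Below is a minimal sketch of that idea using a truncated SVD in PyTorch; the module path and rank fraction in the usage comment are illustrative placeholders, not values from the paper.

import torch

def laser_rank_reduce(weight: torch.Tensor, rank_fraction: float) -> torch.Tensor:
    """Return a low-rank approximation of a single weight matrix.

    Keeps only the top rank_fraction of singular values, the kind of
    layer-selective rank reduction LASER applies to one matrix in one layer.
    """
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    k = max(1, int(rank_fraction * S.numel()))        # number of singular values to keep
    return U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]   # rank-k reconstruction

# Hypothetical usage: reduce one MLP projection in one layer of a model.
# The module path and fraction below are placeholders, not the paper's settings.
# layer = model.transformer.h[20].mlp.fc_in
# with torch.no_grad():
#     layer.weight.copy_(laser_rank_reduce(layer.weight, rank_fraction=0.05))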

To learn more about the paper and related topics, register for Microsoft Research Forum (opens in new tab), a series of panel discussions and lightning talks around science and technology research in the era of general AI.

Transcript

[MUSIC PLAYS]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Today, I’m talking to Dr. Dipendra Misra and Dr. Jordan Ash, both senior researchers at Microsoft Research. Drs. Misra and Ash are coauthors of a paper called “The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction,” also known as LASER. This paper has been accepted at the International Conference on Learning Representations, or ICLR, in Vienna this year, and you can read a preprint of it now on arXiv. Dipendra, Jordan, thanks for joining us on Abstracts!


JORDAN ASH: Thanks for having us.

DIPENDRA MISRA: Yeah, thanks for having us, Gretchen.

HUIZINGA: Dipendra, let’s start with a general overview of this paper. In a few sentences, describe the issue or problem your work addresses and, perhaps more importantly, why we should care about it.

MISRA: Thanks, Gretchen. So as we know, large language models, also known as LLMs, have revolutionized both business and research in artificial intelligence. They are everywhere, being used to solve a wide range of problems. So in our paper, we introduce an intervention which can be applied to any existing pretrained large language models, and our main purpose for introducing this is to see how it affects the performance of the LLMs and whether we can gain insight into how an LLM stores information in its parameters and how it uses that information to generate a response. And what our intervention does is that it performs a low-rank approximation of the parameters of the LLM. And the surprising discovery that our paper makes is that if you do this intervention correctly, then we can get significant improvement on various tasks for different LLMs.

HUIZINGA: So that’s the first part of the question. Tell me why I should care about it!

MISRA: So if you are a person who uses LLMs for solving any tasks, then you do care about performance on a given task. So, for example, you could be using LLMs to generate an email, right, from a given description. Or you could be using an LLM to do question answering. And by applying our intervention, we can gain accuracy on the task that we care about.

HUIZINGA: Well, let’s stick with you, Dipendra, for a minute and talk about the field writ large. Almost all research owes a debt to some other research that went before. So tell us a bit about the related work in this field and how your work builds on or adds to it.

MISRA: So the work that is most closely related to our LASER paper is this growing body of work on understanding how knowledge is stored and edited inside a large language model. So these works don’t apply the intervention that we do, but they were certainly inspirational for us for arriving at the intervention that we introduced. Another line of work which is very related is, like, adding a small number of parameters to improve the performance of the LLM on a given task. The most relevant work in this space is the LoRA paper, also known as the “Low-Rank Adaptation of Large Language Models,” which came from Microsoft. And what LoRA does, it adds a small number of additional parameters to an LLM and then fine-tunes it on a given task. And what our intervention, called LASER, does is that it removes parameters instead of adding them. And another line of work which is also related is the work on model compression. So there are people who focus on bringing down the size of the models as much as possible while still retaining the performance, more or less, compared to the base model. And so these people are also focused on removing parameters, but they are coming at it from a different angle of, like, trying to reduce the memory footprint, while what we were doing is that we are less focused on the memory footprint—that’s more like a side effect of it—and more like if I were to fiddle with this parameter of the LLM, then how does it affect the performance? And what can we learn by looking at the comparison? Like, OK, so if I remove this parameter, I see the performance drop; then it means that these parameters are storing something about this type of task on which the performance is dropping.

HUIZINGA: So I’ll ask you one more question, Dipendra, before I pull Jordan into the conversation, and that would be about your methodology. How would you describe your approach to this project, and how did you conduct the research?

MISRA: So we started by analyzing the intervention LASER on a particular LLM called GPT-J and evaluating its performance on this question-answering dataset CounterFact. So our idea was, like, before trying this thing on [a] bunch of things, let’s just understand this in one setting deeply and, kind of, build insights that we can then evaluate in other settings. And the reason we chose this setup was that the GPT-J large language model has its training data publicly available. It’s called the Pile dataset. And that allows us to do analysis with the training data. For example, is the performance dropping on data points which are rarer or more frequent in the training data? And this is important because training data analysis is frequently omitted in existing LLM literature, and that’s something we wanted to do. And the second reason is that the CounterFact question-answering dataset is both related to the prior work in this space, so there was a reason for choosing it, but also it has paraphrases of the same question. For example, it might ask, like, “Who is the president of the United States of America?” But it will also have paraphrases like “The president of the United States of America is …” or “The head of the government of the United States of America is …” And so it will have different variations of the same question. And then you can see if the LLM is able to get all of them right, or is it not robust to variations of the same question? And so we did analysis on this GPT-J and CounterFact dataset. And Jordan will talk more about what the results were. And so based on this rigorous analysis, we developed some insights as to what the intervention is doing. And then we evaluated these insights on other settings. So then we tried, like, two other different large language models and evaluated it on, like, multiple different datasets. And then we saw that the insights actually hold more broadly. And finally, we also evaluated this in a non-text related task, right. Because the intervention could, in principle, be applied to any neural network. So we went after this reinforcement learning model, which solves a puzzle called Sokoban. And we also saw that if you apply this intervention correctly, then you can get some performance improvement. So it’s not related to just large language models, although that was our main motivation.

HUIZINGA: Well, Jordan, let’s get your take on the last few questions here. As I’ve said before, the most interesting section of a research paper for me is the part where it says, “and what we found was …” So as a result of this research, what did you find? Were there outcomes that you expected, or were there any surprises?

ASH: I would say this paper is full of surprises. So as Dipendra was mentioning earlier, the LASER intervention removes information from a model. It doesn’t add information to a model. And up until now, there’s been a lot of work on pruning model parameters for a variety of reasons. But generally, these papers show that as parameters are removed from the model, performance just does not degrade. You can, overall, keep performance roughly the same even with a fairly drastic reduction of model parameters. And those reductions are typically done across layers of the model. What we’re showing here is surprising because we’re showing if we do a very targeted intervention, maybe at only one layer of the model, we could actually get a big boost in performance rather than just, you know, keep it the same or something like this.

HUIZINGA: Hmm. So with those results in mind, Jordan, I’m curious about practical applications. How would you say this research makes an impact in real-world situations? I know that Dipendra alluded to that earlier, but where is this most useful and who benefits most?

ASH: I think the short sales pitch for this technique is that you could potentially improve the performance of a language model with no additional training at all just by applying this intervention, which again just removes information from the model, so you don’t need to have any extra data on hand to refine the model or to add new information into it. The real-world situations where we’re seeing a boost right now with LASER are, like, question answering or reasoning-type tasks where there’s, like, a concrete answer that corresponds to what you’re asking the LLM rather than just a, sort of, like, broad-purpose generative task.

HUIZINGA: So typically speaking, when you’re dealing with LLMs, part of the issue is prompt engineering. And it’s like my responsibility to be able to put the right words in it so I’ll get the best answer from the model, right? Are you saying that this helps me not have to be that good on the prompt-engineer end versus what the model can interpret and do?

ASH: I think prompt engineering still has a place in, sort of, eking out a good answer from a language model, but given a fixed prompt, this intervention seems to offer an improved accuracy over not intervening at all and applying the same prompt.

HUIZINGA: So, Jordan, I often think of an abstract as a sort of appetizer for a research paper. But let’s distill it even further. If there was one thing—sort of an amuse-bouche, if you will—that you want our listeners to take away from this work, what would it be?

ASH: For me, I like this idea of how, you know, typically if you want to get a model to perform better, you would take that model off the shelf and you would refine it on data related to the task at hand. And that might take the form of refining all of the parameters or doing some low-rank LoRA-type thing that Dipendra alluded to earlier. Here, we counterintuitively show that sometimes just carefully removing information from the model can have a positive effect, as well. And this is great news because refining a model requires a lot of new target domain data to be available, but removing information from the model doesn’t necessarily have that same constraint.

HUIZINGA: Well, finally, let’s talk a little bit about the future, Jordan, and I’ll have you close the show for us. What unanswered questions or ongoing research challenges do you see here, and what’s next maybe on your research agenda?

ASH: Yeah, I think there’s a lot of exciting future work for this project. I think for one, as a practical matter, there’s this question of just what’s the best way to find the best LASER intervention? LASER targets a specific layer of the model, and then it finds the extent by which it should be rank-reduced. That search procedure is, kind of, expensive. Right now, we’re doing it in a, sort of, exhaustive way. But also, it seems to be beneficial to apply LASER at multiple layers of the model. And that makes the search procedure, sort of, combinatorially explode. So finding out the best way to compose these interventions, I think, is an important area of future research. And then just, sort of, less on the practical side, I think there are all these questions related to just, why does this work at all? Like, why is it helpful to remove information from the model? And, you know, I think there are some rough ideas we have about this. For example, when you’re training a model on lots and lots of data, you know, it’s not all created equally. Some of it might be noisy or low quality, and some of it might be high quality. And maybe it’s better to remove those samples at training time to get a better model. So I guess there’s this question of, is pruning the model using a LASER-type intervention roughly equivalent to pruning the training data in a way to make it more favorable for eliciting a high-quality model? And again, like Dipendra alluded to earlier, this LoRA procedure, which does something that very much complements LASER and is often used to add information to a model, is it possible that LoRA is actually not just adding information but also removing information from the model? And perhaps that’s one reason why LASER seems to be so effective.

HUIZINGA: So lots of questions.

ASH: I would say so, yeah!

HUIZINGA: Well, Dipendra Misra, Jordan Ash, thanks for joining us today. And to our listeners, thanks for tuning in.

[MUSIC PLAYS]

Again, you can find a link to this paper at aka.ms/abstracts (opens in new tab) or on arXiv (opens in new tab). And I’ll also add that Dipendra will be speaking about this work at the upcoming Microsoft Research Forum, and you can register for this series of events at researchforum.microsoft.com (opens in new tab). See you next time on Abstracts!

[MUSIC FADES]


Research Focus: Week of January 22, 2024


Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.


Register for Microsoft Research Forum

Join Microsoft Research Forum (opens in new tab) for a continuous exchange of ideas about science and technology research in the era of general AI. This series, which begins on January 30, will explore recent research advances, bold new ideas, and important discussions with the global research community. Register now to receive access to all episodes in this quarterly series and be part of the conversation.


Improving Text Embeddings with Large Language Models

Text embeddings are vector representations of natural language that encode semantic information. They are widely used in various natural language processing tasks, such as information retrieval, question answering, semantic textual similarity, bitext mining, item recommendation, etc.

In a recent paper: Improving Text Embeddings with Large Language Models (opens in new tab), researchers from Microsoft introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps. Unlike existing methods, this new method does not require building complex training pipelines or manually collected datasets that are often constrained by task diversity and language coverage. The researchers leverage proprietary large language models (LLMs) to generate diverse synthetic data for hundreds of thousands of text embedding tasks across nearly 100 languages. They then fine-tune open-source decoder-only LLMs on the synthetic data using standard contrastive loss. Experiments demonstrate that this method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data. Furthermore, when fine-tuned with a mixture of synthetic and labeled data, the model sets new state-of-the-art results on the BEIR (opens in new tab) and MTEB (opens in new tab) benchmarks.
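The contrastive fine-tuning step pairs each synthetic query with its relevant passage and treats the other passages in the batch as negatives. The sketch below shows a standard in-batch contrastive (InfoNCE) loss on pooled embeddings; it illustrates the general recipe rather than the paper’s exact training code, and the temperature value is an assumed placeholder.

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb: torch.Tensor,
                              passage_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """In-batch InfoNCE: the i-th query should match the i-th passage.

    query_emb, passage_emb: (batch, dim) pooled embeddings from the fine-tuned LLM.
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature                      # scaled cosine similarities
    targets = torch.arange(q.size(0), device=q.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, targets)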



DevEx in Action: A study of its tangible impacts

For many professional software developers, the development lifecycle is riddled with friction and red tape, and successful delivery of code to production is a frustratingly infrequent event. Even worse, the problems are often compounded by a lack of management engagement, delaying and frustrating top engineers.

Developer experience (DevEx) is garnering increased attention at many organizations as leaders seek to optimize software delivery against a backdrop of fiscal tightening and transformational technologies such as AI. Developers and technical leaders generally understand that good DevEx leads to better products, more effective software delivery, and developer happiness. Yet, at many organizations, proposed initiatives and investments to improve DevEx struggle to get buy-in, as business stakeholders question the value proposition of improvements.

In a recent paper: DevEx in Action: A study of its tangible impacts (opens in new tab), researchers from Microsoft, GitHub, and DX (opens in new tab) examine this problem and present empirical evidence of how improvements in DevEx influence outcomes like productivity, code quality, and innovation.



MetaOpt: Examining, explaining, and improving heuristic performance


Heuristic algorithms, often referred to as heuristics, are tools used to approximate optimal algorithms to make faster and more efficient decisions. These are particularly useful in operational scenarios in production systems, such as determining which server a virtual machine should be assigned to or deciding whether data should be removed from a cache in a content delivery network.

However, cloud operators, who are responsible for designing and managing systems in production, often struggle to evaluate when and where their heuristics may underperform. This challenge can lead to over-provisioning the network and the inefficient use of available resources. Such practices can be costly and may result in the inability to meet customer demand.

To address this, we developed MetaOpt, a heuristic analyzer designed to enable operators to examine, explain, and improve heuristics’ performance before deploying them in critical, high-stakes environments. MetaOpt is unique because it not only compares algorithm performance but also provides insights into the underlying reasons for performance disparities between algorithms. It empowers operators and researchers to conduct “what-if” analyses, strategize how to combine heuristics in production, and understand why certain heuristics perform better in specific areas of the input space—the range of possible inputs that the heuristic may encounter.

We demonstrate MetaOpt’s capability for heuristic analysis by studying heuristics from three domains: traffic engineering, vector bin packing, and packet scheduling. MetaOpt identifies large performance gaps, enables us to prove properties about these heuristics, and guides us in improving them. Table 1 summarizes the results.

Table 1. MetaOpt enabled us to find the performance gap between heuristics from traffic engineering (TE), vector bin packing (VBP), and packet scheduling (PIFO). It also helped us prove various properties about the heuristics and design modifications to improve their performance. DP refers to a heuristic Microsoft has deployed in its wide area network for traffic engineering.

Currently, MetaOpt helps Azure operators analyze heuristics in production and serves as a “helper for theorem proving.” For example, we used MetaOpt to establish a tighter bound for the “first fit decreasing” heuristic in vector bin packing, a challenge for theoreticians for over three decades. As a result, we don’t need to over-provision resources in a cloud environment, ensuring we always have sufficient servers to meet customer demand.

MetaOpt framework

To use MetaOpt, users input the heuristic they want analyzed and either the optimal algorithm or another heuristic. MetaOpt efficiently translates these inputs into a solver format. It then finds performance gaps and the inputs that cause them. Recognizing that not all users are versed in optimization theory, we designed a higher-level abstraction for MetaOpt. This feature enables users to input their heuristics using a few simple building blocks and constrain the input space to what is relevant in practice. MetaOpt can then analyze decisions made by the heuristic that led to underperformance or identify input properties that caused the heuristic to make suboptimal choices. We illustrate the MetaOpt workflow in Figure 1.

Figure 1. The four steps in the MetaOpt workflow: (1) users encode the heuristic; (2) MetaOpt automatically rewrites it to obtain a single-level optimization; (3) it divides the problem into smaller, more manageable segments for scalability; (4) it employs existing solvers to find the highest performance gap.
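As a toy illustration of what finding the inputs that cause a performance gap means, the sketch below checks a handful of bin-packing instances for one on which the first-fit-decreasing heuristic uses more bins than an optimal packing. MetaOpt itself encodes both algorithms for a mathematical solver rather than enumerating inputs, so this is only a conceptual stand-in.

import itertools
import random

def first_fit_decreasing(items, capacity=1.0):
    """Heuristic: place each item (largest first) into the first bin where it fits."""
    bins = []
    for item in sorted(items, reverse=True):
        for b in bins:
            if sum(b) + item <= capacity + 1e-9:
                b.append(item)
                break
        else:
            bins.append([item])
    return len(bins)

def optimal_bins(items, capacity=1.0):
    """Brute-force optimum for tiny instances: try every assignment of items to k bins."""
    n = len(items)
    for k in range(1, n + 1):
        for assignment in itertools.product(range(k), repeat=n):
            loads = [0.0] * k
            for item, b in zip(items, assignment):
                loads[b] += item
            if all(load <= capacity + 1e-9 for load in loads):
                return k
    return n

# The "leader" searches for adversarial inputs. A known first-fit-decreasing
# counterexample is seeded so a gap shows up even on a short random search.
random.seed(0)
candidates = [[0.5, 0.5, 0.4, 0.4, 0.3, 0.3, 0.3, 0.3]]
candidates += [[round(random.uniform(0.2, 0.6), 2) for _ in range(6)] for _ in range(200)]

worst_gap, worst_items = 0, None
for items in candidates:
    gap = first_fit_decreasing(items) - optimal_bins(items)
    if gap > worst_gap:
        worst_gap, worst_items = gap, items
print(worst_gap, worst_items)   # e.g., 1 [0.5, 0.5, 0.4, 0.4, 0.3, 0.3, 0.3, 0.3]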

Rooted in game theory concepts

MetaOpt is based on Stackelberg games, a well-known class of leader-follower games in game theory. Here, the leader determines inputs for one or more followers, who must then optimize their outcomes based on these inputs. In the MetaOpt framework, the leader’s goal is to maximize the performance disparity between two algorithms (the followers) by deciding their inputs. The followers, representing the algorithms being compared, choose internal variables to optimize their outcomes. This, in turn, affects the leader’s results. We show this in Figure 2.

Figure 2. The high-level Stackelberg formulation of MetaOpt.

Looking ahead

MetaOpt marks a significant advance in the field of scalable, user-friendly analytical tools. It enables users to examine, understand, and explain differences in performance across competing algorithms. It also helps them improve those algorithms before deploying them in critical environments.

We began developing MetaOpt in early 2022 to address a specific need for heuristic analysis in our network’s traffic engineering solution. Since then, our focus has been on enhancing MetaOpt’s accessibility for users without a background in optimization theory. Currently, we are improving MetaOpt’s scalability and usability, and we are expanding the range of heuristics it supports. We plan to release it as an open-source tool at the USENIX Symposium on Networked Systems Design and Implementation (opens in new tab) (NSDI) conference, scheduled for April 16–18, 2024.

We believe that MetaOpt can significantly boost productivity for those studying or designing heuristics by serving as a risk-analysis engine and a tool for explainable AI and active learning. In the near future, we aim to publish papers on new MetaOpt applications and share our language for describing heuristics.

For more details, visit the MetaOpt webpage, and review our publications page for the latest developments.


GHDDI and Microsoft Research use AI technology to achieve significant progress in discovering new drugs to treat global infectious diseases


The Global Health Drug Discovery Institute (GHDDI) (opens in new tab) and Microsoft Research recently achieved significant progress in accelerating drug discovery for the treatment of global infectious diseases. Working in close collaboration, the joint team successfully used generative AI and foundation models to design several small molecule inhibitors for essential target proteins of Mycobacterium tuberculosis and coronaviruses. These new inhibitors show outstanding bioactivities, comparable to or surpassing the best-known lead compounds.

This breakthrough is a testament to the team’s combined efforts in generative AI, molecular physicochemical modeling, and iterative feedback loops between scientists and AI technologies. Normally, the discovery and in vitro confirmation of such molecules could take up to several years, but with the acceleration of AI, the joint team achieved these new results in just five months. This research also shows the tremendous potential of AI for helping scientists discover or create the building blocks needed to develop effective treatments for infectious diseases that continue to threaten the health and lives of people around the world.

Since 2019, for example, there have been more than 772 million confirmed cases of COVID-19 worldwide and nearly 7 million deaths from the virus, according to the World Health Organization (WHO), the Centers for Disease Control, and various other sources. Although vaccines have reduced the incidence and deadliness of the disease, the coronavirus continues to mutate and evolve, making it a serious ongoing threat to global health. Meanwhile, the WHO reports that tuberculosis continues to be a leading cause of death among infectious diseases, second only to COVID-19 in 2022, when 10.6 million people worldwide fell ill with TB and the disease killed 1.3 million (the most recent figures currently available).

Laying the foundation for new infectious disease treatments

Microsoft Research has rich experience in developing and pre-training large AI models specialized for proteins and molecules, demonstrated in both property prediction and molecular generation. Based on those experiences, Microsoft Research developed and maintains ownership of an AI model for molecule generation tailored for specific protein targets. The generated compounds were virtually screened and further optimized by data scientists and medicinal chemists from GHDDI, followed by compound synthesis and wet-lab experiments to quantify bioactivities. The experimental results were then fed back to the research team at Microsoft for AI model improvement and new compound generation.

This integrated AI-expert-experiment pipeline enabled the successful generation of novel compounds for protein targets in Mycobacterium tuberculosis and the coronavirus SARS-CoV-2. In less than five months, the joint team designed several chemical compounds that are effective in inhibiting these pathogens’ essential target proteins, accelerating the structure-based drug discovery process.

Figure 1. Two potential inhibitor compounds (generated by our method) for ClpP of tuberculosis.
Dose-response curves of the compounds generated for coronavirus, with GRL0617 as the reference compound, demonstrating enhanced bioactivity. The most recent progress is that the joint team has effectively optimized the IC50 to 0.18 µM, which is approximately an eight-fold improvement compared to GRL0617.

One distinguishing feature of AI-generated molecules is their novel scaffold structures, which are important because they create the potential for these molecules to be developed into a new class of drug candidates. These novel structures offer the possibility of more effective treatments, and also help to address the escalating challenge of antimicrobial resistance (AMR), a major hurdle in treating infectious diseases like tuberculosis and COVID-19.

“In the current landscape of scientific research, we encounter unparalleled challenges but also have unprecedented opportunities,” said Dr. Sheng Ding, institute director of GHDDI. “Innovation stands as the central catalyst for scientific advancement and a crucial element in addressing global health challenges. I’m excited about our collaboration with Microsoft Research and gratified with the progress we’ve jointly achieved. Without a doubt, our combined efforts will enhance R&D efficiency and expedite the process of drug discovery.”

“This represents a collaboration that transcends disciplines and boundaries,” he noted. “Our combined strengths will advance pharmaceutical research, paving new avenues in scientific exploration. Going forward, we anticipate deploying such cutting-edge technologies in uncharted realms of life sciences. This will enable us to offer more comprehensive, profound, and practical solutions for global health issues.”



Using AI to improve global health

Embracing the principle of open innovation, the collaboration between GHDDI and Microsoft Research is dedicated to harnessing AI technology to expedite drug discovery. The goal is to contribute to global health equity through the development of lifesaving medications and the prompt delivery of safer and more effective drug solutions that are accessible to everyone.  The collaboration focuses on infectious diseases that pose a threat to global health, including but not limited to tuberculosis, viral infections, and malaria. Both parties are committed to a deep integration of generative AI, foundational models, high-throughput virtual screening, and expert knowledge to tackle these challenges.

“Successful AI-driven drug discovery necessitates a tight-knit collaboration between AI specialists and medicinal experts,” said Dr. Tie-Yan Liu, distinguished scientist at Microsoft Research AI4Science. “In recent years, our globally recognized team at Microsoft Research has been deeply engaged in interdisciplinary research between AI and natural science. To complement this, GHDDI experts bring to the table a wealth of industry experience and profound domain knowledge. Their experimental facilities not only allow for testing but also help provide invaluable feedback for training AI models. Because of our close collaboration, we look forward to producing groundbreaking research outcomes with the potential to redefine the future of healthcare through AI technology innovation.”

Accelerating drug discovery

Commenting on the research into Mycobacterium tuberculosis and coronaviruses, Dr. Rumin Zhang, chief scientific officer at GHDDI, noted that the application of AI technology by the collaborative team managed to considerably reduce the traditionally lengthy drug discovery process. The team was able to design and validate highly effective small molecule inhibitors for the pathogens in just five months.

“This is an exceptional accomplishment that underscores the immense potential of AI in efficient de novo drug design. It also vividly illustrates the team’s exceptional innovative capacity and professional prowess,” he said. “We are excited about this innovative R&D strategy leading to more groundbreaking advancements in a broader spectrum of future drug discovery projects.”

“This work is all about pushing the boundaries of AI technology for application in new drug R&D,” said Dr. Tao Qin, senior principal researcher at Microsoft Research AI4Science. “We aim to leverage AI innovations to enhance human health, tackle worldwide health issues, and ensure the advantages of AI technology are accessible to all.”

“We plan to intensify and broaden our collaboration, further advancing the use of AI technology in the realm of life sciences,” said Dr. Jinjiang Guo, head of the Data Science Department at GHDDI. “This will yield novel insights that will enrich researchers’ understanding of mechanisms underlying diseases and life, thus paving the way for the development of innovative treatment strategies and providing more effective solutions for diseases that have long affected human health. We are highly optimistic about the potential of this collaboration and are confident that it will have a substantial impact on the future of the healthcare field.”

Next steps

In the next phase, Microsoft Research and GHDDI will collaborate to optimize the discovered hit compounds, enhance ADMET properties, progress toward preclinical studies, and initiate a broader range of drug-discovery projects.


TaskWeaver: A code-first agent framework for efficient data analytics and domain adaptation


The advent of large language models (LLMs) has revolutionized human-machine interactions, particularly in natural language understanding and generation applications. These AI- or LLM-backed virtual assistants hold the promise of serving as intelligent agents capable of autonomously reasoning, observing, and performing tasks articulated in natural language. However, it is still challenging for most agent frameworks to efficiently handle complex data structures (e.g., DataFrame), which are prevalent in data analytics tasks and domain-specific scenarios. 

To address these challenges, we introduce TaskWeaver – a code-first agent framework that converts natural language user requests into executable code, with additional support for rich data structures, dynamic plugin selection, and a domain-adapted planning process. Now publicly available as an open-source framework, TaskWeaver leverages LLMs’ coding capability to implement complex logic and incorporates domain-specific knowledge through customizable examples and plugins. TaskWeaver empowers users to easily build their own virtual assistants that understand diverse domain questions, follow examples, and execute customizable algorithms on complex data structures efficiently.

Motivating example – Anomaly detection on time-series data

Scenario

Amy is a business analyst who wants to identify anomalies in a time series of sales data stored in a SQL database. She would like to get help from an AI assistant for the task through natural language interactions. Moreover, Amy would like to apply her own definition and interpretation of anomalies in the context of sales data, including a customized anomaly detection algorithm. Figure 1 shows the desired conversation between the user and the AI assistant – the AI assistant should be able to first pull the data from the target database, then apply the desired algorithms, and return the visualized results.

Figure 1. A sample conversation between the user and the AI assistant powered by TaskWeaver. The user asks to pull data from a database and apply anomaly detection; the assistant accomplishes the task through multiple rounds of communication with the user and finally plots a figure of the anomalies.
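A customized detector like the one Amy has in mind could be as simple as flagging points that deviate from the series mean by more than three standard deviations. The sketch below is a hypothetical stand-in for such an algorithm operating on a pandas DataFrame; it is not TaskWeaver’s built-in anomaly_detection plugin.

import pandas as pd

def detect_anomalies(df: pd.DataFrame, time_col: str, value_col: str,
                     n_std: float = 3.0) -> pd.DataFrame:
    """Flag rows whose value is more than n_std standard deviations from the mean."""
    mean, std = df[value_col].mean(), df[value_col].std()
    out = df[[time_col, value_col]].copy()
    out["is_anomaly"] = (out[value_col] - mean).abs() > n_std * std
    return out

# Hypothetical usage on sales data already pulled from the SQL database:
# result = detect_anomalies(sales_df, time_col="date", value_col="value")
# result[result["is_anomaly"]]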

Requirements for an agent framework

To accomplish the task above, we identify several key requirements that current agent frameworks may lack:

  • Plugin: The agent needs to first query and collect data from a database, and then detect anomalies using a specialized anomaly detection algorithm. Both require the capability to define and invoke custom plugins, e.g., the query_database plugin and the anomaly_detection plugin. 
  • Rich data structure: The agent should be capable of handling data in complex but common structures, such as arrays, matrices, and tabular data (e.g., pandas DataFrame (opens in new tab)), to perform advanced data processing actions. Many existing works tend to transform the intermediate outputs into strings in the prompt or save them as local files before reading them again. However, this practice is error-prone and could easily exceed the prompt token limit. Additionally, data in rich structures should be able to transfer easily from one plugin to another.  
  • Stateful execution: The agent engages in iterative interactions with the user, processing user inputs and executing tasks accordingly. The execution states should be preserved throughout the entire conversation session across multiple chat rounds. 
  • Reasoning and acting (ReAct): The agent should have the capability to first observe and then act. The database might contain data of various schemas in the real world, leading to different arguments for anomaly detection. Therefore, the agent must first inspect the data schema, understand which columns are appropriate (and ask users to confirm), then feed the corresponding column names into the anomaly detection algorithm.
  • Arbitrary code generation: The agent should be able to generate code to accommodate ad-hoc user demands, which are not covered by the pre-defined plugins. In the example provided, the agent generates code to visualize the detected anomalies without using any plugins.  
  • Incorporating domain knowledge: The agent should provide a systematic way to incorporate domain-specific knowledge. It would help LLMs deliver better planning and accurate tool calling, which in turn produces reliable results, particularly in domain-specific scenarios.



TaskWeaver architecture

Figure 2 shows the three core components in the TaskWeaver architecture. The Planner serves as the system’s entry point and interacts with the user. Its responsibilities include: (1) planning – breaking down the user’s request into subtasks and managing the execution process with self-reflection; and (2) responding – transforming the execution result into a human-readable response for the user. The Code Interpreter consists of two components: the Code Generator generates code for a given subtask from the Planner, considering existing plugins and domain-specific task examples; the Code Executor is responsible for executing the generated code and maintaining the execution state throughout the entire session.

Figure 2. Overview of the TaskWeaver architecture, which consists of three parts: the Planner, the Code Generator, and the stateful Code Executor. They communicate with each other to accomplish the user’s request.

Running workflow for motivating example

TaskWeaver has a two-layer planning process for dealing with user requests. In the first layer, the Planner generates a high-level plan outlining the steps required to fulfill the request. In each subsequent round, the code generator will devise a plan, in terms of chain-of-thought and generated code, to execute the specified step. Figure 3 presents the internal workflow of TaskWeaver when accomplishing the motivating example mentioned above. Note that the prompts shown in Figure 3 are simplified and do not represent the full complex instructions. 

The initial step involves the Planner taking the user query, the Code Interpreter description, and planning examples (if provided) to generate a plan. For the given example, the plan first pulls data from the database and describes the data schema. The Code Generator prompt delineates its profile and competencies, providing definitions of all relevant plugins (e.g., function name, description, arguments, and return values). The output from the Code Generator is a code snippet that executes the sql_pull_data plugin, retrieves the data into a DataFrame, and provides a description of the data schema.

Next, the code generated is sent to the Code Executor for execution, after which the result is sent back to the Planner to determine the next planning step. In the example, the execution result reveals two columns, namely date and value, in the DataFrame. For the next step, the Planner can either confirm with the user if these columns correspond to the two input parameters of the anomaly_detection plugin, or directly proceed to the next step.

Figure 3. Workflow of TaskWeaver when handling the motivating example. The Planner first generates a plan by incorporating the Planner description, the Code Interpreter description, and the plugin definitions. The first step of the plan is passed to the Code Generator to generate Python code, which is then forwarded to the Code Executor. After collecting the execution results, the Planner responds to the user.

Key design considerations of TaskWeaver

  • Code-first analytics experience: TaskWeaver converts user requests into Python programs that run on dedicated processes, where the Python program can be plugin invocations, arbitrary code to handle ad-hoc user queries, or both. Unlike other frameworks that rely on text or file-based expressions, TaskWeaver can fully utilize native data structures such as pandas DataFrame and numpy ndarray that exist in the memory. This makes it easy to perform tasks such as pulling data from a database, running machine learning algorithms (e.g., anomaly detection, classification, or clustering), summarizing results, and visualizing analysis outcomes. 
  • Domain adaptation: Incorporating domain-specific knowledge into the model via prompts can help boost LLMs’ performance when the user query is complex. TaskWeaver provides two options to make customizations with the user’s domain knowledge:
    • Customization with plugins: Users can define custom plugins (including Python implementation and schema) to incorporate domain knowledge, such as pulling data from a specific database, and running a dedicated algorithm.
    • Customization with examples: TaskWeaver also provides an easy-to-implement interface (in YAML format) for users to configure examples to teach the LLMs how to respond to certain requests. The examples can be of two categories: one is used for planning and the other is for code generation.
  • Stateful code execution: When users make ad-hoc requests for data analytics, it often involves multiple iterations. As a result, TaskWeaver maintains the state of code execution throughout the entire session. This is like programming in Python using the Jupyter Notebook, where users type code snippets in a sequence of cells and the program’s internal state progresses sequentially. The difference in TaskWeaver is that users use natural language instead of programming language. TaskWeaver converts each user request into one or more code snippets in each round, depending on the specific plan; a minimal sketch of this idea appears after this list. 
  • Others: TaskWeaver also supports other features such as intelligent plan decomposition and self-reflection to respond to a user’s request in a more reliable and organized manner. Moreover, features like restricted code generation can help limit the capabilities of the generated code to reduce security risks.
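The following is a minimal sketch of the stateful execution idea referenced above, assuming nothing more than a shared Python namespace that persists across rounds; TaskWeaver’s actual Code Executor is more involved.

session_state: dict = {}

def execute_round(code_snippet: str) -> None:
    """Run one round's generated code in the shared session namespace."""
    exec(code_snippet, session_state)

# Round 1: the generated code pulls data (faked here with an inline DataFrame).
execute_round(
    "import pandas as pd\n"
    "df = pd.DataFrame({'date': ['2024-01-01', '2024-01-02'], 'value': [10, 500]})"
)
# Round 2: later generated code can refer to df because the namespace persisted.
execute_round("anomalies = df[df['value'] > 100]")
print(session_state["anomalies"])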

Getting started

TaskWeaver (opens in new tab) is now publicly available on GitHub. You may run the following commands to quickly get started.

git clone https://github.com/microsoft/TaskWeaver.git 
cd TaskWeaver 
pip install -r requirements.txt   # install the requirements 

Once the installation is finished, users can configure key parameters, such as the LLM endpoint, key and model, and start the TaskWeaver service easily by following the running examples (opens in new tab).



Advancing transparency: Updates on responsible AI research


Editor’s note: All papers referenced here represent collaborations throughout Microsoft and across academia and industry that include authors who contribute to Aether, the Microsoft internal advisory body for AI ethics and effects in engineering and research.


A surge of generative AI models in the past year has fueled much discussion about the impact of artificial intelligence on human history. Advances in AI have indeed challenged thinking across industries, from considering how people will function in creative roles to effects in education, medicine, manufacturing, and more. Whether exploring impressive new capabilities of large language models (LLMs) such as GPT-4 or examining the spectrum of machine learning techniques already embedded in our daily lives, researchers agree on the importance of transparency. For society to appropriately benefit from this powerful technology, people must be given the means for understanding model behavior.  

Transparency is a foundational principle of responsible, human-centered AI and is the bedrock of accountability. AI systems have a wide range of stakeholders: AI practitioners need transparency for evaluating data and model architecture so they can identify, measure, and mitigate potential failures; people using AI, expert and novice, must be able to understand the capabilities and limitations of AI systems; people affected by AI-assisted decision-making should have insights for redress when necessary; and indirect stakeholders, such as residents of cities using smart technologies, need clarity about how AI deployment may affect them.

Providing transparency when working with staggeringly complex and often proprietary models must take different forms to meet the needs of people who work with either the model or the user interface. This article profiles a selection of recent efforts for advancing transparency and responsible AI (RAI) by researchers and engineers affiliated with Aether, the Microsoft advisory body for AI ethics and effects in engineering and research. This work includes investigating LLM capabilities and exploring strategies for unlocking specialized-domain competencies of these powerful models while urging transparency approaches for both AI system developers and the people using these systems. Researchers are also working toward improving identification, measurement, and mitigation of AI harms while sharing practical guidance such as for red teaming LLM applications and for privacy-preserving computation. The goal of these efforts is to move from empirical findings to advancing the practice of responsible AI.

Demo video: Toward user-centered algorithmic recourse

In this demo of GAM Coach, an example of an AI transparency approach, an interactive interface lets stakeholders in a loan allocation scenario understand how a model arrived at its prediction and what factors they can change to meet their goals.

Identifying harms in LLMs and their applications

The sociotechnical nature of AI is readily apparent as product teams sprint to integrate the power and appeal of LLMs into conversational agents and productivity tools across domains. At the same time, recent accounts, such as a lawyer unwittingly submitting generative AI’s fictitious legal citations in a brief to the court (opens in new tab) or unsettling demonstrations of deepfakes, reveal the ample opportunity for misunderstanding these models’ capabilities and, worse yet, for deliberately misusing them.  

Envisioning what could go wrong with an AI system that has not yet been deployed is the first step toward responsible AI. Addressing this challenge, researchers introduce AHA! (anticipating harms of AI), a human-AI collaboration for systematic impact assessment. This framework enables people to make judgments about the impact of potential deployment on stakeholders. It uses an LLM to generate vignettes, or fictional scenarios, that account for an ethical matrix of problematic AI behaviors or harms. Evaluation of this framework in a variety of decision-making contexts found it surfaced a broader range of potential harmful outcomes than either people or LLMs could singly envision.


AI practitioners can follow this planning guide to help them set up and manage red teaming for large language models (LLMs) and their applications. Based on firsthand experience of testing LLMs to identify potentially harmful outputs and plan for mitigation strategies, this guide provides tips for who should test, what to test, and how to test, plus pointers for recording the red-teaming data.

Responsible AI red teaming, or probing models and their applications to identify undesirable behaviors, is another method of harm identification. Microsoft has shared a practical guide for the RAI red teaming of LLMs and their applications, and automated tools for RAI red teaming are beginning to emerge. Although the vital task of impact assessment and testing for failures can be facilitated by LLMs helping with creative brainstorming, researchers emphasize that for AI to be human centered, such efforts should never be fully automated. To improve human-AI complementarity in red teaming, AdaTest++ builds on an existing tool that uses an LLM to generate test suggestions as it adapts to user feedback. The redesign offers greater human control for testing hypotheses, enabling editing and exploration of counterfactuals, and conducting in-depth testing across a broad diversity of topics.

Researchers invite AI practitioners working toward responsible AI to use and contribute to AdaTest++ (opens in new tab), which leverages human-AI complementarity to audit LLMs. Augmenting a preexisting tool, AdaTest++ offers prompt templates and introduces greater human control for testing hypotheses and exploring counterfactuals.

In AI privacy, researchers demonstrate how prompt-tuning can be used to infer private information from an email system using a language model to provide autocompleted replies. In sharing their red-teaming technique, they encourage privacy-enhancing efforts for applications using language models and take the stance that transparency of publicly detailing a model’s vulnerability is an essential step toward adversarial robustness. 

Identifying and exposing security vulnerabilities is a top concern, especially when these can seep into AI-generated code. The integration of LLMs for AI-assisted coding has reduced the entry barrier for novice programmers and increased productivity for veteran coders. But it is important to examine the reliability and safety of AI-assisted coding. Although static analysis tools can detect and remove insecure code suggestions caused by the adversarial manipulation of training data, researchers introduce two novel techniques for poisoning code-suggestion models that bypass static analysis mitigation: Covert inserts malicious code in docstrings and comments, while TrojanPuzzle tricks the transformer-based model into substituting tokens, giving the programmer harmless-looking but insecure code. Exposing these vulnerabilities, researchers call for new methods for training code-suggestion models and for processes to ensure code suggestions are secure before programmers ever see them.

Transparency for improving measurement and its validity

We can’t begin to mitigate for the possibility of AI failures without first identifying and then measuring the potential harms of a model’s outputs, transparently examining who may or may not benefit or what could go wrong and to what extent. 

A framework for automating the measurement of harms at speed and scale has two LLMs simulate product- or end-user interaction and evaluate outputs for potential harms, using resources created by relevant sociotechnical-domain experts. As researchers stress, the validity and reliability of such evaluation rely strictly on the quality of these resources—the templates and parameters for simulating interactions, the definition of harms, and their annotation guidelines. In other words, sociotechnical-domain expertise is indispensable.
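A stripped-down version of that simulate-then-evaluate loop might look like the sketch below, where the simulator, judge, and system under test are hypothetical stand-ins for LLM or product endpoints, and the persona prompt and harm rubric would come from sociotechnical-domain experts.

from typing import Callable

def measure_harms(simulator_llm: Callable[[str], str],
                  judge_llm: Callable[[str], str],
                  system_under_test: Callable[[str], str],
                  persona_prompt: str,
                  harm_rubric: str,
                  n_turns: int = 3) -> list[str]:
    """One LLM simulates an end user; a second LLM rates each reply against the rubric."""
    judgments, history = [], ""
    for _ in range(n_turns):
        user_msg = simulator_llm(persona_prompt + "\nConversation so far:" + history)
        reply = system_under_test(user_msg)
        history += "\nUser: " + user_msg + "\nSystem: " + reply
        judgments.append(judge_llm(harm_rubric + "\nResponse to rate:\n" + reply))
    return judgments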

Measurement validity—ensuring metrics align with measurement goals—is central to the practice of responsible AI. Model accuracy in and of itself is not an adequate metric for assessing sociotechnical systems: for example, in the context of productivity applications, capturing what is valuable to an individual using an AI system should also be taken into account. How do we identify metrics appropriate for context-dependent models that are deployed across domains to serve a variety of populations and purposes? Teams need methods to address measurement and mitigation for every deployment scenario.  

Language models illustrate the maxim “context is everything.” When it comes to measuring and mitigating fairness-related harms that are context-dependent in AI-generated text, there’s generally not enough granularity in dataset labeling. Lumping harms under generalized labels like “toxic” or “hate speech” doesn’t capture the detail needed for measuring and mitigating harms specific to various populations. FairPrism is a new dataset for detecting gender- and sexuality-related harms that makes a case for greater granularity in human annotation and transparency in dataset documentation, including identifying groups of people that may be targeted. Researchers situate FairPrism as “a recipe” for creating better-detailed datasets for measuring and mitigating AI harms and demonstrate how the new dataset’s 5,000 examples of English text can probe for fairness-related harms to a specific group.

AI practitioners can request access to the FairPrism dataset (opens in new tab) for detecting gender- and sexuality-related harms in AI-generated text. FairPrism makes a case for greater granularity in human annotation.

Similarly, researchers deepen the conversation around representational harms in automated image-tagging systems, voicing the need for improved transparency and specificity in taxonomies of harms for more precision in measurement and mitigation. Image tagging is generally intended for human consumption, as in alt text or online image search, differentiating it from object recognition. Image tagging can impute fairness-related harms of reifying social groups as well as stereotyping, demeaning, or erasure. Researchers identify these four specific representational harms and map them to computational measurement approaches in image tagging. They call out the benefits of increased granularity but note there is no silver bullet: efforts to mitigate by adding or removing particular tags to avoid harms may in fact introduce or exacerbate these representational harms.

Related papers

Transparency and UX-based mitigations: What designers need and end users want

Prioritizing what people value and designing for the optimal user experience (UX) is a goal of human-centered, responsible AI. Unfortunately, UX design has often been viewed as a secondary consideration in engineering organizations. But because AI is a sociotechnical discipline, where technical solutions must converge with societal perspectives and social science theory, AI not only brings UX expertise to the foreground but also positions designers as potential innovators, well situated to mitigate some harms and model failures with UX interventions. To realize this, UX designers need transparency—visibility into how models work—so they can form “designerly understanding of AI” to help them ideate effectively. A study of 23 UX designers completing a hands-on design task illustrates their need for better support, including model documentation that’s easier to understand and interactive tools to help them anticipate model failures, envision mitigations, and explore new uses of AI.   

People with varying levels of AI experience or subject-matter expertise are suddenly harnessing commercially available generative AI copilots for productivity gains and decision-making support across domains. But generative AI can make mistakes, and the impact of these failures can differ greatly depending on the use case: for example, poor performance in a creative-writing task has a very different impact than an error in a health care recommendation. As the stakes rise, so does the call for mitigating these failures: people need tools and mechanisms that help them audit AI outputs. UX interventions are well suited for mitigating this type of harm. To begin, researchers propose a taxonomy of needs that co-auditing systems should address when helping people double-check generative AI model responses. Basic considerations should include how easy it is for individuals to detect an error, what their skill level is, and how costly or consequential an error may be in a given scenario. A prototype Excel add-in illustrates these considerations, helping the nonprogrammer inspect the accuracy of LLM-generated code.

There are productivity dividends to paying attention to people’s need and desire for transparency. A central problem people encounter with language models is crafting prompts that lead to useful output. Advancing a solution for this in LLM-based code generation, researchers demonstrate an interface that gives people visibility into how the model maps their natural language query to system action. This transparency approach helps people adapt their mental model of the code generator’s capabilities and modify their queries accordingly. Findings of the user study, which included participants with low expertise in coding, showed this transparency approach promoted user confidence and trust while facilitating explanation and debugging.  Similarly, human-centered efforts such as modeling the timing of when a programmer finds it most valuable to receive a code suggestion emphasize the primacy of end users’ needs when addressing productivity. 

“What it wants me to say” demo video

“What It Wants Me To Say”

This transparency approach provides nonexpert programmers with an interface that gives visibility into how a language model maps their natural language query to system action, helping them adapt their mental model and modify their prompts.

For experienced coders to be confident and benefit from AI-assisted code completion, they need to be able to easily spot and correct errors and security vulnerabilities. In the first empirical study of the effectiveness of token highlighting for communicating uncertainty of an AI prediction, researchers examine a UX technique that draws programmers’ attention in a way similar to a spell checker. Highlighting tokens that had the highest predicted likelihood of being edited resulted in programmers being able to complete tasks faster with better-targeted edits. Participants also desired more transparency in the form of explanations to help with diagnosing uncertainty and suggested interaction designs that would improve their efficiency and give them control.
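
As a rough illustration of the highlighting technique (not the study’s implementation), the sketch below marks tokens whose predicted edit probability crosses a threshold; the tokens and probabilities are invented for the example.

```python
# Highlight tokens a code model thinks are most likely to be edited, similar
# to how a spell checker draws attention to suspect words. The probabilities
# here are made up; a real system would derive them from the model.
def highlight(tokens, edit_probs, threshold=0.5):
    return " ".join(
        f"[{tok}]" if p >= threshold else tok
        for tok, p in zip(tokens, edit_probs)
    )

tokens = ["total", "=", "sum", "(", "prices", ")", "*", "0.08"]
edit_probs = [0.05, 0.02, 0.10, 0.02, 0.30, 0.02, 0.15, 0.72]
print(highlight(tokens, edit_probs))  # the hard-coded tax rate gets flagged
```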

Communicating the uncertainty of AI predictions in a way that is meaningful is a design challenge in every deployment context. How to provide transparency via explanations remains a conundrum—studies have shown that the mere presence of explanations can increase overreliance on AI. Designing UX that helps people meet their decision-making goals with confidence requires understanding their perceptions of how a given system works. But little is actually known about the processes decision-makers go through when debating whether to rely on an AI system’s output versus their own intuition. Conducting a think-aloud study for insights into the role of human intuition in reliance on AI predictions in AI-assisted decision making, researchers identified three types of intuition people use in deciding to override the system. While performing brief tasks of income prediction and biography classification with AI support, participants expressed “gut feel” about the decision outcome; how specific data characteristics, or features, may impact explanations; and the limitations of the AI system. Findings suggested what the authors call “intuition-driven pathways” to understanding the effect of different types of explanations on people’s decision to override AI. Results showed that example-based explanations, which were textual narratives, better aligned with people’s intuition and reasoning about a prediction than feature-based explanations, which conveyed information with bar charts and other visual tools. At the same time, participants echoed the familiar desire for help with understanding AI systems’ limitations. Suggestions included interface designs to better support transparency and user understanding—for example, interactive explanations that enable people to change attributes to explore the effect on the model’s prediction.

Accommodating varying levels of user expertise is a growing AI UX design challenge across domains and applications. For example, in business, people with limited knowledge of AI or statistics must increasingly engage AI visual-analytic systems to create reports and inform recommendations. While research seeks to address gaps in knowledge for improving user interaction with AI, some practical and evidence-driven tools are already available. A case study of business experts with varying levels of AI proficiency demonstrates the effectiveness of applying existing guidelines for human-AI interaction for transparency cues. Visual explanations improved participants’ ability to use a visual AI system to make recommendations. At the same time, researchers noted a high level of trust in outputs regardless of participants’ understanding of AI, illustrating the complexity of AI transparency for appropriate trust.

Related papers

ALT TEXT: An illustration of transparency considerations in the age of large language models. A network icon labeled “baseline LLM”; network and gear icons labeled “adapted LLM”; and an icon incorporating a keyboard, network, and pointed finger labeled “LLM-powered applications” represent considerations from the technology perspective. An icon of people labeled “stakeholders” represents considerations from the stakeholder perspective.
Transparency for responsible AI is a sociotechnical endeavor. From the technology perspective, organizations developing LLMs and their applications need to consider how to appropriately characterize and communicate capabilities, limitations, performance, and risks and at what level transparency approaches should take place. To meet the needs of stakeholders—the wide range of people developing, deploying, using, or being impacted by LLMs—transparency considerations should include stakeholder goals; improving people’s mental models of LLMs and supporting appropriate trust; and gaining insight into how transparency can contribute to better control mechanisms for LLMs. (Adapted from AI Transparency in the Age of LLMs: A Human-Centered Research Roadmap.)

Transparency: A means for accountability in a new era of AI

This research compilation highlights that transparency is fundamental to multiple components of responsible AI. It requires, among other things, the understanding and communication of datasets and their composition and of model behaviors, capabilities, and limitations. Transparency also touches every aspect of the responsible AI harm mitigation framework: identify, measure, mitigate. Furthermore, this research establishes a primary role for UX in mitigating harms as AI integrates into the apps people rely on every day in their personal and professional lives.     

As authors of a research roadmap for transparency in the age of LLMs outline, these complex models’ massive datasets, nondeterministic outputs, adaptability, and rapid evolutionary pace present new challenges for deploying AI responsibly. There’s much work to be done to improve transparency for stakeholders of highly context-dependent AI systems—from improving how we publish the goals and results of evaluations when it comes to model reporting to providing appropriate explanations, communicating model uncertainty, and designing UX-based mitigations.  

To prioritize transparency in the design of our AI systems is to acknowledge the primacy of the people the technology is meant to serve. Transparency plays a critical role in respecting human agency and expertise in this new frontier of human-AI collaboration and, ultimately, can hold us accountable for the world we are shaping.

Eric Horvitz presenting the KDD2023 Keynote: People and Machines: Pathways to Deeper Human-AI Synergy

Pathways to deeper human-AI synergy

In his KDD 2023 keynote, Microsoft Chief Scientific Officer Eric Horvitz presents an overview of the power of LLM capabilities and the potential for enriching human-AI complementarity.

Related papers

MICROSOFT RESEARCH PODCAST

Intern Insights: Dr. Madeleine Daepp with Jennifer Scurrell and Alejandro Cuevas

In this episode, PhD students Jennifer Scurrell and Alejandro Cuevas talk to Senior Researcher Dr. Madeleine Daepp. They discuss the internship culture at Microsoft Research, from opportunities to connect with researchers to the teamwork they say helped make it possible for them to succeed, and the impact they hope to have with their work.


The post Advancing transparency: Updates on responsible AI research appeared first on Microsoft Research.

Read More

Research Focus: Week of January 8, 2024

Research Focus: Week of January 8, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Research Focus - week of January 8, 2024

Mixture-of-Linear-Experts for Long-term Time Series Forecasting 

Long-term time series forecasting (LTSF), which aims to predict future values of a series given past values, is an important problem in the machine learning community. It’s useful in areas like weather modeling, traffic flow prediction, and financial forecasting.  

The current state of the art on LTSF is attained in some cases by linear-centric models. However, real-world time series are usually nonstationary. For example, traffic patterns change on different days of the week. The inherent simplicity of linear-centric models makes them unable to capture these patterns. In a recent paper: Mixture-of-Linear-Experts for Long-term Time Series Forecasting, researchers from Microsoft and external colleagues propose Mixture-of-Linear-Experts (MoLE) to address this problem. Instead of training a single model, MoLE trains multiple linear-centric models (i.e., experts) and a router model that weighs and mixes their outputs. While the entire framework is trained end-to-end, each expert learns to specialize in a specific temporal pattern, and the router model learns to compose the experts adaptively. Experiments show that MoLE significantly reduces forecasting error of linear-centric models, and MoLE outperforms state-of-the-art transformer-based approaches in 68% of settings.
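
A minimal sketch of the idea follows, assuming a univariate series; the layer shapes and the router input are illustrative rather than the authors’ exact architecture.

```python
# Mixture-of-Linear-Experts, sketched: several linear forecasters plus a
# router that softmax-weights and mixes their predictions per input window.
import torch
import torch.nn as nn

class MoLE(nn.Module):
    def __init__(self, input_len, pred_len, num_experts=4):
        super().__init__()
        # Each expert is a single linear map from past values to future values.
        self.experts = nn.ModuleList(
            [nn.Linear(input_len, pred_len) for _ in range(num_experts)]
        )
        # The router scores the experts (here from the same input window).
        self.router = nn.Linear(input_len, num_experts)

    def forward(self, x):  # x: (batch, input_len)
        weights = torch.softmax(self.router(x), dim=-1)            # (batch, E)
        preds = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, E, pred_len)
        return (weights.unsqueeze(-1) * preds).sum(dim=1)          # (batch, pred_len)

model = MoLE(input_len=96, pred_len=24)
forecast = model(torch.randn(8, 96))  # eight example input windows
```

Training end-to-end lets each expert drift toward one temporal regime (say, weekday traffic) while the router learns when to trust it.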

A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability

End-to-end (E2E) models are the dominant model structure in automatic speech recognition (ASR) and speech translation (ST). This has led to efforts to develop a unified E2E model for multilingual ASR and multilingual ST tasks. Streaming ASR and ST tasks have extensively utilized neural transducers in the past.  

In a recent paper: A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability, researchers from Microsoft present a streaming multilingual speech model – $SM^2$ – which employs a single neural transducer model for transcribing or translating multiple languages into target languages. $SM^2$ is trained on weakly supervised data created by converting speech recognition transcriptions with a machine translation model. Leveraging 351,000 hours of speech training data from 25 languages, $SM^2$ achieves impressive ST performance. Notably, no human-labeled ST data was employed during training; the model relied entirely on weakly supervised ST data generated by converting 351,000 hours of anonymized ASR data from 25 languages with a text-based machine translation service.
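
As a rough sketch of that weak-supervision recipe (with a placeholder translate() call standing in for the machine translation service used in the paper):

```python
# Pair each ASR example's audio with a machine translation of its transcript
# to obtain pseudo speech-translation labels; no human ST labels are needed.
def translate(text: str, target_lang: str) -> str:
    return f"<{target_lang} translation of: {text}>"  # placeholder MT call

def build_weak_st_data(asr_examples, target_lang="de"):
    # asr_examples: iterable of (audio, transcript) pairs from ASR corpora
    return [(audio, translate(transcript, target_lang))
            for audio, transcript in asr_examples]

weak_st = build_weak_st_data([("utt_001.wav", "hello everyone"),
                              ("utt_002.wav", "thanks for joining")])
```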

The researchers also demonstrate the truly zero-shot capability of $SM^2$ when expanding to new target languages, generating high-quality zero-shot ST translation for {source-speech, target-text} pairs that were not seen during training.

Microsoft Research Podcast

AI Frontiers: AI for health and the future of research with Peter Lee

Peter Lee, head of Microsoft Research, and Ashley Llorens, AI scientist and engineer, discuss the future of AI research and the potential for GPT-4 as a medical copilot.


KBFormer: A Diffusion Model for Structured Entity Completion

Deep generative models include large language models (LLMs) for text, plus models for other modalities, such as for vision and audio. In a recent paper: KBFormer: A Diffusion Model for Structured Entity Completion, researchers from Microsoft and external colleagues explore generative modeling of structured entities with heterogeneous properties, such as numerical, categorical, string, and composite. This includes entries in rich knowledge bases (KBs), items in product catalogs or scientific catalogs, and ontologies like the periodic table of elements and the various properties of isotopes. 

Their approach handles such heterogeneous data through a mixed continuous-discrete diffusion process over the properties, using a flexible framework that can model entities with arbitrary hierarchical properties. Using this approach, the researchers obtain state-of-the-art performance on a majority of cases across 15 datasets. In addition, experiments with a device KB and a nuclear physics dataset demonstrate the model’s ability to learn representations useful for entity completion in diverse settings. This has many downstream use cases, including modeling numerical properties with high accuracy – critical for science applications, which also benefit from the model’s inherent probabilistic nature.

A Framework for Exploring the Consequences of AI-Mediated Enterprise Knowledge Access and Identifying Risks to Workers

People are increasingly interacting with, and being affected by, the deployment of AI systems in the workplace. This is a pressing matter for system designers, policy-makers, and workers themselves, which researchers from Microsoft address in a recent paper: A Framework for Exploring the Consequences of AI-Mediated Enterprise Knowledge Access and Identifying Risks to Workers.  

Organizations generate huge amounts of information that raise challenges associated with the maintenance, dissemination, and discovery of organizational knowledge. Recent developments in AI, notably large language models (LLMs), present a shift in what is possible in this domain. Recent advances could enable more extensive mining, knowledge synthesis, and natural language interaction in relation to knowledge.  

The researchers propose the Consequence-Mechanism-Risk Framework to identify risks to workers associated with deploying AI-mediated enterprise knowledge access systems. The goal is to support those involved in the design and/or deployment of such systems to identify the risks they introduce, the specific system mechanisms that introduce those risks, and the actionable levers to reduce those risks.

Large Search Model: Redefining Search Stack in the Era of LLMs

Modern search engines are built on a stack of different components, including query understanding, retrieval, multi-stage ranking, and question answering, among others. These components are often optimized and deployed independently. In a recent paper: Large Search Model: Redefining Search Stack in the Era of LLMs, researchers from Microsoft introduce a novel conceptual framework called large search model, which redefines the conventional search stack by unifying search tasks with one large language model (LLM). All tasks are formulated as autoregressive text generation problems, allowing for the customization of tasks through the use of natural language prompts. This proposed framework capitalizes on the strong language understanding and reasoning capabilities of LLMs, offering the potential to enhance search result quality while simplifying the cumbersome search stack. To substantiate the feasibility of this framework, the researchers present a series of proof-of-concept experiments and discuss the potential challenges associated with implementing this approach within real-world search systems.
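
As a rough illustration of that framing (the prompt templates and the llm() call below are hypothetical, not from the paper), several traditionally separate search components reduce to different prompts against the same model:

```python
# Casting distinct search-stack tasks as text generation with one model.
def llm(prompt: str) -> str:
    return "<completion of: " + prompt.splitlines()[0] + ">"  # placeholder call

query = "best hiking trails near Seattle"
document = "Mount Si is a popular day hike about 45 minutes east of Seattle..."

prompts = {
    "query_understanding": f"Rewrite this search query to be more specific:\n{query}",
    "ranking": f"Query: {query}\nDocument: {document}\nIs the document relevant? Answer yes or no.",
    "question_answering": f"Answer using only the document.\nDocument: {document}\nQuestion: {query}",
}

results = {task: llm(p) for task, p in prompts.items()}
```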

The post Research Focus: Week of January 8, 2024 appeared first on Microsoft Research.

Read More

Splitwise improves GPU usage by splitting LLM inference phases

Splitwise improves GPU usage by splitting LLM inference phases

The recent surge in large language model (LLM) use is causing significant challenges for cloud providers, requiring them to deploy more GPUs at an unprecedented rate. However, the capacity to provision the power needed to run these GPUs is limited, and with demand for computation surpassing supply, it is not uncommon for user queries to be denied. Therefore, any approach to making the existing infrastructure more efficient—enabling it to serve more queries faster under the same power budget—can have very tangible benefits to both cloud providers and users.

One aspect of LLM inference that currently limits efficient use of resources is that it has two distinct phases with different characteristics: the prompt phase and the token-generation phase. During the prompt phase, LLMs process all user input, or prompts, in parallel, efficiently utilizing GPU compute. However, during the token-generation phase, LLMs generate each output token sequentially and are limited by GPU memory bandwidth. Even when employing state-of-the-art batching mechanisms, the discrepancy between these two phases results in low overall hardware utilization, leading to much higher costs when offering LLMs to users. Figure 1 illustrates the differences between these two phases.

An example of the generative LLM inference process and the two phases associated with it. The initial prompt is “Which is better, pizza or burger?” and it generates the word “Pizza”. The token generation phase generates the words/tokens: “is”, “better”, and “.”. The prompt phase has the following properties: (1) all input tokens are processed in parallel to generate the first output token, (2) compute intensive, and (3) is a smaller part of the end-to-end latency. The token phase is: (1) serialized, (2) memory intensive, and (3) tends to be the majority of the end-to-end latency.
Figure 1. An example of the generative LLM inference process and the two phases associated with it. The prompt phase is computationally intensive, while the token phase is memory intensive.
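
The sketch below makes the two phases concrete. The model is a dummy stand-in (random logits, a plain list as KV-cache) used only to show the control flow; it is not the Splitwise implementation.

```python
# Prompt phase: one parallel pass over the prompt (compute-bound).
# Token phase: one token per step, re-reading the KV-cache (memory-bound).
import random

class DummyLLM:
    vocab_size = 100
    def forward(self, token_ids, kv_cache):
        cache = (kv_cache or []) + list(token_ids)  # pretend KV-cache growth
        logits = [random.random() for _ in range(self.vocab_size)]
        return logits, cache

def generate(model, prompt_ids, max_new_tokens):
    # Prompt phase: every prompt token is processed in one pass, filling the cache.
    logits, kv_cache = model.forward(prompt_ids, kv_cache=None)
    token = max(range(len(logits)), key=logits.__getitem__)
    output = [token]
    # Token-generation phase: strictly sequential, one token at a time.
    for _ in range(max_new_tokens - 1):
        logits, kv_cache = model.forward([token], kv_cache=kv_cache)
        token = max(range(len(logits)), key=logits.__getitem__)
        output.append(token)
    return output

print(generate(DummyLLM(), prompt_ids=[1, 2, 3, 4], max_new_tokens=3))
```

Splitwise runs these two loops on different machine pools, which is what the next section describes.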

Splitting the phases with Splitwise

At Azure Research – Systems, we tackled this by creating Splitwise, a technique designed to optimally utilize available hardware by separating the prompt computation and token-generation phases onto separate machines. This approach is underpinned by the insight that prompt processing and token-generation are distinct in their computational, memory, and power requirements. By separating these two phases, we can enhance hardware utilization during both phases. Our paper, “Splitwise: Efficient Generative LLM Inference Using Phase Splitting,” details our methods for developing and testing this technique, including an exploration of how different types of GPUs perform during each phase.   

To create a sustainable approach for GPU provisioning, we used Splitwise to design GPU clusters with three primary objectives: maximizing throughput, minimizing costs, and reducing power. In addition to separating the two LLM inference phases into two distinct machine pools, we include a third machine pool for mixed batching across the prompt and token phases, sized dynamically based on real-time computational demands. Lastly, we transferred the state context (i.e., KV-cache in the LLM transformer attention layers) from the prompt to the token machines over InfiniBand without any perceivable latency impact to the user. This high-level system architecture is illustrated in Figure 2.

A high-level diagram of Splitwise architecture. Machines maintained in different pools are dedicated to the corresponding phases. The mixed pool grows and reduces according to runtime demand. KV-cache encompassing the state of the query after the prompt phase is transferred from the prompt machines to the token machines over InfiniBand with very low latency.
Figure 2. A high-level diagram of the Splitwise architecture. Machines maintained in different pools are dedicated to the two distinct LLM inference phases. The mixed pool grows and reduces according to runtime demand. KV-cache encompassing the state of the query after the prompt phase is transferred from the prompt machines to the token machines over InfiniBand with very low latency.

MICROSOFT RESEARCH PODCAST

Abstracts: October 23, 2023

On “Abstracts,” Partner Research Manager Andy Gordon & Senior Researcher Carina Negreanu explore new work introducing co-audit, a term for any tool-assisted experience that helps users of generative AI find and fix mistakes in AI output.


Tests show Splitwise maximizes throughput while lowering costs

To evaluate its performance, we used Splitwise to design clusters with different types of GPUs, including NVIDIA DGX-A100 and DGX-H100, while optimizing cost, power, and throughput under specific latency service level agreements (SLAs) for each query. Table 1 shows the machine types we used for each cluster design. Our application of Splitwise encompassed two use cases: code and conversation using the Llama-2-70B (opens in new tab) and BLOOM-176B (opens in new tab) LLMs.

Table 1. Details for the prompt and token machines we used for each cluster design, evaluated with Splitwise. All values are normalized to a baseline of DGX-A100. DGX-H100 capped is a system with all GPUs power-capped to half the maximum power.

Our findings demonstrate that Splitwise successfully achieves our three goals of maximizing throughput, minimizing costs, and reducing power. Through our evaluation, we observed that the Splitwise cluster design can maximize throughput at the same cost compared with an A100 baseline cluster. Moreover, Splitwise delivers much higher throughput while operating within the same provisioned power constraints as the baseline cluster. Figure 3 shows that compared with Baseline-H100, we can achieve 1.4x higher throughput at 20 percent lower cost. Alternatively, we can achieve 2.35x more throughput with the same cost and power budgets.

Results from baseline and Splitwise clusters optimized for throughput, all with the same power constraints. Splitwise-HH requires the least number of machines. Splitwise-HHcap provides the best throughput. Splitwise-AA is the cheapest option.
Figure 3. Results from baseline and Splitwise clusters optimized for throughput, all with the same power constraints.

Looking forward

Splitwise marks a leap toward efficient, high-performance LLM deployments. By separating the prompt and token phases, we can unlock new potential in GPU use. Looking forward, we at Microsoft Azure envision tailored machine pools driving maximum throughput, reduced costs, and power efficiency, and we will continue to focus on making LLM inference efficient and sustainable.

Our approach is now part of vLLM (opens in new tab) and can also be implemented with other frameworks.

Acknowledgements

This work was done in collaboration with our intern, Pratyush Patel from the University of Washington. We also appreciate the help and guidance of Suriya Kalivardhan, Gopi Kumar, and Chetan Bansal.

The post Splitwise improves GPU usage by splitting LLM inference phases appeared first on Microsoft Research.

Read More

Research at Microsoft 2023: A year of groundbreaking AI advances and discoveries

Research at Microsoft 2023: A year of groundbreaking AI advances and discoveries

It isn’t often that researchers at the cutting edge of technology see something that blows their minds. But that’s exactly what happened in 2023, when AI experts began interacting with GPT-4, a large language model (LLM) created by researchers at OpenAI that was trained at unprecedented scale. 

“I saw some mind-blowing capabilities that I thought I wouldn’t see for many years,” said Ece Kamar, partner research manager at Microsoft, during a podcast recorded in April.

Throughout the year, rapid advances in AI came to dominate the public conversation (opens in new tab), as technology leaders and eventually the general public voiced a mix of wonder and skepticism after experimenting with GPT-4 and related applications. Could we be seeing sparks of artificial general intelligence (opens in new tab)—informally defined as AI systems that “demonstrate broad capabilities of intelligence, including reasoning, planning, and the ability to learn from experience (opens in new tab)”? 

While the answer to that question isn’t yet clear, we have certainly entered the era of AI, and it’s bringing profound changes to the way we work and live. In 2023, AI emerged from the lab and delivered everyday innovations that anyone can use. Millions of people now engage with AI-based services like ChatGPT. Copilots (opens in new tab)—AI that helps with complex tasks ranging from search to security—are being woven into business software and services.

Underpinning all of this innovation is years of research, including the work of hundreds of world-class researchers at Microsoft, aided by scientists, engineers, and experts across many related fields. In 2023, AI’s transition from research to reality began to accelerate, creating more tangible results than ever before. This post looks back at the progress of the past year, highlighting a sampling of the research and strategies that will support even greater progress in 2024.

Strengthening the foundations of AI

AI with positive societal impact is the sum of several integral moving parts, including the AI models, the application of these models, and the infrastructure and standards supporting their development and the development of the larger systems they underpin. Microsoft is redefining the state of the art across these areas with improvements to model efficiency, performance, and capability; the introduction of new frameworks and prompting strategies that increase the usability of models; and best practices that contribute to sustainable and responsible AI. 

Advancing models 

  • Researchers introduced Retentive Networks (RetNet), an alternative to the dominant transformer architecture in language modeling. RetNet supports training parallelism and strong performance while making significant gains in inference efficiency. 
  • To contribute to more computationally efficient and sustainable language models, researchers presented a 1-bit transformer architecture called BitNet.
  • Microsoft expanded its Phi family of small language models with the 2.7 billion-parameter Phi-2, which raises the bar in reasoning and language understanding among base models with up to 13 billion parameters. Phi-2 also met or exceeded the performance of models 25 times its size on complex benchmarks.
  • The release of the language models Orca (13 billion parameters) and, several months later, Orca 2 (7 billion and 13 billion parameters) demonstrates how improved training methods, such as synthetic data creation, can elevate small model reasoning to a level on par with larger models.
  • For AI experiences that more closely reflect how people create across mediums, Composable Diffusion (CoDi) takes as input a mix of modalities, such as text, audio, and image, and produces multimodal output, such as video with synchronized audio.
  • To better model human reasoning and speed up response time, the new approach Skeleton-of-Thought has LLMs break tasks down into two parts: creating an outline of a response and then providing details on each point in parallel (a minimal sketch follows this list).
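
A minimal sketch of that two-step pattern, with a placeholder complete() call and a thread pool standing in for parallel model calls:

```python
# Skeleton-of-Thought, sketched: ask for a short outline first, then expand
# every outline point concurrently. Prompts and complete() are placeholders.
from concurrent.futures import ThreadPoolExecutor

def complete(prompt: str) -> str:
    return f"<completion for: {prompt[:40]}...>"  # placeholder LLM call

def skeleton_of_thought(question: str) -> str:
    outline = complete(f"Answer with only a short outline, one point per line: {question}")
    points = [line for line in outline.splitlines() if line.strip()]
    with ThreadPoolExecutor() as pool:  # expand all points in parallel
        details = list(pool.map(
            lambda point: complete(f"Question: {question}\nExpand this outline point: {point}"),
            points))
    return "\n\n".join(details)

print(skeleton_of_thought("Why do transformers parallelize well during training?"))
```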

Advancing methods for model usage

  • AutoGen is an open-source framework for simplifying the orchestration, optimization, and automation of LLM workflows to enable and streamline the creation of LLM-based applications.
  • Medprompt, a composition of prompting strategies, demonstrates that with thoughtful and advanced prompting alone, general foundation models can outperform specialized models, offering a more efficient and accessible alternative to fine-tuning on expert-curated data.
  • The resource collection promptbase offers prompting techniques and tools designed to help optimize foundation model performance, including Medprompt, which has been extended for application outside of medicine.
  • Aimed at addressing issues associated with lengthy inputs, such as increased response latency, LLMLingua is a prompt-compression method that leverages small language models to remove unnecessary tokens.

Developing and sharing best practices 

Accelerating scientific exploration and discovery

Microsoft uses AI and other advanced technologies to accelerate and transform scientific discovery, empowering researchers worldwide with leading-edge tools. Across global Microsoft research labs, experts in machine learning, quantum physics, molecular biology, and many other disciplines are tackling pressing challenges in the natural and life sciences.

  • Because of the complexities arising from multiple variables and the inherently chaotic nature of weather, Microsoft is using machine learning to enhance the accuracy of subseasonal forecasts.
  • Distributional Graphormer (DIG) is a deep learning framework for predicting protein structures with greater accuracy, a fundamental problem in molecular science. This advance could help deliver breakthroughs in critical research areas like materials science and drug discovery.
  • Leveraging evolutionary-scale protein data, the general-purpose diffusion framework EvoDiff helps design novel proteins more efficiently, which can aid in the development of industrial enzymes, including for therapeutics.
  • MOFDiff, a coarse-grained diffusion model, helps scientists refine the design of new metal-organic frameworks (MOFs) for the low-cost removal of carbon dioxide from air and other dilute gas streams. This innovation could play a vital role in slowing climate change.
  • This episode of the Microsoft Research Podcast series Collaborators explores research into renewable energy storage systems, specifically flow batteries, and discusses how machine learning can help to identify compounds ideal for storing waterpower and advancing carbon capture.
  • MatterGen is a diffusion model specifically designed to address the central challenge in materials science by efficiently generating novel, stable materials with desired properties, such as high conductivity for lithium-ion batteries.
  • Deep learning is poised to revolutionize the natural sciences, enhancing modeling and prediction of natural occurrences, ushering in a new era of scientific exploration, and leading to significant advances in sectors ranging from drug development to renewable energy. DeepSpeed4Science, a new Microsoft initiative, aims to build unique capabilities through AI system technology innovations to help domain experts unlock today’s biggest science mysteries. 
  • Christopher Bishop, Microsoft technical fellow and director of the AI4Science team, recently published Deep Learning: Foundations and Concepts, a book that “offers a comprehensive introduction to the ideas that underpin deep learning.” Bishop discussed the motivation and process behind the book, as well as deep learning’s impact on the natural sciences, in the AI Frontiers podcast series.

Maximizing the individual and societal benefits of AI

As AI models grow in capability so, too, do opportunities to empower people to achieve more, as demonstrated by Microsoft work in such domains as health and education this year. The company’s commitment to positive human impact requires that AI technology be equitable and accessible.

Beyond AI: Leading technology innovation

While AI rightly garners much attention in the current research landscape, researchers at Microsoft are still making plenty of progress across a spectrum of technical focus areas.

  • Project Silica, a cloud-based storage system underpinned by quartz glass, is designed to provide sustainable and durable archival storage that’s theoretically capable of lasting thousands of years.
  • Project Analog Iterative Machine (AIM) aims to solve difficult optimization problems—crucial across industries such as finance, logistics, transportation, energy, healthcare, and manufacturing—in a timely, energy-efficient, and cost-effective manner. Its designers believe Project AIM could outperform even the most powerful digital computers.
  • Microsoft researchers proved that 3D telemedicine (3DTM), using Holoportation™ communication technology, could help improve healthcare delivery, even across continents, in a unique collaboration with doctors and governments in Scotland and Ghana.
  • In another collaboration that aims to help improve precision medicine, Microsoft worked with industry and academic colleagues to release Terra, a secure, centralized, cloud-based platform for biomedical research on Microsoft Azure.
  • On the hardware front, Microsoft researchers are exploring sensor-enhanced headphones, outfitting them with controls that use head orientation and hand gestures to enable context-aware privacy, gestural audio-visual control, and animated avatars derived from natural body language.

Collaborating across academia, industries, and disciplines

Cross-company and cross-disciplinary collaboration has always played an important role in research and even more so as AI continues to rapidly advance. Large models driving the progress are components of larger systems that will deliver the value of AI to people. Developing these systems and the frameworks for determining their roles in people’s lives and society requires the knowledge and experience of those who understand the context in which they’ll operate—domain experts, academics, the individuals using these systems, and others.

Engaging and supporting the larger research community

Throughout the year, Microsoft continued to engage with the broader research community on AI and beyond. The company’s sponsorship of and participation in key conferences not only showcased its dedication to the application of AI in diverse technological domains but also underscored its unwavering support for cutting-edge advancements and collaborative community involvement.

Functional programming

  • Microsoft was a proud sponsor of ICFP 2023, with research contributions covering a range of functional programming topics, including memory optimization, language design, and software-development techniques.

Human-computer interaction

  • At CHI 2023, Microsoft researchers and their collaborators demonstrated the myriad and diverse ways people use computing today and will in the future. 

Large language models and ML

  • Microsoft was a sponsor of ACL 2023, showcasing papers ranging from fairness in language models to natural language generation and beyond.
  • Microsoft also sponsored NeurIPS 2023, publishing over 100 papers and conducting workshops on language models, deep learning techniques, and additional concepts, methods, and applications addressing pressing issues in the field.
  • With its sponsorship of and contribution to ICML 2023, Microsoft showcased its investment in advancing the field of machine learning.
  • Microsoft sponsored ML4H (opens in new tab) and participated in AfriCHI (opens in new tab) and EMNLP (opens in new tab), a leading conference in natural language processing and AI, highlighting its commitment to exploring how LLMs can be applied to healthcare and other vital domains.

Systems and advanced networking

Listeners’ choice: Notable podcasts for 2023

Thank you for reading

Microsoft achieved extraordinary milestones in 2023 and will continue pushing the boundaries of innovation to help shape a future where technology serves humanity in remarkable ways. To stay abreast of the latest updates, subscribe to the Microsoft Research Newsletter (opens in new tab) and the Microsoft Research Podcast (opens in new tab). You can also follow us on Facebook (opens in new tab), Instagram (opens in new tab), LinkedIn (opens in new tab), X (opens in new tab), and YouTube (opens in new tab). 

Writers, Editors, and Producers
Kristina Dodge
Kate Forster
Jessica Gartner
Alyssa Hughes
Gretchen Huizinga
Brenda Potts
Chris Stetkiewicz
Larry West

Managing Editor
Amber Tingle

Project Manager
Amanda Melfi

Microsoft Research Global Design Lead
Neeltje Berger

Graphic Designers
Adam Blythe
Harley Weber

Microsoft Research Creative Studio Lead
Matt Corwine

The post Research at Microsoft 2023: A year of groundbreaking AI advances and discoveries appeared first on Microsoft Research.

Read More

Research Focus: Week of December 18, 2023

Research Focus: Week of December 18, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Research Focus
December 18th, 2023

NASerEx: Optimizing Early Exits via AutoML for Scalable Efficient Inference in Big Image Streams

Deep Neural Networks (DNNs) are essentially stacked transformation functions (layers) that generate progressively complex features/encoding. This makes them universal approximators and allows for unprecedented success in complex tasks. This inferential effectiveness comes at the cost of increased computational complexity, making DNNs hard to scale for operational efficiency in AI applications, especially when running on resource-constrained hardware. 

In a recent paper: NASerEx: Optimizing Early Exits via AutoML for Scalable Efficient Inference in Big Image Streams, researchers from Microsoft and their collaborators propose a new framework to address this problem. NASerEx leverages neural architecture search (NAS) with a novel saliency-constrained search space and exit decision metric to learn suitable early exit structures to augment deep neural models for scalable efficient inference on big image streams. Optimized exit-augmented models, with the power of smart adaptive inference, run roughly 2.5x faster with about 4x lower aggregate effective FLOPs and no significant accuracy loss.
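
As a generic illustration of early-exit inference (not the NASerEx search procedure itself), the sketch below attaches a small classifier head to each block and stops as soon as one head is confident; the sizes, heads, and threshold are illustrative.

```python
# Early-exit inference: easy inputs leave at an intermediate head, so the
# remaining blocks never run, which is where the FLOP savings come from.
import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    def __init__(self, threshold=0.9):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(32, 32), nn.ReLU()) for _ in range(4)]
        )
        self.exits = nn.ModuleList([nn.Linear(32, 10) for _ in range(4)])
        self.threshold = threshold

    def forward(self, x):  # x: (1, 32), one sample's features at a time
        for block, exit_head in zip(self.blocks, self.exits):
            x = block(x)
            probs = torch.softmax(exit_head(x), dim=-1)
            if probs.max() >= self.threshold:  # confident enough: exit early
                return probs
        return probs  # otherwise fall through to the final exit

preds = EarlyExitNet()(torch.randn(1, 32))
```

What NASerEx automates is the choice of where these exits go and how the exit decision is made, via neural architecture search.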

Spotlight: On-demand video

AI Explainer: Foundation models ​and the next era of AI

Explore how the transformer architecture, larger models and more data, and in-context learning have helped advance AI from perception to creation.


InsightPilot: An LLM-Empowered Automated Data Exploration System

Effective data exploration requires in-depth knowledge of the dataset and the user intent, and expertise in data analysis techniques. Not being familiar with either can create obstacles that make the process time-consuming and overwhelming.

In a recent paper, InsightPilot: An LLM-Empowered Automated Data Exploration System, researchers from Microsoft address this issue. InsightPilot is a large language model (LLM)-based, automated system designed to simplify the data exploration process. It features a set of carefully designed analysis actions that streamline the data exploration process. Given a natural language question, InsightPilot collaborates with the LLM to issue a sequence of analysis actions, explore the data, and generate insights. The authors demonstrate the effectiveness of InsightPilot in a user study and a case study, showing how it can help users gain valuable insights from their datasets. 
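
A hedged sketch of that interaction loop, with invented action names and a placeholder llm() call rather than InsightPilot’s actual interface:

```python
# The LLM repeatedly picks the next analysis action from a small, carefully
# designed set; each action runs against the data and yields an insight.
def llm(prompt: str) -> str:
    return "summarize"  # placeholder: a real LLM would choose the next action

ACTIONS = {
    "summarize": lambda rows: f"{len(rows)} rows, columns: {sorted(rows[0])}",
    "top_row": lambda rows: max(rows, key=lambda r: r["sales"]),
}

def explore(question, rows, max_steps=3):
    insights = []
    for _ in range(max_steps):
        action = llm(f"Question: {question}\nInsights so far: {insights}\n"
                     f"Pick one action from {list(ACTIONS)}:")
        if action not in ACTIONS:
            break
        insights.append(ACTIONS[action](rows))
    return insights

data = [{"region": "east", "sales": 120}, {"region": "west", "sales": 95}]
print(explore("Which region sells the most?", data))
```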


Boosting Cloud Efficiency: Harnessing Data-Driven Decision-Making and Optimization Techniques

Microsoft’s cloud system serves as the backbone for the daily operations of hundreds of thousands of organizations, driving productivity and collaboration. The foundational infrastructure demands both high reliability and efficiency. In a new blog post, Microsoft’s Systems Innovation team explores some recent innovations to continually enhance hyper-scale cloud capacity efficiency, delivering substantial operational cost savings for customers.

Systems Innovation is a collaboration between Microsoft 365, Microsoft Research and Azure. The research group is focused on leveraging their shared deep workload understanding and combining algorithmic research with AI/machine learning techniques and hardware innovation to improve operational reliability and efficiency.


NeurIPS Large Language Model Efficiency Challenge

Large language models (LLMs) trained on large bodies of text can solve tasks with few supervised examples. These few-shot models have shown state-of-the-art success across natural language processing (NLP) tasks, language translation, standardized exams, and coding challenges, as well as in subjective domains such as chatbots. All of these domains involve bootstrapping a single LLM, referred to as a foundation model, with examples of specific knowledge from the associated task.

The process of updating a model with limited domain-specific data is known as fine-tuning. However, the costs of accessing, fine-tuning and querying foundation models to perform new tasks can be large.

To help democratize access to language models, Microsoft and other industry leaders were pleased to sponsor the NeurIPS Large Language Model Efficiency Challenge, (opens in new tab) which addressed three major issues:

  1. A lack of transparency around model training methods, which leaves the majority of models irreproducible.
  2. The absence of a standard benchmark for evaluating these models side by side.
  3. Insufficient access to dedicated hardware, which prevents widespread availability and use of these models.

The challenge to the community was to adapt a foundation model to specific tasks by fine-tuning on a single GPU, either a 4090 or an A100 (40GB), within a 24-hour time frame, while maintaining high accuracy on those tasks.

Each submission was evaluated for accuracy and computational performance tradeoffs at commodity hardware scales. Insights and lessons were distilled into a set of well documented steps and easy-to-follow tutorials. The machine learning community will have documentation on how to achieve the same performance as winning entries, which will serve as the starting point to help them build their own LLM solutions.

The post Research Focus: Week of December 18, 2023 appeared first on Microsoft Research.

Read More