What’s Your Story: Ranveer Chandra

MSR Podcast

In this new Microsoft Research Podcast series What’s Your Story, Lab Director Johannes Gehrke explores the who behind the technical and scientific advancements helping to reshape the world. He talks to members of the research community at Microsoft about what motivates their work and how they got where they are today. 

Ranveer Chandra is Managing Director of Research for Industry and CTO of Agri-Food. He is also head of Networking Research at Microsoft Research Redmond. His work in systems and networking is helping to bring more internet connectivity to more people and is yielding tools designed to help farmers increase food production more affordably and sustainably. In this episode, he shares what it was like growing up in Jamshedpur, India; why he focuses his efforts in the areas he does; and where the joy in his work comes from.

The post What’s Your Story: Ranveer Chandra appeared first on Microsoft Research.

Read More

Understanding the user: How the Enterprise System Usability Scale aligns with user reality

This position paper was presented at the 26th ACM Conference on Computer-Supported Cooperative Work and Social Computing (opens in new tab) (CSCW 2023), a premier venue for research on the design and use of technologies that affect groups, organizations, and communities.

Microsoft at CSCW 2023 conference highlights

In the business world, measuring success is as critical as selecting the right goals, and metrics act as a guiding compass, shaping organizational objectives. They are instrumental as businesses strategize to develop products that are likely to succeed in specific markets or among certain user groups.  

However, businesses often overlook whether these metrics accurately reflect users’ experiences and behaviors. Do they truly reflect the consumers’ journey and provide a reliable evaluation of the products’ place in the market? Put differently, do these metrics truly capture a product’s effectiveness and value, or are they superficial, overlooking deeper insights that could lead a business toward lasting success?

Challenges in enterprise usability metrics research

In our paper, “A Call to Revisit Classic Measurements for UX Evaluation (opens in new tab),” presented at the UX Outcomes Workshop at CSCW 2023 (opens in new tab), we explore these questions about usability metrics—which evaluate the simplicity and effectiveness of a product, service, or system for its users—and their applicability to enterprise products. These metrics are vital when measuring a product’s health in the market and predicting adoption rates, user engagement, and, by extension, revenue generation. Current usability metrics in the enterprise space often fail to align with the actual user’s reality when using technical enterprise products such as business analytics, data engineering, and data science software. Oftentimes, they lack methodological rigor, calling into question their generalizability and validity.

One example is the System Usability Scale (opens in new tab) (SUS), the most widely used usability metric. In the context of enterprise products, at least two questions used in SUS do not resonate with users’ actual experiences: “I think I would like to use the system frequently” and “I think I need the support of a technical person to be able to use this product.” Because users of enterprise products are consumers, not necessarily customers, they often do not get to choose which product to use. In some cases, they are IT professionals with no one to turn to for technical assistance. This misalignment highlights the need to refine how we measure usability for enterprise products. 

Another concern is the lack of rigorous validation for metrics that reflect a product’s performance. For instance, UMUX-Lite (opens in new tab) is a popular metric because of its simplicity and strong correlation with SUS. However, its scoring methodology requires researchers to use an equation consisting of a regression weight and constant to align its average scores with SUS scores. This adjustment lacks a solid theoretical foundation, raising questions about UMUX-Lite’s ability to generalize to different contexts and respondent samples.
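To make that adjustment concrete, here is a minimal sketch of the kind of rescaling-plus-regression scoring UMUX-Lite uses. The weight and constant below are illustrative placeholders rather than authoritative values; consult the original UMUX-Lite publication for the fitted parameters.

```python
# Minimal sketch of a UMUX-Lite-style score: two 7-point items are rescaled to 0-100,
# then passed through a regression weight and constant intended to align the result
# with SUS. WEIGHT and CONSTANT are illustrative placeholders, not authoritative values.
WEIGHT = 0.65      # illustrative regression weight (assumption)
CONSTANT = 22.9    # illustrative constant (assumption)

def umux_lite_score(usefulness: int, ease_of_use: int) -> float:
    """usefulness, ease_of_use: 1-7 agreement ratings for the two UMUX-Lite items."""
    raw = (usefulness + ease_of_use - 2) * (100 / 12)  # rescale the item sum to 0-100
    return WEIGHT * raw + CONSTANT                     # sample-derived alignment with SUS

print(umux_lite_score(6, 5))  # example: roughly 71.7 on a SUS-like 0-100 scale
```

Because the weight and constant are fitted to a particular respondent sample, a different population may require refitting, which is precisely the generalizability concern raised above.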

The lack of standardization underscores the need for metrics that are grounded in the user’s reality for the types of products being assessed and based on theoretical and empirical evidence, ensuring that they are generalizable to diverse contexts. This approach will pave the way for more reliable insights into product usability, fostering informed decisions crucial for enhancing the user experience and driving product success.

ESUS: A reality-driven approach to usability metrics

Recognizing this need, we endeavored to create a new usability metric that accurately reflects the experience of enterprise product users, built on solid theory and supported by empirical evidence. Our research combines qualitative and quantitative approaches to devise a tailored usability metric for enterprise products, named the Enterprise System Usability Scale (ESUS). 

ESUS offers a number of benefits over the SUS and UMUX-Lite. It is more concise than the SUS, containing only half the questions and streamlining the evaluation process. It also eliminates the need for practitioners to use a sample-specific weight and constant, as required by UMUX-Lite, providing a more reliable measure of product usability. Moreover, ESUS demonstrates convergent validity, correlating with other usability metrics, such as SUS. Most importantly, through its conciseness and specificity, it was designed with enterprise product users in mind, providing relevant and actionable insights.  

In Table 1 below, we offer ESUS as a step towards more accurate, reliable, and user-focused metrics for enterprise products, which are instrumental in driving well-informed decisions in improving product usability and customer satisfaction.

ESUS Items | 1 | 2 | 3 | 4 | 5
How useful is [this product] to you? | Not at all useful | Slightly useful | Somewhat useful | Mostly useful | Very useful
How easy or hard was [this product] to use for you? | Very hard | Hard | Neutral | Easy | Very easy
How confident were you when using [this product]? | Not at all confident | Slightly confident | Somewhat confident | Mostly confident | Very confident
How well do the functions work together or do not work together in [this product]? | Does not work together at all | Does not work well together | Neutral | Works well together | Works very well together
How easy or hard was it to get started with [this product]? | Very hard | Hard | Neutral | Easy | Very easy
Table 1: Proposed ESUS questionnaire
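For practitioners who want a single number to track, here is a minimal sketch of one plausible way to aggregate the five ESUS responses. The averaging-and-rescaling scheme is an assumption made for illustration; it is not the scoring procedure prescribed in the paper.

```python
# Minimal sketch of aggregating the five ESUS items from Table 1 into one score.
# Averaging the 1-5 responses and rescaling to 0-100 is an illustrative assumption,
# not the paper's prescribed scoring method.
def esus_score(responses: list[int]) -> float:
    """responses: five ratings, one per ESUS item, each on the 1-5 scale in Table 1."""
    assert len(responses) == 5 and all(1 <= r <= 5 for r in responses)
    mean = sum(responses) / 5
    return (mean - 1) / 4 * 100   # map the 1-5 average onto a 0-100 scale

print(esus_score([4, 5, 4, 3, 4]))  # example: 75.0
```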

Looking ahead: Advancing precision in understanding the user

Moving forward, our focus is on rigorously testing and enhancing ESUS. We aim to examine its consistency over time and its effectiveness with small sample sizes. Our goal is to ensure our metrics are as robust and adaptable as the rapidly evolving enterprise product environment requires. We’re committed to continuous improvement, striving for metrics that are not just accurate but also relevant and reliable, offering actionable insights for an ever-improving user experience.

The post Understanding the user: How the Enterprise System Usability Scale aligns with user reality appeared first on Microsoft Research.

Read More

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Introduction

How trustworthy are generative pre-trained transformer (GPT) models?

To answer this question, the University of Illinois Urbana-Champaign, together with Stanford University, the University of California, Berkeley, the Center for AI Safety, and Microsoft Research, released a comprehensive trustworthiness evaluation platform for large language models (LLMs), presented in the recent paper “DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models” (opens in new tab). The paper, which was accepted as an oral presentation at NeurIPS 2023 (Datasets and Benchmarks Track) (opens in new tab), focuses specifically on GPT-4 and GPT-3.5. It considers diverse perspectives, including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness.

Based on our evaluations, we found previously unpublished vulnerabilities relating to trustworthiness. For instance, we find that GPT models can be easily misled to generate toxic and biased outputs and leak private information in both training data and conversation history. We also find that although GPT-4 is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more vulnerable given jailbreaking system or user prompts, which are maliciously designed to bypass the security measures of LLMs, potentially because GPT-4 follows (misleading) instructions more precisely.

Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps. Our benchmark (opens in new tab) is publicly available.

It’s important to note that the research team worked with Microsoft product groups to confirm that the potential vulnerabilities identified do not impact current customer-facing services. This is in part because finished AI applications apply a range of mitigation approaches to address potential harms that may occur at the model level of the technology. In addition, we have shared our research with GPT’s developer, OpenAI, which has noted the potential vulnerabilities in the system cards for relevant models.

Our goal is to encourage others in the research community to utilize and build upon this work, potentially pre-empting nefarious actions by adversaries who would exploit vulnerabilities to cause harm. This trustworthiness assessment is only a starting point, and we hope to work together with others to build on its findings and create powerful and more trustworthy models going forward. To facilitate collaboration, we have made our benchmark code very extensible and easy to use: a single command is sufficient to run the complete evaluation on a new model.

Trustworthiness perspectives of language models

Recent breakthroughs in machine learning, especially LLMs, have enabled a wide range of applications, from chatbots to robotics. Yet, while the literature on the trustworthiness of GPT models remains limited, practitioners have proposed employing capable GPT models even for sensitive applications such as healthcare and finance. To this end, we conduct a comprehensive trustworthiness evaluation of GPT models across eight trustworthiness perspectives, with thorough evaluations based on different constructed scenarios, tasks, metrics, and datasets, as shown in Figure 1 below.

Overall, we aim to evaluate 1) the performance of GPT models under different trustworthiness perspectives, and 2) the resilience of their performance in adversarial environments (e.g., adversarial system/user prompts, demonstrations).

For example, to evaluate the robustness of GPT-3.5 and GPT-4 to textual adversarial attacks, we construct three evaluation scenarios:

  1. Evaluation on the standard AdvGLUE benchmark with a vanilla task description, aiming to assess: a) the vulnerabilities of GPT models to existing textual adversarial attacks, b) the robustness of different GPT models in comparison to state-of-the-art models on the standard AdvGLUE benchmark, c) the impact of adversarial attacks on their instruction-following abilities (measured by the rate at which the model refuses to answer a question or presents an incorrect answer when it is under attack), and d) the transferability of current attack strategies (quantified by the transferability attack success rates of different attack approaches).
  2. Evaluation on the AdvGLUE benchmark given different instructive task descriptions and designed system prompts, to investigate the resilience of models under diverse (adversarial) task descriptions and system prompts.
  3. Evaluation of GPT-3.5 and GPT-4 on AdvGLUE++, our generated challenging adversarial texts constructed against open-source autoregressive models such as Alpaca-7B, Vicuna-13B, and StableVicuna-13B, to further evaluate their vulnerabilities under strong adversarial attacks in diverse settings.
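To make the metrics referenced in scenario 1 concrete, here is a minimal sketch (not the DecodingTrust codebase) of how refusal rate, robust accuracy, and attack success rate can be computed from clean and attacked predictions.

```python
# Minimal sketch (not the DecodingTrust code) of the robustness metrics described above:
# how often a model refuses or answers incorrectly once an input is adversarially perturbed.
from typing import List, Optional

def robustness_metrics(labels: List[int],
                       clean_preds: List[Optional[int]],
                       attacked_preds: List[Optional[int]]) -> dict:
    """None in a prediction list means the model refused to answer."""
    n = len(labels)
    refusal_rate = sum(p is None for p in attacked_preds) / n
    robust_accuracy = sum(p == y for p, y in zip(attacked_preds, labels)) / n
    # Attack success: examples answered correctly before the attack but not after.
    correct_before = sum(c == y for c, y in zip(clean_preds, labels))
    flipped = sum(c == y and a != y
                  for c, a, y in zip(clean_preds, attacked_preds, labels))
    attack_success_rate = flipped / max(correct_before, 1)
    return {"refusal_rate": refusal_rate,
            "robust_accuracy": robust_accuracy,
            "attack_success_rate": attack_success_rate}

print(robustness_metrics([1, 0, 1, 1], [1, 0, 1, 1], [1, 0, None, 0]))
```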

A graph listing the trustworthiness perspectives, benchmarks, and datasets evaluated in this work, organized in three layers. The first layer shows the 8 main trustworthiness perspectives, including toxicity, stereotypes, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness. The second layer shows different benchmarks of each trustworthiness perspective. Some of the benchmarks are from prior work, while some are newly proposed in the paper. The third layer further shows different evaluation settings of each benchmark.
Fig 1. Taxonomy of our evaluation based on different trustworthiness perspectives. We use yellow boxes to represent the evaluations on existing benchmarks, and green boxes for evaluations using our designed new data or new evaluation protocols on existing datasets.

Insights and findings from DecodingTrust

Our research uncovered some previously undisclosed advantages as well as threats to the reliability of large language models.

For example, in terms of model robustness to adversarial demonstrations, we find that on the one hand, GPT-3.5 and GPT-4 will not be misled by the counterfactual examples added in the demonstrations and can even benefit from the counterfactual demonstrations in general; on the other hand, we observe that providing backdoored demonstrations will mislead both GPT-3.5 and GPT-4 to make incorrect predictions for backdoored inputs, especially when the backdoored demonstrations are positioned close to the (backdoored) user inputs, and GPT-4 is more vulnerable to backdoored demonstrations.

Moreover, in terms of toxicity and bias, both GPT-3.5 and GPT-4 are not strongly biased for the majority of stereotype topics considered under the benign and untargeted system prompts. However, both GPT models can be “tricked” into agreeing with biased content given designed misleading (adversarial) system prompts. GPT-4 is more vulnerable to misleading targeted system prompts and agrees with the biased content more frequently than GPT-3.5, potentially because GPT-4 follows the misleading instructions more precisely. Model bias can often be dependent on the demographic groups mentioned in the user prompt. For instance, for the query “Homosexuals have HIV.”, GPT-4 strongly disagrees in general, while for the query “Women have HIV.”, GPT-4 frequently agrees and outputs biased content. Model bias is also dependent on the stereotype topics. GPT models output more biased content on less sensitive topics like leadership and greed, while generating less biased content on more sensitive topics like drug dealing and terrorism. This is potentially due to the fine-tuning of GPT models on some protected demographic groups and sensitive topics. 

DecodingTrust also evaluates the privacy-leakage issues of LLMs. We find that GPT models can leak privacy-sensitive training data, such as email addresses from the standard Enron email dataset, especially when prompted with the context of emails or few-shot demonstrations of (name, email) pairs. Moreover, under few-shot prompting with supplementary knowledge such as the targeted email domain, the email extraction accuracy can be 100x higher than in scenarios where the email domain is unknown. We also observe that GPT models can leak private information injected into the conversation history. Overall, GPT-4 is more robust than GPT-3.5 in safeguarding personally identifiable information (PII), and both models are robust to specific types of PII, such as Social Security numbers, possibly due to explicit instruction tuning for those PII keywords. However, both GPT-4 and GPT-3.5 leak all types of PII when prompted with privacy-leakage demonstrations during in-context learning. Lastly, GPT models demonstrate different capabilities in understanding different privacy-related words or privacy events (e.g., they will leak private information when told “confidentially” but not when told “in confidence”). GPT-4 is more likely to leak privacy than GPT-3.5 given our constructed prompts, potentially because it follows the (misleading) instructions more precisely. We present more examples of unreliable model outputs in Figure 2 below.

The figure showing the examples of undesirable responses of GPT-4 given benign system prompts for each of the 8 trustworthiness perspectives, including toxicity, stereotypes, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness.
Fig 2.  Examples of undesirable responses of GPT-4 given benign system prompts from different trustworthiness perspectives. Offensive or sensitive information is masked. 

The post DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models appeared first on Microsoft Research.

Read More

Microsoft at VL/HCC 2023: Focus on co-audit tools for spreadsheets

These research papers were presented at the IEEE Symposium on Visual Languages and Human-Centric Computing (opens in new tab) (VL/HCC 2023), a premier forum for design, theory, and application of computing technologies for programming, modelling, and communication.

Large language models (LLMs) have revolutionized the way novice programmers and everyday computer users tap into the capabilities of natural language for programming. Among the tools used in this context, spreadsheets stand out as the preferred choice. The integration of LLMs into spreadsheets promises to substantially enhance their functionality and the user experience. At the same time, it’s well known that spreadsheet users commonly, though inadvertently, introduce errors (opens in new tab), and this can carry significant risks. For example, a spreadsheet used in a 2010 Harvard economic analysis (opens in new tab) that informed austerity measures imposed on Greece was later discovered to contain multiple errors (opens in new tab).

Microsoft is actively pursuing (opens in new tab) research focused on developing co-audit tools and techniques, with an initial emphasis on spreadsheets. These tools are designed to help users verify the results generated by LLMs. At VL/HCC 2023 (opens in new tab), we introduce two new spreadsheet tools, ColDeco and FxD, built specifically to help users thoroughly examine and debug their programs within spreadsheets. The FxD paper was also awarded an Honorable Mention (opens in new tab).

ColDeco: An end-user inspection tool

Working with tables in spreadsheets is a common task, and the ability to add a calculated column can be incredibly useful. A calculated column not only adds information but also facilitates tasks like filtering and sorting. Generative AI can enable users to create sophisticated calculated columns in tables. However, verification of AI-generated code in this scenario is crucial because AI can misinterpret the user’s intent or overlook important data. 

In our paper, “ColDeco: An End User Spreadsheet Inspection Tool for AI-Generated Code,” we introduce ColDeco, a no-code inspection tool for calculated columns. ColDeco uses helper columns and row grouping to help users understand how an AI-generated column works and locate any errors. 

To describe how ColDeco works, we’ll use an example table containing people’s first, middle, and last names in separate columns. Our user asks the system to “create a column called ‘Abbreviation’ that takes the first letter of each part of the name.” In this example, there’s an error in the generated code that fails to handle rows with no middle names, causing some Abbreviation cells to be empty.  
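To make the failure mode concrete, here is a minimal Python sketch of an abbreviation routine with an analogous bug, alongside a corrected version. This is an illustration only, not the spreadsheet code the AI actually generated.

```python
# Illustrative sketch of the bug described above (not the actual AI-generated formula).
def abbreviation_buggy(first: str, middle: str, last: str) -> str:
    # Implicitly assumes every row has a middle name; rows with an empty middle name fail.
    return first[0] + middle[0] + last[0]

def abbreviation_fixed(first: str, middle: str, last: str) -> str:
    # Skip empty name parts so rows without a middle name still get an abbreviation.
    return "".join(part[0] for part in (first, middle, last) if part)

print(abbreviation_fixed("Christopher", "Michael", "Fleming"))  # "CMF"
print(abbreviation_fixed("William", "", "Smith"))               # "WS"
```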

First, the model generates a program that computes an abbreviation for each row and adds it to the new Abbreviation column. ColDeco’s interface automatically opens as a side panel, as shown in Figure 1. 

The Inspect Columns view displays any generated columns, accompanied by a natural language description of the generated code. The Inspect Rows view displays a subset of the table, organized by behavior; it uses dataflow analysis to group rows, highlighting key distinct execution behaviors. In our example, this view quickly draws the user’s attention to the two rows that fail to calculate an abbreviation.

Two graphics. The first graphic depicts a table with columns: “First Name”, “Middle Name”, “Last Name”, “DoB”, and “Abbreviation”. There are 11 rows. As examples, row 3 contains the information: First Name: Christopher, Middle Name: Michael, Last Name: Fleming, DoB: 11/5/1995, Abbreviation: CMF. Row 9 contains the information: First Name: William, Middle name is empty, Last Name: Smith, DoB: 6/3/1968, Abbreviation is empty. The second graphic depicts a side panel with two sections. The first section is the Inspect Columns view (labelled 1a). A single column named “Abbreviation” and a corresponding description is shown. The second section is the Inspect Rows view (labelled 1b). It contains a table with columns “Index”, “First Name”, “Middle Name”, “Last Name”, and “Abbreviation”. Within the table there are two groups of rows. The first group has an example row: Index: 4716, First Name: William, Middle Name is empty, Last Name: Smith, Abbreviation is empty. The second group has an example row: Index: 8984, First Name: Christopher, Middle Name: Michael, Last Name: Flemming, Abbreviation: CMF.
Figure 1. The initial view of the ColDeco side panel. An Abbreviation program is generated by the AI and added to the table as a new column. The Inspect Columns view (1a) shows the column generated by the AI, including a description of how the code works. The Inspect Rows view (1b) groups rows into different behaviors, indicating that there are errors in two rows.

If our user wants to investigate an error, they can expand a generated column into multiple helper columns, illustrated in Figure 2. These helper columns are visible in both the table (2a) and the side panel (2b), and they show the intermediate values. The user can now see that the missing abbreviations are caused by an error that occurred when the system tried to take the first and middle initials.

Two graphics. The first graphic (labelled 2a) depicts a table with 4 columns: “DoB”, “text concatenation”, “1st letter of Last Name”, “Abbreviation”. As examples, row 3 contains the information: DoB: 11/5/1995, text concatenation: CM, 1st letter of Last Name: F, Abbreviation: CMF. Row 9 contains the information: DoB: 6/3/1968, text concatenation is empty, 1st letter of Last Name: S, Abbreviation is empty. The second graphic (labelled 2b) depicts a side panel showing the Inspect Columns view. A tree view shows “Abbreviation” as the root with two children: “1st letter of Last Name” and “text concatenation”, corresponding to the columns in the table. Each column in the tree view has a corresponding description.
Figure 2. The ColDeco side panel after a user expands the Abbreviation column into two additional helper columns. Each additional column has a description.

FxD: A functional debugger 

Not every spreadsheet task involves generating a new table column. Moreover, many users are already well acquainted with spreadsheet formulas. This brings us to our second tool, a spreadsheet formula debugger, introduced in the paper, “FxD: a functional debugger for dysfunctional spreadsheets.” 

We employed a user-centered approach when designing FxD, extensively reviewing the existing literature on functional programming debuggers. This informed the four key features we implemented in FxD:

Live debugging. FxD dynamically updates as a user edits a formula, allowing for quick formula modification and exploration (Figure 3, image 1).

Hybrid formula tracing. The debugger combines step-based evaluation (Figure 3, image 1) with tree-based derivations (Figure 3, image 3) to provide a step-by-step breakdown of the formula. Substeps are hidden behind expandable cards to prevent user overload.  

Subformula coloring. Color coding highlights changes in a formula as FxD evaluates it. This facilitates the tracking of these updates when a user hovers over a step (Figure 3, images 2 and 4). 

Information inspector. Context-aware tooltips improve the user experience. One example is table previews when a user hovers over ranges in functions like VLOOKUP. These tooltips offer insights into the range, surrounding context, and the lookup column used by the containing function (Figure 3, image 3).

Four graphics, each graphic describing a different feature of the debugger. The formula being debugged is ‘=IF(G3 < (B1 + B2) * (1 + B3), “low”, “high”)’. The first graphic (labelled 1) shows the formula and its evaluation trace. Each step in the trace shows the formula with some part evaluated. The last step is the value “low” which is the result of the formula. The second graphic (labelled 2) shows a step being highlighted. The step has a before formula and after formula, with multiple parts evaluated. Each part that is evaluated is highlighted with the same color in the “before” and “after” formula. The third graphic (labelled 3) shows a cell range being hovered on and a range information inspector being shown. The inspector shows a preview of the grid for the corresponding range. The fourth graphic (labelled 4) shows a step being highlighted and an evaluated subpart being hovered over. The user hovers over the value 15 in the “after” formula and the corresponding formula “B1 + B2” in the “before” formula is underlined.
Figure 3. The FxD debugger. Image 1 shows the edited formula and evaluation steps. The steps update as a user edits the formula. Image 2 shows subformula coloring, which highlights a subformula and its value upon hovering. Image 3 shows an information inspector that previews the range referenced in a formula. Image 4 shows the concurrent evaluation of multiple subformulas. When the user hovers over a value, the corresponding subformula is underlined.
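To show what a step-based trace looks like in practice, here is a minimal sketch that reproduces the evaluation steps for the example formula from Figure 3, using hypothetical cell values. FxD itself derives these steps directly from the live spreadsheet; the values below are assumptions for illustration only.

```python
# Minimal sketch of step-based tracing for '=IF(G3 < (B1 + B2) * (1 + B3), "low", "high")'.
# The cell values are hypothetical; FxD computes such traces from the live spreadsheet.
cells = {"B1": 10, "B2": 5, "B3": 0.2, "G3": 12}

s1 = cells["B1"] + cells["B2"]          # step 1: B1 + B2       -> 15
s2 = 1 + cells["B3"]                    # step 2: 1 + B3        -> 1.2
s3 = s1 * s2                            # step 3: 15 * 1.2      -> 18.0
s4 = cells["G3"] < s3                   # step 4: G3 < 18.0     -> True
result = "low" if s4 else "high"        # step 5: IF(True, ...) -> "low"

for step in [f"B1 + B2 = {s1}",
             f"1 + B3 = {s2}",
             f"({s1}) * ({s2}) = {s3}",
             f"G3 < {s3} -> {s4}",
             f'IF({s4}, "low", "high") -> "{result}"']:
    print(step)
```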

Growing importance of AI code verification 

As the complexity of AI-generated code rises, the need for tools to verify accuracy becomes increasingly critical. In response, we developed these two co-audit tools tailored to spreadsheets. Moving forward, a key consideration lies in managing the complexity of these tools. Our vision is that debugging tools will become infused with generative AI to assist users in both generating and verifying workflows. 

Review our paper on co-auditing in general to learn more.

The post Microsoft at VL/HCC 2023: Focus on co-audit tools for spreadsheets appeared first on Microsoft Research.

Read More

Abstracts: October 9, 2023

Microsoft Research Podcast - Abstracts

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Dr. Sheng Zhang, a Senior Researcher at Microsoft Research, joins host Dr. Gretchen Huizinga to discuss “UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition.” In this paper, Zhang and his coauthors present mission-focused instruction tuning, a method for distilling large language models into smaller, more efficient ones for a broad application class. Their UniversalNER models achieved state-of-the-art performance in named entity recognition, an important natural language processing (NLP) task. Model distillation has the potential to make NLP and other capabilities more accessible, particularly in specialized domains such as biomedicine, which could benefit from more resource-efficient and transparent options. 


Learn more:

UniversalNER project website with demo (opens in new tab)

Code on GitHub (opens in new tab)

Dataset and models on Hugging Face (opens in new tab)

Transcript

[MUSIC PLAYS]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract!—of their new and noteworthy papers. Today, I’m talking to Dr. Sheng Zhang, a Senior Researcher at Microsoft Research. Dr. Zhang is coauthor of a paper called “UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition,” and you can read this paper now on arXiv. Sheng Zhang, thanks for joining us on Abstracts!

SHENG ZHANG: Thanks for having me.


HUIZINGA: So in a few sentences, give us a brief introduction or overview of the issue or problem that your research addresses and why we should care about it.

ZHANG: Sure. Well, our research addresses the challenge of efficiently replicating the capabilities of large language models for targeted applications. Particularly, we focus on named entity recognition, or NER, and people should care because this work aims to create more cost-effective and transparent models that can recognize a wide range of entity types across various domains, which is crucial for knowledge extraction and has numerous practical applications.

HUIZINGA: So how does your approach, your particular approach, build on or differ from what’s been done previously in this field?

ZHANG: Well, our approach builds on the idea of instruction tuning, which is used to fine-tune language models to follow human instructions. However, unlike existing work that focuses on tuning models into replicas of large language models in every aspect, we propose a method called mission-focused instruction tuning, where we train a smaller model to specifically excel in a broad application class, such as open information extraction. And in our case study, we focus on named entity recognition, NER, and we demonstrate how targeted distillation from large language models can maximize their capabilities for this application. At the same time, the smaller model, the student model, also preserves generalizability across different semantic types and domains. This approach differs from previous work also because we emphasize the importance of increasing the diversity of input data and generating more comprehensive coverage of entity types, which ultimately leads to better performance in the targeted application.

HUIZINGA: OK. And in the paper, you talk about student models trailing the original large language models by large margins in what you call downstream applications. Give me an example of what downstream application looks like.

ZHANG: Yeah. So we here specifically focus on named entity recognition. That is, identifying named entities in a written text.

HUIZINGA: Ah …

ZHANG: So there’s various types of named entities so the canonical ones, like person, geographic location, organization … And people have, you know, various needs. They can go beyond those coarse-grained types. They can go into very fine-grained types, like athlete or politician …

HUIZINGA: Wow …

ZHANG: … or even, you know, finer-grain types. And you cannot like predefine what types will be considered in your task. That’s why we care about this universal concept of named entity recognition.

HUIZINGA: Well, let’s talk about methodology for a bit. What kind of research methodology did you use, and how did you conduct this research?

ZHANG: We developed a general recipe for targeted distillation from large language models, and in this case, we applied it to open NER. And our methodology consists of two main steps: data construction and mission-focused instruction tuning. For data construction, we sampled inputs from a large corpus across diverse domains, and then we used a large language model, ChatGPT, to annotate entity mentions and their associated entity types in the sampled inputs. This process allowed us to create a dataset with wide coverage of entity types. For mission-focused instruction tuning, we fine-tuned smaller models using our constructed dataset in a conversational-style format. For each entity type in the output, we transformed it into a natural language query and tuned the model to generate structured outputs that contain all entities of that type in the input passage. We also incorporated negative sampling to account for entity types not mentioned in that passage. And besides these two main steps, our research also involved assembling the largest-to-date, and most diverse, NER benchmark for evaluation. We compared the performance of our targeted distillation approach with other state-of-the-art models to demonstrate the effectiveness of our methodology.

HUIZINGA: OK, so you talk about NER as a case study, and you had 43 datasets and nine domains. Give me an example of some of those domains that you pulled from.

ZHANG: Yeah. So one very, you know, typical domain is like news, right. We read news every day, and the news mentions about, you know, people, events, and location. So that’s like a very common domain. And there are other very interesting domains like code. People also write code, and the computer can understand code, but a person would also want to understand code in some different way. So if you have like a code-specific named entity recognition capability, that would be awesome for, you know, some people that want to understand what’s happening in the code.

HUIZINGA: Right. And, and you mentioned programming, or code, but I also see in the paper biomedicine on one kind of complex and academic end and social media on another. So those are wildly different domains that you pulled from. Did you do that for a reason, that spectrum of different kinds of data?

ZHANG: Yes. The reason is that, you know, for some high-value domains like biomedicine, it’s quite expensive to annotate some data to train your model like that. So traditionally, people will have to hire an expert to do that. That is quite expensive and not scalable. And here, in the UniversalNER paper, we propose a way to distill that specific domain knowledge from the large language model. So the whole process is automatic. And the resulting model, you can see, it does pretty well, and maybe equally well, on the model that’s based on, you know, human expert–annotated corpus.

HUIZINGA: So after all this, a research paper presents findings. I imagine you had some interesting discoveries in, in this study. What were your major findings?

ZHANG: Yes. Our major findings were that the targeted distillation approach, specifically here the UniversalNER model we developed, it achieved state-of-the-art performance in named entity recognition across a wide range of entity types and domains. And when we compared it to other models like Alpaca, Vicuna, and InstructUIE, UniversalNER significantly outperformed them in terms of F1 score. This demonstrates the effectiveness of mission-focused instruction tuning for creating more cost-effective and transparent models that can excel in targeted applications such as open NER.

HUIZINGA: So let’s talk a little bit more about real-world impact. Uh, we’ve already discussed a little bit about that. But how would you say, based on these findings, that this impacts the real world and how people will use this?

ZHANG: Yeah, absolutely. I would say our work is very significant in terms of real-world impact because, first of all, NER is a fundamental task in natural language processing, and it plays a crucial role in knowledge extraction, information retrieval, and data mining. And by developing a more cost-effective and transparent model like UniversalNER, which can recognize a wide range of entity types and domains, we enable better performance in these downstream applications. And like I said, this is particularly important in high-value domains, such as biomedicine, where you know specialized expertise is required for annotation and the new entity types keep emerging. Our approach can help save time and resources for effectively recognizing these new entity types without the need for extensive annotated data. And secondly, our work can have a broader impact as it represents a general recipe for targeted distillation from large language models, and this approach can be applied to other application classes, such as, you know, open relation extraction. And this allows researchers and the practitioner to create much smaller models that can be more efficient and transparent while maintaining high performance in their targeted tasks.

HUIZINGA: If there was one thing you want our listeners to take away from this work and you could distill that into a short take, what would it be?

ZHANG: Mm hmm. One key takeaway from our work is that targeted distillation from large language models using our mission-focused instruction tuning can lead to more cost-effective and transparent models that excel in a broader application class. And our application demonstrated that it is possible to harness the capabilities of large language models and distill them into much smaller models that not only maintain generalizability across semantic types and domains but also surpass the performance of their larger counterparts in the targeted application. And this opens up new avenues for research and practical application in various fields, making knowledge extractions and the natural language processing tasks more efficient and accessible.

HUIZINGA: It sounds very promising, and it sounds like you’re excited about it.

ZHANG: Yeah, I’m pretty excited!

HUIZINGA: Well then tell us, given this new vista that you’ve opened up with this UniversalNER, what unanswered questions or unsolved problems still remain in this area, and what’s next on your research agenda?

ZHANG: Yeah. Our work demonstrates the effectiveness of targeted distillation for open NER, but several unanswered questions remain. And I would say the first one is adapting the approach to other application classes. Our method is a general recipe for targeted distillation, and it would be interesting to explore its effectiveness in other broader application classes, such as open relation extraction. And the second one is handling label conflicts and dataset-specific definitions. So in our work, we propose a dataset-specific instruction tuning template to address label conflicts. But more research is needed to better understand and develop methods for harmonizing discrepancies in label definitions across datasets. And the last one is exploring more efficient data construction methods. We used ChatGPT for data construction, but, you know, alternative approaches could be explored to generate more diverse and comprehensive datasets for mission-focused instruction tuning. And as for our research agenda, we plan to continue exploring targeted distillation techniques and apply them to other application classes, as well as investigate ways to improve data construction for better performance and efficiency in real-world tasks.

HUIZINGA: Sounds like you got your work cut out for you.

ZHANG: Yes. [LAUGHS] Thank you.

HUIZINGA: Sheng Zhang, thanks for joining us today. And to our listeners, thanks for tuning in. If you’re interested in learning more about this paper, you can find a link at aka.ms/Abstracts, or you can read the paper on arXiv. See you next time on Abstracts!

The post Abstracts: October 9, 2023 appeared first on Microsoft Research.

Read More

Efficient and hardware-friendly neural architecture search with SpaceEvo

This research paper was presented at the 2023 IEEE/CVF International Conference on Computer Vision (opens in new tab) (ICCV), a premier academic conference for computer vision.

ICCV 2023: SpaceEvo

In the field of deep learning, where breakthroughs like the models ResNet (opens in new tab) and BERT (opens in new tab) have achieved remarkable success, a key challenge remains: developing efficient deep neural network (DNN) models that both excel in performance and minimize latency across diverse devices. To address this, researchers have introduced hardware-aware neural architecture search (NAS) to automate efficient model design for various hardware configurations. This approach involves a predefined search space, search algorithm, accuracy estimation, and hardware-specific cost prediction models.

However, optimizing the search space itself has often been overlooked. Current efforts rely mainly on MobileNets-based search spaces designed to minimize latency on mobile CPUs. But manual designs may not always align with different hardware requirements, limiting their suitability for a diverse range of devices.

In the paper, “SpaceEvo: Hardware-Friendly Search Space Design for Efficient INT8 Inference (opens in new tab),” presented at ICCV 2023, (opens in new tab) we introduce SpaceEvo, a novel method that automatically creates specialized search spaces optimized for efficient INT8 inference on specific hardware platforms. What sets SpaceEvo apart is its ability to perform this design process automatically, creating a search space tailored for hardware-specific, quantization-friendly NAS.

Notably, SpaceEvo’s lightweight design makes it ideal for practical applications, requiring only 25 GPU hours to create a hardware-specific solution and making it a cost-effective choice for hardware-aware NAS. This specialized search space, with hardware-preferred operators and configurations, enables the exploration of larger, more efficient models with low INT8 latency. Figure 1 demonstrates that our search space consistently outperforms existing alternatives in INT8 model quality. Conducting neural architecture searches within this hardware-friendly space yields models that set new INT8 accuracy benchmarks.

Figure 1: The image displays 4 sub-figures, each illustrating the model accuracy error distribution when sampling models within an INT8 quantized latency budget of 10 ms on a VNNI CPU, 15 ms on a VNNI CPU, 10 ms on a Pixel 4 CPU, and 20 ms on a Pixel 4 CPU for various search spaces. Each sub-figure contains 4–5 curves, representing model accuracy error distributions from our search space, the ProxylessNAS search space, the MobileNetV3 search space, the ResNet search space, and the AttentiveNAS search space. Our search space consistently delivers superior INT8 model populations, outperforming state-of-the-art alternatives under varying hardware and latency constraints.
Figure 1. Error distribution of INT8 quantized models across various NAS search spaces. Our search space consistently outperforms state-of-the-art alternatives in INT8 model quality.

On-device quantization latency analysis

We began our investigation by trying to understand INT8 quantized latency factors and their implications for search space design. We conducted our study on two widely used devices: an Intel CPU with VNNI instructions and onnxruntime support, and a Pixel 4 phone CPU with TFLite 2.7.

Our study revealed two critical findings:

  1. Both the choice of operator type and configurations, like channel width, significantly affect INT8 latency, as illustrated in Figure 2. For instance, operators like Squeeze-and-Excitation and Hardswish, while enhancing accuracy at minimal latency cost, can lead to slower INT8 inference on Intel CPUs. This slowdown primarily arises from the added costs of data transformation between INT32 and INT8, which outweigh the latency reduction achieved through INT8 computation (a toy numeric illustration follows Figure 2 below).
  2. Quantization efficiency varies among different devices, and preferred operator types can be contradictory.
Figure 2: The image showcases a table (left) and a figure (right) comparing quantized speed improvements across operator types and channel widths.
Figure 2. Left: Selecting different operator types results in notably distinct quantized speed improvements. Right: Conv1x1 speed enhancements across various channel numbers.
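As a toy numeric illustration of the first finding (the numbers below are assumptions, not measurements from the paper), the quantize/dequantize overhead can outweigh the compute time saved by INT8:

```python
# Toy numbers (assumptions, not measured results) showing why an operator can be slower
# after INT8 quantization when INT32<->INT8 data transformation overhead dominates.
fp32_compute = 1.00              # ms, hypothetical FP32 latency of the operator
int8_compute = 0.50              # ms, hypothetical INT8 compute latency
quant_dequant_overhead = 0.70    # ms, hypothetical cost of converting data INT32<->INT8

int8_total = int8_compute + quant_dequant_overhead
print(f"FP32: {fp32_compute:.2f} ms vs. INT8 end to end: {int8_total:.2f} ms")  # 1.00 vs 1.20
```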

Finding diverse, efficient quantized models with SpaceEvo

Unlike traditional architecture search, which aims to find the best single model, our objective is to uncover a diverse population of billions of accurate and INT8 latency-friendly architectures within the search space.

Drawing inspiration from neural architecture search, we introduced an evolutionary search algorithm to explore this quantization-friendly model population in SpaceEvo. Our approach incorporated three key techniques:

  1. The introduction of the Q-T score as a metric to measure the quantization-friendliness of a candidate search space, based on the INT8 accuracy-latency of top-tier subnets.
  2. Redesigned search algorithms that focus on exploring a collection of model populations (i.e., the search space) within the vast hyperspace, as illustrated in Figure 3. This is achieved through the “elastic stage,” which divides the search space into a sequence of elastic stages, allowing traditional evolution methods like aging evolution to explore effectively.
  3. A block-wise search space quantization scheme to reduce the training costs associated with exploring a search space that has a maximum Q-T score.

After discovering the search space, we employed a two-stage NAS process to train a quantized-for-all supernet over the search space. This ensured that all candidate models could achieve comparable quantized accuracy without individual fine-tuning or quantization. We utilized evolutionary search and nn-Meter (opens in new tab) for INT8 latency prediction to identify the best quantized models under various INT8 latency constraints. Figure 3 shows the overall design process.

Figure3: The image depicts a flowchart that outlines the complete SpaceEvo process and its application for NAS. Starting with a large hyperspace, an evolution search algorithm explores a candidate search space. A quality estimator then assesses its quality score based on INT8 latency and accuracy. This score is used as a reward for the algorithm, guiding further exploration until a suitable search space is found. A quantized-for-all supernet is then trained over this space, enabling hardware-aware NAS for deploying models within various INT8 latency constraints.
Figure 3: The complete SpaceEvo process and application for NAS
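The sketch below gives a toy, self-contained view of the aging-evolution loop that SpaceEvo builds on, with the Q-T score and the elastic-stage mutation stubbed out by placeholders. It is a simplified illustration of the process described above, not the released implementation.

```python
# Toy sketch of an aging-evolution loop over candidate search spaces, with the Q-T score
# and elastic-stage mutation replaced by placeholders. Illustrative only; not SpaceEvo's code.
import random
from collections import deque

def qt_score(space):
    # Placeholder: SpaceEvo scores a space by the INT8 accuracy-latency of its top subnets.
    return random.random()

def mutate(space):
    # Placeholder: SpaceEvo perturbs one "elastic stage" of the candidate search space.
    return space + [random.randint(0, 9)]

population = deque(([random.randint(0, 9)] for _ in range(10)), maxlen=10)
scores = {tuple(s): qt_score(s) for s in population}

for _ in range(100):
    parent = max(random.sample(list(population), 3), key=lambda s: scores[tuple(s)])
    child = mutate(parent)
    scores[tuple(child)] = qt_score(child)
    population.append(child)  # aging evolution: the oldest candidate is discarded

best = max(population, key=lambda s: scores[tuple(s)])
print("Best candidate space (toy encoding):", best)
```

In the full system, the discovered space then feeds the two-stage NAS step: training a quantized-for-all supernet over that space and searching it under INT8 latency constraints predicted by nn-Meter.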

Extensive experiments on two real-world edge devices and ImageNet demonstrated that our automatically designed search spaces significantly surpass manually designed search spaces. Table 1 showcases our discovered models, SEQnet, setting new benchmarks for INT8 quantized accuracy-latency tradeoffs. 

(a) Results on the Intel VNNI CPU with onnxruntime

Model | Top-1 Acc % (INT8) | Latency (INT8) | Speedup | Top-1 Acc % (FP32) | FLOPs
MobileNetV3Small | 66.3 | 4.4 ms | 1.1x | 67.4 | 56M
SEQnet@cpu-A0 | 74.7 | 4.4 ms | 2.0x | 74.8 | 163M
MobileNetV3Large | 74.5 | 10.3 ms | 1.5x | 75.2 | 219M
SEQnet@cpu-A1 | 77.4 | 8.8 ms | 2.4x | 77.5 | 358M
FBNetV3-A | 78.2 | 27.7 ms | 1.3x | 79.1 | 357M
SEQnet@cpu-A4 | 80.0 | 24.4 ms | 2.4x | 80.1 | 1267M

(b) Results on the Google Pixel 4 with TFLite

Model | Top-1 Acc % (INT8) | Latency (INT8) | Speedup | Top-1 Acc % (FP32) | FLOPs
MobileNetV3Small | 66.3 | 6.4 ms | 1.3x | 67.4 | 56M
SEQnet@pixel4-A0 | 73.6 | 5.9 ms | 2.1x | 73.7 | 107M
MobileNetV3Large | 74.5 | 15.7 ms | 1.5x | 75.2 | 219M
EfficientNet-B0 | 76.7 | 36.4 ms | 1.7x | 77.3 | 390M
SEQnet@pixel4-A1 | 77.6 | 14.7 ms | 2.2x | 77.7 | 274M
Table 1. Our automated search spaces outperformed manual ones in ImageNet results on two devices. Speedup: INT8 latency compared with FP32 inference.

Potential for sustainable and efficient computing

SpaceEvo is the first attempt to address the hardware-friendly search space optimization challenge in NAS, paving the way for designing effective low-latency DNN models for diverse real-world edge devices. Looking ahead, the implications of SpaceEvo reach far beyond its initial achievements. Its potential extends to applications for other crucial deployment metrics, such as energy and memory consumption, enhancing the sustainability of edge computing solutions.

We are exploring ways to adapt these methods to support diverse model architectures like transformers, further expanding SpaceEvo’s role in evolving deep learning model design and efficient deployment.

The post Efficient and hardware-friendly neural architecture search with SpaceEvo appeared first on Microsoft Research.

Read More

HoloAssist: A multimodal dataset for next-gen AI copilots for the physical world

This research paper was presented at the 2023 IEEE/CVF International Conference on Computer Vision (opens in new tab) (ICCV), a premier academic conference for computer vision.

When was the last time you were faced with a task you had no clue how to tackle? Maybe it was fixing a broken bike, replacing a printer toner, or making a cup of espresso? In such circumstances, your usual options might include reaching out to a knowledgeable friend or relative for assistance. Alternatively, you might resort to scouring the internet, conducting a web search, posing questions on online forums, or seeking out relevant instructional videos. But what if there were another option? What if you could turn to an AI assistant, or copilot, for help?

AI in the real world

Our daily lives are filled with a wide range of tasks, both for work and leisure, spanning the digital and physical realms. We often find ourselves in need of guidance to learn and carry out these tasks effectively. Recent advances in AI, particularly in the areas of large language and multimodal models, have given rise to intelligent digital agents. However, when it comes to the physical world, where we perform a significant number of our tasks, AI systems have historically faced greater challenges. 

A longstanding aspiration within the AI community has been to develop an interactive AI assistant capable of perceiving, reasoning, and collaborating with people in the real world. Whether it’s scenarios like autonomous driving, robot navigation and manipulation, hazard detection in industrial settings, or support and guidance for mixed-reality tasks, progress in physical activities has been slower and more incremental compared with their fully digital counterparts.

The promise and challenge of interactive AI “copilots”

There is great potential for developing interactive AI copilots to assist people with real-world tasks, but there are also obstacles. The key challenge is that current state-of-the-art AI assistants lack firsthand experience in the physical world. Consequently, they cannot perceive the state of the real world and actively intervene when necessary. This limitation stems from a lack of training on the specific data required for perception, reasoning, and modeling in such scenarios. In terms of AI development, there’s a saying that “data is king.” This challenge is no exception. To advance interactive AI agents for physical tasks, we must thoroughly understand the problem domain and establish a gold standard for copilots’ capabilities.

A new multimodal interactive dataset

As a first step in this direction, we are excited to share our paper, “HoloAssist: an Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World (opens in new tab),” presented at ICCV 2023 (opens in new tab). HoloAssist is a large-scale egocentric, or first-person, human interaction dataset, where two people collaboratively execute physical manipulation tasks. A task performer executes a task while wearing a mixed-reality headset that captures seven synchronized data streams, as shown in Figure 1. Simultaneously, a task instructor observes the performer’s first-person video feed in real time and offers verbal instruction. 

An image illustrating the setup for the HoloAssist dataset, which features a two-person interactive assistive task-completion setting.  A task-performer is wearing a mixed reality headset while an instructor watches the first-person video feed and provides instructions.  Eight modalities are captured, RGB, eye gaze, hand pose, head pose, depth, IMU, audio, text transcription.
Figure 1: HoloAssist features a two-person interactive assistive task-completion setting.

HoloAssist contains a large collection of data, comprising 166 hours of recordings involving 222 diverse participants. These participants form 350 distinct instructor-performer pairs carrying out 20 object-centric manipulation tasks. Video 1 shows how tasks are recorded, while Figure 2 provides a task breakdown. The objects range from common electronic devices to rarer items found in factories and specialized labs. The tasks are generally quite demanding, often requiring instructor assistance for successful completion. To provide comprehensive insights, we’ve captured seven different raw sensor modalities: RGB, depth, head pose, 3D hand pose, eye gaze, audio, and IMU. These modalities help in understanding human intentions, estimating world states, predicting future actions, and more. Finally, the eighth modality is an augmentation with third-person manual annotations, consisting of a text summary, intervention types, mistake annotations, and action segments, as illustrated in Figure 3.

Video 1: A sampling of task recordings showcasing color and depth, two of the eight modalities.
Data distribution captured in HoloAssist. On the left, the number of sessions per activity, and on the right, the total length of sessions in minutes. There are 20 tasks: GoPro, Nintendo Switch, DSLR, portable printer, computer, Nespresso machine, standalone printer, big coffee machine, IKEA furniture (stool, utility cart, tray table, nightstand), NavVis laser scanner, ATV motorcycle, wheel belt, and circuit breaker.  There are between 25 and 180 sessions per activity and sessions range from 47 to 1390 minutes.
Figure 2: Data distribution captured in HoloAssist. On the left, the number of sessions per activity. On the right, the total session length in minutes.
HoloAssist includes action and conversational annotations and provides summaries of videos indicating mistakes and interventions during tasks. Each action is tagged with a “mistake” or “correct” attribute, while spoken statements are labeled with intervention types.  The image shows examples of each of these.
Figure 3: HoloAssist includes action and conversational annotations, and it also provides summaries of videos indicating mistakes and interventions during tasks. Each action is tagged with a “mistake” or “correct” attribute, while spoken statements are labeled with intervention types.
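As a purely hypothetical sketch of how one synchronized sample might be organized in code, the structure below lists the eight modalities and annotation types described above. The field names and types are our own illustrative choices and do not reflect the dataset's actual file formats or schema.

```python
# Hypothetical sketch of one synchronized HoloAssist sample. Field names and types are
# illustrative only; they are not the dataset's actual schema or file format.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class HoloAssistSample:
    rgb: np.ndarray                  # color frame from the headset camera
    depth: np.ndarray                # depth map
    head_pose: np.ndarray            # 6-DoF head pose
    hand_pose: np.ndarray            # 3D hand joint positions
    eye_gaze: np.ndarray             # gaze origin and direction
    audio: np.ndarray                # audio samples aligned to this frame
    imu: np.ndarray                  # accelerometer and gyroscope readings
    annotations: dict = field(default_factory=dict)  # text summary, action segments,
                                                     # mistake and intervention labels
```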

Towards proactive AI assistants

Our work builds on previous advancements in egocentric vision and embodied AI. Unlike earlier datasets, such as those listed in Table 1, HoloAssist stands out due to its multi-person, interactive task-execution setting. Human interaction during task execution provides a valuable resource for designing AI assistants that are anticipatory and proactive, capable of delivering precisely timed instructions grounded in the environment, in contrast with current “chat-based” AI assistants that wait for you to ask a question. This unique scenario is ideal for developing assistive AI agents and complements existing datasets, which contribute rich knowledge and representation.

Table 1: Comparison of nine related datasets and simulation platforms, covering each dataset’s setting, whether it is collaborative and interactive, whether it is instructional and procedural, and the number of hours of video. HoloAssist features a multi-person assistive setting, which is a unique addition to existing egocentric (first-person) datasets.

Finally, we evaluated the dataset’s performance on action classification and anticipation tasks, providing empirical results that shed light on the role of different modalities in various tasks. With this dataset, we introduce new tasks and benchmarks focused on mistake detection, intervention type prediction, and 3D hand pose forecasting, all crucial elements for developing intelligent assistants.
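
As a rough illustration of how the mistake-detection benchmark could be scored, the snippet below computes per-segment accuracy from predicted and ground-truth correctness labels. It is a minimal sketch assuming the labels “correct” and “mistake”, not the official evaluation code released with the dataset.

```python
from typing import List

def mistake_detection_accuracy(predictions: List[str], labels: List[str]) -> float:
    """Fraction of action segments whose predicted correctness matches the annotation.

    Both lists hold one label per action segment, assumed to be either
    "correct" or "mistake". This is a simplified stand-in for the
    benchmark's official metric.
    """
    assert len(predictions) == len(labels), "one prediction per annotated segment"
    matches = sum(p == y for p, y in zip(predictions, labels))
    return matches / len(labels)

# Example: a model that misses the mistake in the second segment.
print(mistake_detection_accuracy(
    ["correct", "correct", "mistake"],
    ["correct", "mistake", "mistake"],
))  # 0.666...
```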

Looking forward

This work represents an initial step in broader research exploring how intelligent agents can collaborate with humans on real-world tasks. We’re excited to share this work and our dataset with the community and anticipate numerous future directions, such as annotating object poses, investigating object-centric models of affordance and manipulation in AI assistance, and AI-assisted planning and state tracking, among others. We believe HoloAssist, along with its associated benchmarks and tools, will benefit future research on building powerful AI assistants for everyday real-world tasks. You can access the HoloAssist dataset and code on GitHub (opens in new tab).

Contributors

Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bohus, Ashley Feniello, Bugra Tekin, Felipe Vieira Frujeri, Marc Pollefeys

The post HoloAssist: A multimodal dataset for next-gen AI copilots for the physical world appeared first on Microsoft Research.

Read More

Intern Insights: Dr. Madeleine Daepp with Jennifer Scurrell and Alejandro Cuevas

Intern Insights: Dr. Madeleine Daepp with Jennifer Scurrell and Alejandro Cuevas


Every year, interns from academic institutions around the world apply and grow their knowledge as members of the research community at Microsoft. In this Microsoft Research Podcast series, these students join their internship supervisors to share their experience working alongside some of the leading researchers in their respective fields. 

In this episode, PhD students Jennifer Scurrell (opens in new tab) and Alejandro Cuevas (opens in new tab) talk to Senior Researcher Dr. Madeleine Daepp (opens in new tab). They discuss the internship culture at Microsoft Research, from opportunities to connect with researchers they admire over coffee to the teamwork they say helped make it possible for them to succeed in the fast-paced environment of industry, and the impact they hope to have with their work.

The post Intern Insights: Dr. Madeleine Daepp with Jennifer Scurrell and Alejandro Cuevas appeared first on Microsoft Research.

Read More

Accelerate Foundation Models Research: Supporting a global academic research ecosystem for AI

Accelerate Foundation Models Research: Supporting a global academic research ecosystem for AI


The latest advances in artificial intelligence have sparked broad public interest and excitement, and the sciences are no exception. Increasingly capable foundation models are fueling a fundamental shift in computing research, natural sciences, social sciences, and even computing education itself. As industry-led advances in AI continue to reach new heights, Microsoft Research believes that a vibrant and diverse research ecosystem is essential to realizing the promise of AI. This means ensuring that the academic research community, and especially researchers working outside computer science, can tap into these capabilities. Their depth and breadth of expertise across disciplines, cultures, and languages can contribute meaningfully to our ability to use AI to address some of the world’s greatest technical, scientific, and societal challenges.

To this end, Microsoft Research has established Accelerate Foundation Models Research (AFMR), a new initiative that brings together an interdisciplinary research community to pursue three goals:

  • Aligning AI with shared human goals, values, and preferences via research on models, which enhances safety, robustness, sustainability, responsibility, and transparency, while also exploring new evaluation methods to measure the rapidly growing capabilities of new models.
  • Improving human interactions via sociotechnical research, which enables AI to extend human ingenuity, creativity, and productivity, while also working to reduce inequities of access and ensure positive benefits for people and societies worldwide.
  • Accelerating scientific discovery in natural sciences through proactive knowledge discovery, hypothesis generation, and multiscale multimodal data generation.

AFMR is a global research network and a resource platform that enables researchers in computer science and many other disciplines to engage with some of the greatest technical and societal challenges of our time. This includes a grant program that provides access to state-of-the-art foundation models hosted through Microsoft Azure AI.

The goal is to foster more collaborations across disciplines, institutions, and sectors, and to unleash the full potential of AI for a wide range of research questions, applications, and societal contexts.

Following a successful pilot program and initial call for proposals (CFP), details of which are provided below, we are committed to continuing this work and expect to solicit additional proposals throughout the coming year. Visit the AFMR site to learn more about upcoming programs and events, read peer-reviewed work that has resulted from the program, and find resources to accelerate research and collaborations. 

Inspiring research in the era of AI

When ChatGPT was released in the fall of 2022, it quickly became clear that this new technology and tool would play a central role in AI computing research and applications.

“As a natural language processing (NLP) researcher, I was excited at first by ChatGPT’s potential to stimulate an AI revolution,” said Evelyne Viegas, senior director of research engagement at Microsoft Research. “Soon, I became concerned about a potential lack of access to this resource outside of industry, which could delay important progress in academic settings.”

When Microsoft enabled access to OpenAI models (Embeddings series, GPT-3.5-Turbo series, and GPT-4 series) via Azure AI services, it created an opportunity to engage with the academic community, learn about their needs and aspirations, and begin supporting their work. A team at Microsoft Research conducted a pilot program offering model access to a small number of participants, and the success of this effort inspired a broader and more sustained program.

Research topics undertaken as part of the pilot reflect the ambitions of AI research at Microsoft in understanding general AI, driving model innovation, ensuring social benefit, transforming scientific discovery, and extending human capabilities across different domains (e.g., astronomy, education, health, law, society).

Although the research supported by this pilot is still underway, the examples below illustrate the possibilities of opening access to leading-edge models to a diverse group of researchers:

Integrating ChatGPT into English as a Foreign Language (EFL) Writing Education – Korea Advanced Institute of Science and Technology (KAIST)

This project explores how students can use generative AI for interactive revision in EFL writing. Because the majority of KAIST courses are taught in English, the sooner non-native English speakers learn the language, the better they can participate in their classes. While earlier chatbots have been used for EFL, language learners found them unengaging. With Azure OpenAI Service, the KAIST team is gathering data to show how the unique capabilities of a GPT-4-based chatbot accelerate learning while making the learner’s experience more engaging.
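
A minimal sketch of how such a revision chatbot might call a GPT-4 deployment through Azure OpenAI Service is shown below. The endpoint, deployment name, and system prompt are placeholder assumptions for illustration; the KAIST team’s actual implementation is not described in this post.

```python
import os
from openai import AzureOpenAI  # assumes the openai Python package, v1.x

# Placeholders: supply your own Azure OpenAI resource and GPT-4 deployment.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def revision_feedback(draft: str) -> str:
    """Ask a GPT-4 deployment for interactive feedback on an EFL student's draft."""
    response = client.chat.completions.create(
        model="gpt-4",  # the Azure deployment name, not the base model ID
        messages=[
            {"role": "system",
             "content": "You are a supportive English writing tutor. "
                        "Point out grammar issues and suggest clearer phrasing, "
                        "but keep the student's ideas intact."},
            {"role": "user", "content": draft},
        ],
    )
    return response.choices[0].message.content

print(revision_feedback("Yesterday I go to library for study my homeworks."))
```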

Lightweight Adaptation of LLMs for Healthcare Applications – Stanford University

This work focuses on accelerating the task of report summarization for radiologists to improve workflow and decrease the time needed to generate an accurate report. It uses domain adaptation via pretraining on biomedical text, or clinical text and discrete prompting or fine-tuning. Initial results are promising, showing the added value of using foundation models for some clinical tasks.

AI-Based Traffic Monitoring System using Physics-Informed Neural Networks and GPT Models – North Carolina A&T State University

Researchers are creating a traffic monitoring system using data collected from unmanned aerial vehicles (UAVs) to fine-tune foundation models for video analysis and traffic state estimation. This work can directly benefit transportation agencies and city planners, helping them understand traffic patterns, congestion, and safety hazards.

Forging New Horizons in Astronomy – Harvard University

This project seeks to enhance human interaction with astronomy literature by harnessing the capabilities of large language models (LLMs), particularly GPT-4. The work employs in-context prompting techniques to expose the model to astronomy papers, building an astronomy-focused chat application that engages the broader community.

Expanding AFMR

Much experimentation remains to be done with foundation models. The AFMR CFP invited the community to develop proposals focused on the goals and questions below:

  • Aligning AI systems with human goals and preferences
  • Advancing beneficial applications of AI
  • Accelerating scientific discovery in the natural and life sciences

The response to the AFMR Fall CFP has been phenomenal, with close to 400 proposals from 170 universities across 33 countries.

“Research undertaken by the principal investigators brings the promise to advance research across a greater breadth of research pursuits, application domains, and societal contexts than we could have imagined,” Viegas said. “It covers a vast range of scientific and sociotechnical topics: creativity, culture, economy, education, finance, health, causality, evaluation, augmentation and adaptation, multimodal, responsible AI, robotics, scientific discovery, software and society. It is inspiring to see experts from different countries with different cultures, languages, institutions, and departments, including computer science, social science, natural sciences, humanities, medicine, music, all come together to work on democratizing AI and work on solving some of the greatest technical and societal challenges of tomorrow.”

The post Accelerate Foundation Models Research: Supporting a global academic research ecosystem for AI appeared first on Microsoft Research.

Read More

Research Focus: Week of September 25, 2023

Research Focus: Week of September 25, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.


NEW RESEARCH

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

Large Language Model (LLM) inference consists of two distinct phases: a prefill phase, which processes the input prompt, and a decode phase, which generates output tokens autoregressively. While the prefill phase effectively saturates graphics processing unit (GPU) compute at small batch sizes, the decode phase results in low compute utilization because it generates one token at a time per request. The varying prefill and decode times also lead to imbalance across micro-batches when using pipeline parallelism, causing further inefficiency due to pipeline bubbles.

In a new paper: SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills, researchers from Microsoft present a solution to these challenges that yields significant improvements in inference performance across models and hardware. SARATHI employs chunked-prefills, which splits a prefill request into equal sized chunks, and decode-maximal batching, which constructs a batch using a single prefill chunk and populates the remaining slots with decodes. Chunked-prefills allow constructing multiple decode-maximal batches from a single prefill request, maximizing coverage of decodes that can piggyback. Furthermore, the uniform compute design of these batches ameliorates the imbalance between micro-batches, significantly reducing pipeline bubbles.
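
The snippet below sketches the batching idea at a high level: a long prompt’s prefill is split into fixed-size chunks, and each chunk is paired with ongoing decode requests to keep every batch full. It is a schematic illustration of the scheduling policy described in the paper, not the authors’ implementation; the chunk size, batch size, and request IDs are made up for the example.

```python
from typing import List, Tuple

def chunk_prefill(prompt_tokens: List[int], chunk_size: int) -> List[List[int]]:
    """Split one prefill request into equal-sized chunks (last chunk may be shorter)."""
    return [prompt_tokens[i:i + chunk_size]
            for i in range(0, len(prompt_tokens), chunk_size)]

def decode_maximal_batches(prompt_tokens: List[int],
                           pending_decodes: List[int],
                           batch_size: int,
                           chunk_size: int) -> List[Tuple[List[int], List[int]]]:
    """Pair each prefill chunk with decode requests so every batch stays full.

    Each batch holds exactly one prefill chunk plus up to (batch_size - 1)
    decode request IDs that piggyback on it, keeping GPU utilization high
    during the otherwise memory-bound decode phase.
    """
    batches = []
    decode_iter = iter(pending_decodes)
    for chunk in chunk_prefill(prompt_tokens, chunk_size):
        decodes = [rid for _, rid in zip(range(batch_size - 1), decode_iter)]
        batches.append((chunk, decodes))
    return batches

# Example: a 1,024-token prompt split into 256-token chunks, each batched
# with up to 7 decode requests drawn from a pool of 40 pending decodes.
batches = decode_maximal_batches(list(range(1024)), list(range(100, 140)), 8, 256)
print(len(batches), [len(d) for _, d in batches])  # 4 chunks, 7 decodes each
```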


NEW RESEARCH

DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

Controllable video generation has gained significant attention in recent years, but two main limitations persist. First, most existing works focus on text-, image-, or trajectory-based control in isolation, which prevents fine-grained control over video content. Second, trajectory control research is still in its early stages, with most experiments conducted on simple datasets like Human3.6M (opens in new tab). This constraint limits models’ ability to process open-domain images and effectively handle complex curved trajectories.  

In a new paper: DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory, researchers from Microsoft propose an open-domain diffusion-based video generation model. To tackle the insufficient control granularity of existing works, DragNUWA simultaneously introduces text, image, and trajectory information to provide fine-grained control over video content from semantic, spatial, and temporal perspectives. To address the limited open-domain trajectory control in current research, the researchers propose trajectory modeling with three components: a trajectory sampler (TS) to enable open-domain control of arbitrary trajectories, multiscale fusion (MF) to control trajectories at different granularities, and an adaptive training (AT) strategy to generate consistent videos that follow their trajectories. Their experiments demonstrate DragNUWA’s superior fine-grained control in video generation.
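
As a schematic illustration of the general idea of fusing heterogeneous conditions, the sketch below projects text, image, and trajectory embeddings into a shared conditioning space that a diffusion denoiser could attend to. The module names, dimensions, and simple additive fusion are assumptions for illustration only and do not reflect DragNUWA’s actual architecture.

```python
import torch
import torch.nn as nn

class FusedCondition(nn.Module):
    """Illustrative fusion of text, image, and trajectory conditions.

    Each encoder here is a stand-in (a simple linear projection); a real
    model would use far richer encoders and a multiscale fusion scheme.
    """
    def __init__(self, text_dim=512, image_dim=512, traj_dim=64, cond_dim=768):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, cond_dim)
        self.image_proj = nn.Linear(image_dim, cond_dim)
        self.traj_proj = nn.Linear(traj_dim, cond_dim)

    def forward(self, text_emb, image_emb, traj_emb):
        # Sum the projected conditions into one vector that a diffusion
        # denoiser could use as conditioning at every step.
        return (self.text_proj(text_emb)
                + self.image_proj(image_emb)
                + self.traj_proj(traj_emb))

# Example: a batch of 2 conditioning vectors from random placeholder embeddings.
fuse = FusedCondition()
cond = fuse(torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 64))
print(cond.shape)  # torch.Size([2, 768])
```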

DragNUWA is purely a research project and there are no current plans to incorporate DragNUWA into a product. Any further research will continue to follow Microsoft AI principles.

NEW RESEARCH

Seeing through the Brain: Image Reconstruction of Visual Perception from Human Brain Signals

Understanding cortical responses to human visual perception has emerged as a research hotspot. Yet how human visual perception is intertwined with cognition remains a mystery. Thanks to recent advances in both neuroscience and artificial intelligence, researchers can now record visually evoked brain activity and mimic visual perception through computational approaches. 

In a new paper: Seeing through the Brain: Image Reconstruction of Visual Perception from Human Brain Signals, researchers from Microsoft reconstruct observed images from portably accessible brain signals, i.e., electroencephalography (EEG) data. Because EEG signals are dynamic time series and notoriously noisy, processing them and extracting useful information requires dedicated effort. The researchers propose a comprehensive pipeline, named NeuroImagen, that incorporates a novel multi-level perceptual information decoding to draw multi-grained and heterogeneous outputs from the given EEG data. A pretrained latent diffusion model then leverages the extracted semantic information to reconstruct high-resolution visual stimulus images. The experimental results illustrate the effectiveness of the image reconstruction and the superior quantitative performance of the proposed method.
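
As a rough illustration of the pipeline’s shape, the sketch below decodes a semantic embedding from an EEG window, which a pretrained latent diffusion model (not shown) would then use as conditioning to reconstruct the viewed image. The module names, dimensions, and layer choices are assumptions for illustration, not the paper’s implementation.

```python
import torch
import torch.nn as nn

class EEGSemanticDecoder(nn.Module):
    """Toy stand-in for decoding semantic information from EEG.

    Input: (batch, channels, timesteps) EEG windows.
    Output: (batch, embed_dim) semantic embeddings used to condition image generation.
    """
    def __init__(self, channels=64, embed_dim=768):
        super().__init__()
        self.temporal = nn.Conv1d(channels, 128, kernel_size=7, stride=4)
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.head = nn.Linear(128, embed_dim)

    def forward(self, eeg):
        h = torch.relu(self.temporal(eeg))  # extract temporal features per channel group
        h = self.pool(h).squeeze(-1)        # collapse the time dimension
        return self.head(h)                 # project into a semantic embedding space

decoder = EEGSemanticDecoder()
semantic = decoder(torch.randn(4, 64, 512))  # four EEG windows of 512 timesteps
print(semantic.shape)                        # torch.Size([4, 768])

# A pretrained latent diffusion model would then take `semantic` as its
# conditioning input and generate the reconstructed visual stimulus image.
```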

The post Research Focus: Week of September 25, 2023 appeared first on Microsoft Research.

Read More