Microsoft at VL/HCC 2023: Focus on co-audit tools for spreadsheets

Microsoft at VL/HCC 2023: Focus on co-audit tools for spreadsheets

These research papers were presented at the IEEE Symposium on Visual Languages and Human-Centric Computing (opens in new tab) (VL/HCC 2023), a premier forum for design, theory, and application of computing technologies for programming, modelling, and communication.

Microsoft at VL/HCC 2023: Focus on co-audit tools for spreadsheets

Large language models (LLMs) have revolutionized the way novice programmers and everyday computer users tap into the capabilities of natural language for programming. Among the tools used in this context, spreadsheets stand out as the preferred choice. The integration of LLMs into spreadsheets promises to substantially enhance their functionality and the user experience. At the same time, it’s well known that spreadsheet users commonly though inadvertently introduce errors (opens in new tab), and this can carry significant risks. For example, in 2010, a spreadsheet used in a Harvard economic analysis (opens in new tab) to inform austerity measures imposed on Greece was discovered to contain multiple errors (opens in new tab).

Microsoft is actively pursuing (opens in new tab) research focused on developing co-auditing tools and techniques, with an initial emphasis on spreadsheets. These tools are designed to help users verify the results generated by LLMs. At VL/HCC 2023 (opens in new tab), we introduce two new spreadsheet tools, ColDeco and FxD, specifically built to help users thoroughly examine and debug their programs within spreadsheets. Additionally, it is worth mentioning that the paper on FxD was awarded the Honorable Mention (opens in new tab).

ColDeco: An end-user inspection tool

Working with tables in spreadsheets is a common task, and the ability to add a calculated column can be incredibly useful. A calculated column not only adds information but also facilitates tasks like filtering and sorting. Generative AI can enable users to create sophisticated calculated columns in tables. However, verification of AI-generated code in this scenario is crucial because AI can misinterpret the user’s intent or overlook important data. 

In our paper, “ColDeco: An End User Spreadsheet Inspection Tool for AI-Generated Code,” we introduce ColDeco, a no-code inspection tool for calculated columns. ColDeco uses helper columns and row grouping to help users understand how an AI-generated column works and locate any errors. 

To describe how ColDeco works, we’ll use an example table containing people’s first, middle, and last names in separate columns. Our user asks the system to “create a column called ‘Abbreviation’ that takes the first letter of each part of the name.” In this example, there’s an error in the generated code that fails to handle rows with no middle names, causing some Abbreviation cells to be empty.  

First, the model generates a program that computes an abbreviation for each row and adds it to the new Abbreviation column. ColDeco’s interface automatically opens as a side panel, as shown in Figure 1. 

The Inspect Columns view displays any generated columns, accompanied by a natural language description of the generated code. The Inspect Rows view displays a subset of the table, organized by behavior. The Row Inspection view uses dataflow analysis to group rows, highlighting key distinct execution behaviors. In our example, this view quickly draws the user’s attention to the two rows that fail to calculate an abbreviation.

Two graphics. The first graphic depicts a table with columns: “First Name”, “Middle Name”, “Last Name”, “DoB”, and “Abbreviation”. There are 11 rows. As examples, row 3 contains the information: First Name: Christopher, Middle Name: Michael, Last Name: Fleming, DoB: 11/5/1995, Abbreviation: CMF. Row 9 contains the information: First Name: William, Middle name is empty, Last Name: Smith, DoB: 6/3/1968, Abbreviation is empty. The second graphic depicts a side panel with two sections. The first section is the Inspect Columns view (labelled 1a). A single column named “Abbreviation” and a corresponding description is shown. The second section is the Inspect Rows view (labelled 1b). It contains a table with columns “Index”, “First Name”, “Middle Name”, “Last Name”, and “Abbreviation”. Within the table there are two groups of rows. The first group has an example row: Index: 4716, First Name: William, Middle Name is empty, Last Name: Smith, Abbreviation is empty. The second group has an example row: Index: 8984, First Name: Christopher, Middle Name: Michael, Last Name: Flemming, Abbreviation: CMF.
Figure 1. The initial view of the ColDeco side panel. An Abbreviation program is generated by the AI and added to the table as a new column. The Inspect Columns view (1a) shows the column generated by the AI, including a description of how the code works. The Inspect Rows view (1b) groups rows into different behaviors, indicating that there are errors in two rows.

If our user wants to investigate an error, they can expand a generated column into multiple helper columns, illustrated in Figure 2. These helper columns are visible in both the table (2a) and the side panel (2b), and they show the intermediate values. The user can now see that the missing abbreviations are caused by an error that occurred when the system tried to take the first and middle initials.

Two graphics. The first graphic (labelled 2a) depicts a table with 4 columns: “DoB”, “text concatenation”, “1st letter of Last Name”, “Abbreviation”. As examples, row 3 contains the information: DoB: 11/5/1995, text concatenation: CM, 1st letter of Lan Name: F, Abbreviation: CMF. Row 9 contains the information DoB: 6/3/1968, text concatenation: is empty, 1st letter of Lan Name: S, Abbreviation: is empty. The second graphic (labelled 2b) depicts a side panel showing the Inspect Columns view. A tree view shows “Abbreviation” as the root with two children: “1st letter of Last Name” and “text concatenation”, corresponding to the columns in the table. Each column in the tree view has a corresponding description.
Figure 2. The ColDeco side panel after a user expands the Abbreviation column into two additional helper columns. Each additional column has a description.

Spotlight: On-Demand EVENT

Microsoft Research Summit 2022

On-Demand
Watch now to learn about some of the most pressing questions facing our research community and listen in on conversations with 120+ researchers around how to ensure new technologies have the broadest possible benefit for humanity.


FxD: A functional debugger 

Not every spreadsheet task involves generating a new table column. Moreover, many users are already well acquainted with spreadsheet formulas. This brings us to our second tool, a spreadsheet formula debugger, introduced in the paper, “FxD: a functional debugger for dysfunctional spreadsheets.” 

We employed a user-centered approach when designing FxD, extensively reviewing existing literature on functional programming debuggers. This informed the four key features we implemented into FxD: 

Live debugging. FxD dynamically updates as a user edits a formula, allowing for quick formula modification and exploration (Figure 3, image 1).

Hybrid formula tracing. The debugger combines step-based evaluation (Figure 3, image 1) with tree-based derivations (Figure 3, image 3) to provide a step-by-step breakdown of the formula. Substeps are hidden behind expandable cards to prevent user overload.  

Subformula coloring. Color coding highlights changes in a formula as FxD evaluates it. This facilitates the tracking of these updates when a user hovers over a step (Figure 3, images 2 and 4). 

Information inspector. Context-aware tooltips improve the user experience. One example is table previews when a user hovers over ranges in functions like VLOOKUP. These tooltips offer insights into the range, surrounding context, and the lookup column used by the containing function (Figure 3, image 3).

Four graphics, each graphic describing a different feature of the debugger. The formula being debugged is ‘=IF(G3 < (B1 + B2) * (1 + B3), “low”, “high”)’. The first graphic (labelled 1) shows the formula and its evaluation trace. Each step in the trace shows the formula with some part evaluated. The last step is the value “low” which is the result of the formula. The second graphic (labelled 2) shows a step being highlighted. The step has a before formula and after formula, with multiple parts evaluated. Each part that is evaluated is highlighted with the same color in the “before” and “after” formula. The third graphic (labelled 3) shows a cell range being hovered on and a range information inspector being shown. The inspector shows a preview of the grid for the corresponding range. The fourth graphic (labelled 4) shows a step being highlighted and an evaluated subpart being hovered over. The user hovers over the value 15 in the “after” formula and the corresponding formula “B1 + B2” in the “before” formula is underlined.
Figure 3. The FxD debugger. Image 1 shows the edited formula and evaluation steps. The steps update as a user edits the formula. Image 2 shows subformula coloring, which highlights a subformula and its value upon hovering. Image 3 shows an information inspector that previews the range referenced in a formula. Image 4 shows the concurrent evaluation of multiple subformulas. When the user hovers over a value, the corresponding subformula is underlined.

Growing importance of AI code verification 

As the complexity of AI-generated code rises, the need for tools to verify accuracy becomes increasingly critical. In response, we developed these two co-audit tools tailored to spreadsheets. Moving forward, a key consideration lies in managing the complexity of these tools. Our vision is that debugging tools will become infused with generative AI to assist users in both generating and verifying workflows. 

Review our paper on co-auditing in general to learn more.

The post Microsoft at VL/HCC 2023: Focus on co-audit tools for spreadsheets appeared first on Microsoft Research.

Read More

Abstracts: October 9, 2023

Abstracts: October 9, 2023

Microsoft Research Podcast - Abstracts

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Dr. Sheng Zhang, a Senior Researcher at Microsoft Research, joins host Dr. Gretchen Huizinga to discuss “UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition.” In this paper, Zhang and his coauthors present mission-focused instruction tuning, a method for distilling large language models into smaller, more efficient ones for a broad application class. Their UniversalNER models achieved state-of-the-art performance in named entity recognition, an important natural language processing (NLP) task. Model distillation has the potential to make NLP and other capabilities more accessible, particularly in specialized domains such as biomedicine, which could benefit from more resource-efficient and transparent options. 


Learn more:

UniversalNER project website with demo (opens in new tab)

Code on GitHub (opens in new tab)

Dataset and models on Hugging Face (opens in new tab)

Transcript

[MUSIC PLAYS]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract!—of their new and noteworthy papers. Today, I’m talking to Dr. Sheng Zhang, a Senior Researcher at Microsoft Research. Dr. Zhang is coauthor of a paper called “UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition,” and you can read this paper now on arXiv. Sheng Zhang, thanks for joining us on Abstracts!

SHENG ZHANG: Thanks for having me.


HUIZINGA: So in a few sentences, give us a brief introduction or overview of the issue or problem that your research addresses and why we should care about it.

ZHANG: Sure. Well, our research addresses the challenge of efficiently replicating the capabilities of large language models for targeted applications. Particularly, we focus on named entity recognition, or NER, and people should care because this work aims to create more cost-effective and transparent models that can recognize a wide range of entity types across various domains, which is crucial for knowledge extraction and has numerical practical applications.

HUIZINGA: So how does your approach, your particular approach, build on or differ from what’s been done previously in this field?

ZHANG: Well, our approach builds on the idea of instruction tuning, which is used to fine-tune language models to follow human instructions. However, unlike existing work that focuses on tuning models into replicas of large language models in every aspect, we propose a method called mission-focused instruction tuning, where we train a smaller model to specifically excel in a broad application class, such as open information extraction. And in our case study, we focus on named entity recognition, NER, and we demonstrate how targeted distillation from large language models can maximize their capabilities for this application. At the same time, the smaller model, the student model, also preserves generalizability across different semantic types and domains. This approach differs from previous work also because we emphasize the importance of increasing the diversity of input data and generating more comprehensive coverage of entity types, which ultimately leads to better performance in the targeted application.

HUIZINGA: OK. And in the paper, you talk about student models trailing the original large language models by large margins in what you call downstream applications. Give me an example of what downstream application looks like.

ZHANG: Yeah. So we here specifically focus on named entity recognition. That is, identifying named entities in a written text.

HUIZINGA: Ah …

ZHANG: So there’s various types of named entities so the canonical ones, like person, geographic location, organization … And people have, you know, various needs. They can go beyond those coarse-grained types. They can go into very fine-grained types, like athlete or politician …

HUIZINGA: Wow …

ZHANG: … or even, you know, finer-grain types. And you cannot like predefine what types will be considered in your task. That’s why we care about this universal concept of named entity recognition.

HUIZINGA: Well, let’s talk about methodology for a bit. What kind of research methodology did you use, and how did you conduct this research?

ZHANG: We developed a general recipe for targeted distillation from large language models, and in this case, we applied it to open NER. And our methodology consists of two main steps: data construction and mission-focused instruction tuning. For data construction, we sampled inputs from a large corpus across diverse domains, and then we used a large language model, ChatGPT, to annotate entity mentions and their associated entity types in the sampled inputs. This process allowed us to create a dataset with wide coverage of entity types. For mission-focused instruction tuning, we fine-tuned smaller models using our constructed dataset in a conversational-style format. For each entity type in the output, we transformed it into a natural language query and tuned the model to generate structured outputs that contain all entities of that type in the input passage. We also incorporated negative sampling to account for entity types not mentioned in that passage. And besides these two main steps, our research also involved assembling the largest-to-date, and most diverse, NER benchmark for evaluation. We compared the performance of our targeted distillation approach with other state-of-the-art models to demonstrate the effectiveness of our methodology.

HUIZINGA: OK, so you talk about NER as a case study, and you had 43 datasets and nine domains. Give me an example of some of those domains that you pulled from.

ZHANG: Yeah. So one very, you know, typical domain is like news, right. We read news every day, and the news mentions about, you know, people, events, and location. So that’s like a very common domain. And there are other very interesting domains like code. People also write code, and the computer can understand code, but a person would also want to understand code in some different way. So if you have like a code-specific named entity recognition capability, that would be awesome for, you know, some people that want to understand what’s happening in the code.

HUIZINGA: Right. And, and you mentioned programing, or code, but I also see in the paper biomedicine on one kind of complex and academic end and social media on another. So those are wildly different domains that you pulled from. Did you do that for a reason, that spectrum of different kinds of data?

ZHANG: Yes. The reason is that, you know, for some high-value domains like biomedicine, it’s quite expensive to annotate some data to train your model like that. So traditionally, people will have to hire an expert to do that. That is quite expensive and not scalable. And here, in the UniversalNER paper, we propose a way to distill that specific domain knowledge from the large language model. So the whole process is automatic. And the resulting model, you can see, it does pretty well, and maybe equally well, on the model that’s based on, you know, human expert–annotated corpus.

HUIZINGA: So after all this, a research paper presents findings. I imagine you had some interesting discoveries in, in this study. What were your major findings?

ZHANG: Yes. Our major findings were that the targeted distillation approach, specifically here the UniversalNER model we developed, it achieved state-of-the-art performance in named entity recognition across a wide range of entity types and domains. And when we compared it to other models like Alpaca, Vicuna, and InstructUIE, UniversalNER significantly outperformed them in terms of F1 score. This demonstrates the effectiveness of mission-focused instruction tuning for creating more cost-effective and transparent models that can excel in targeted applications such as open NER.

HUIZINGA: So let’s talk a little bit more about real-world impact. Uh, we’ve already discussed a little bit about that. But how would you say, based on these findings, that this impacts the real world and how people will use this?

ZHANG: Yeah, absolutely. I would say our work is very significant in terms of real-world impact because, first of all, NER is a fundamental task in natural language processing, and it plays a crucial role in knowledge extraction, information retrieval, and data mining. And by developing a more cost-effective and transparent model like UniversalNER, which can recognize a wide range of entity types and domains, we enable better performance in these downstream applications. And like I said, this is particularly important in high-value domains, such as biomedicine, where you know specialized expertise is required for annotation and the new entity types keep emerging. Our approach can help save time and resources for effectively recognizing these new entity types without the need for extensive annotated data. And secondly, our work can have a broader impact as it represents a general recipe for targeted distillation from large language models, and this approach can be applied to other application classes, such as, you know, open relation extraction. And this allows researchers and the practitioner to create much smaller models that can be more efficient and transparent while maintaining high performance in their targeted tasks.

HUIZINGA: If there was one thing you want our listeners to take away from this work and you could distill that into a short take, what would it be?

ZHANG: Mm hmm. One key takeaway from our work is that targeted distillation from large language models using our mission-focused instruction tuning can lead to more cost-effective and transparent models that excel in a broader application class. And our application demonstrated that it is possible to harness the capabilities of large language models and distill them into much smaller models that not only maintain generalizability across semantic types and domains but also surpass the performance of their larger counterparts in the targeted application. And this opens up new avenues for research and practical application in various fields, making knowledge extractions and the natural language processing tasks more efficient and accessible.

HUIZINGA: It sounds very promising, and it sounds like you’re excited about it.

ZHANG: Yeah, I’m pretty excited!

HUIZINGA: Well then tell us, given this new vista that you’ve opened up with this UniversalNER, what unanswered questions or unsolved problems still remain in this area, and what’s next on your research agenda?

ZHANG: Yeah. Our work demonstrates the effectiveness of targeted distillation for open NER, but several unanswered questions remain. And I would say the first one is adapting the approach to other application classes. Our method is a general recipe for targeted distillation, and it would be interesting to explore its effectiveness in other broader application classes, such as open relation extraction. And the second one is handling label conflicts and dataset-specific definitions. So in our work, we propose a dataset-specific instruction tuning template to address label conflicts. But more research is needed to better understand and develop methods for harmonizing discrepancies in label definitions across datasets. And the last one is exploring more efficient data construction methods. We used ChatGPT for data construction, but, you know, alternative approaches could be explored to generate more diverse and comprehensive datasets for mission-focused instruction tuning. And as for our research agenda, we plan to continue exploring targeted distillation techniques and apply them to other application classes, as well as investigate ways to improve data construction for better performance and efficiency in real-world tasks.

HUIZINGA: Sounds like you got your work cut out for you.

ZHANG: Yes. [LAUGHS] Thank you.

HUIZINGA: Sheng Zhang, thanks for joining us today. And to our listeners, thanks for tuning in. If you’re interested in learning more about this paper, you can find a link at aka.ms/Abstracts, or you can read the paper on arXiv. See you next time on Abstracts!

The post Abstracts: October 9, 2023 appeared first on Microsoft Research.

Read More

Efficient and hardware-friendly neural architecture search with SpaceEvo

Efficient and hardware-friendly neural architecture search with SpaceEvo

This research paper was presented at the 2023 IEEE/CVF International Conference on Computer Vision (opens in new tab) (ICCV), a premier academic conference for computer vision.

ICCV 2023: SpaceEvo

In the field of deep learning, where breakthroughs like the models ResNet (opens in new tab) and BERT (opens in new tab) have achieved remarkable success, a key challenge remains: developing efficient deep neural network (DNN) models that both excel in performance and minimize latency across diverse devices. To address this, researchers have introduced hardware-aware neural architecture search (NAS) to automate efficient model design for various hardware configurations. This approach involves a predefined search space, search algorithm, accuracy estimation, and hardware-specific cost prediction models.

However, optimizing the search space itself has often been overlooked. Current efforts rely mainly on MobileNets-based search spaces designed to minimize latency on mobile CPUs. But manual designs may not always align with different hardware requirements, limiting their suitability for a diverse range of devices.

In the paper, “SpaceEvo: Hardware-Friendly Search Space Design for Efficient INT8 Inference (opens in new tab),” presented at ICCV 2023, (opens in new tab) we introduce SpaceEvo, a novel method that automatically creates specialized search spaces optimized for efficient INT8 inference on specific hardware platforms. What sets SpaceEvo apart is its ability to perform this design process automatically, creating a search space tailored for hardware-specific, quantization-friendly NAS.

Microsoft Research Podcast

Collaborators: Holoportation™ communication technology with Spencer Fowers and Kwame Darko

Spencer Fowers and Kwame Darko break down how the technology behind Holoportation and the telecommunication device being built around it brings patients and doctors together when being in the same room isn’t an easy option and discuss the potential impact of the work.


Notably, SpaceEvo’s lightweight design makes it ideal for practical applications, requiring only 25 GPU hours to create a hardware-specific solution and making it a cost-effective choice for hardware-aware NAS. This specialized search space, with hardware-preferred operators and configurations, enables the exploration of larger, more efficient models with low INT8 latency. Figure 1 demonstrates that our search space consistently outperforms existing alternatives in INT8 model quality. Conducting neural architecture searches within this hardware-friendly space yields models that set new INT8 accuracy benchmarks.

Figure1: The image displays 4 sub-figures, each illustrating model accuracy error distribution when sampling models within INT8 quantized latency at 10 ms on a VNNI CPU, 15 ms on a VNNI CPU, 10 ms on a Pixel 4 CPU, and 20ms on a Pixel CPU for various Search Spaces. Each sub-figure contains 4 – 5 curves, representing model accuracy error distributions from our search space, ProxylessNAS search space, MobileNetv3 search space, ResNet search space, and AttentiveNAS search space.  Our search space consistently delivers superior INT8 model populations, outperforming state-of-the-art alternatives under varying hardware and latency constraints.
Figure 1. Error distribution of INT8 quantized models across various NAS search spaces. Our search space consistently outperforms state-of-the-art alternatives in INT8 model quality.

On-device quantization latency analysis

We began our investigation by trying to understand INT8 quantized latency factors and their implications for search space design. We conducted our study on two widely used devices: an Intel CPU with VNNI instructions and onnxruntime support, and a Pixel 4 phone CPU with TFLite 2.7.

Our study revealed two critical findings:

  1. Both the choice of operator type and configurations, like channel width, significantly affect INT8 latency, illustrated in Figure 2. For instance, operators like Squeeze-and-Excitation and Hardswish, while enhancing accuracy with minimal latency, can lead to slower INT8 inference on Intel CPUs. This slowdown primarily arises from the added costs of data transformation between INT32 and INT8, which outweigh the latency reduction achieved through INT8 computation.
  2. Quantization efficiency varies among different devices, and preferred operator types can be contradictory.
Figure2: The image showcases a table (left) and a figure (right). The table on the left, labeled
Figure 2. Left: Selecting different operator types results in notably distinct quantized speed improvements. Right: Conv1x1 speed enhancements across various channel numbers.

Finding diverse, efficient quantized models with SpaceEvo

Unlike traditional architecture search, which aims to find the best single model, our objective is to uncover a diverse population of billions of accurate and INT8 latency-friendly architectures within the search space.

Drawing inspiration from neural architecture search, we introduced an evolutionary search algorithm to explore this quantization-friendly model population in SpaceEvo. Our approach incorporated three key techniques:

  1. The introduction of the Q-T score as a metric to measure the quantization-friendliness of a candidate search space, based on the INT8 accuracy-latency of top-tier subnets.
  2. Redesigned search algorithms that focus on exploring a collection of model populations (i.e., the search space) within the vast hyperspace, as illustrated in Figure 3. This is achieved through the “elastic stage,” which divides the search space into a sequence of elastic stages, allowing traditional evolution methods like aging evolution to explore effectively.
  3. A block-wise search space quantization scheme to reduce the training costs associated with exploring a search space that has a maximum Q-T score.

After discovering the search space, we employed a two-stage NAS process to train a quantized-for-all supernet over the search space. This ensured that all candidate models could achieve comparable quantized accuracy without individual fine-tuning or quantization. We utilized evolutionary search and nn-Meter (opens in new tab) for INT8 latency prediction to identify the best quantized models under various INT8 latency constraints. Figure 3 shows the overall design process.

Figure3: The image depicts a flowchart that outlines the complete SpaceEvo process and its application for NAS. Starting with a large hyperspace, an evolution search algorithm explores a candidate search space. A quality estimator then assesses its quality score based on INT8 latency and accuracy. This score is used as a reward for the algorithm, guiding further exploration until a suitable search space is found. A quantized-for-all supernet is then trained over this space, enabling hardware-aware NAS for deploying models within various INT8 latency constraints.
Figure 3: The complete SpaceEvo process and application for NAS

Extensive experiments on two real-world edge devices and ImageNet demonstrated that our automatically designed search spaces significantly surpass manually designed search spaces. Table 1 showcases our discovered models, SEQnet, setting new benchmarks for INT8 quantized accuracy-latency tradeoffs. 

(a) Results on the Intel VNNI CPU with onnxruntime
Model Top-1 Acc % Latency Top-1 Acc % FLOPs
INT8 INT8 Speedup FP32
MobileNetV3Small 66.3 4.4 ms 1.1x 67.4 56M
SEQnet@cpu-A0 74.7 4.4 ms 2.0x 74.8 163M
MobileNetV3Large 74.5 10.3 ms 1.5x 75.2 219M
SEQnet@cpu-A1 77.4 8.8 ms 2.4x 77.5 358M
FBNetV3-A 78.2 27.7 ms 1.3x 79.1 357M
SEQnet@cpu-A4 80.0 24.4 ms 2.4x 80.1 1267M
(b) Results on the Google Pixel 4 with TFLite
MobileNetV3Small 66.3 6.4 ms 1.3x 67.4 56M
SEQnet@pixel4-A0 73.6 5.9 ms 2.1x 73.7 107M
MobileNetV3Large 74.5 15.7 ms 1.5x 75.2 219M
EfficientNet-B0 76.7 36.4 ms 1.7x 77.3 390M
SEQnet@pixel4-A1 77.6 14.7 ms 2.2x 77.7 274M
Table 1. Our automated search spaces outperformed manual ones in ImageNet results on two devices. Speedup: INT8 latency compared with FP32 inference.

Potential for sustainable and efficient computing

SpaceEvo is the first attempt to address the hardware-friendly search space optimization challenge in NAS, paving the way for designing effective low-latency DNN models for diverse real-world edge devices. Looking ahead, the implications of SpaceEvo reach far beyond its initial achievements. Its potential extends to applications for other crucial deployment metrics, such as energy and memory consumption, enhancing the sustainability of edge computing solutions.

We are exploring adapting these methods to support diverse model architectures like transformers, further expanding its role in evolving deep learning model design and efficient deployment.

The post Efficient and hardware-friendly neural architecture search with SpaceEvo appeared first on Microsoft Research.

Read More

HoloAssist: A multimodal dataset for next-gen AI copilots for the physical world

HoloAssist: A multimodal dataset for next-gen AI copilots for the physical world

This research paper was presented at the 2023 IEEE/CVF International Conference on Computer Vision (opens in new tab) (ICCV), a premier academic conference for computer vision.

When was the last time you were faced with a task you had no clue how to tackle? Maybe it was fixing a broken bike, replacing a printer toner, or making a cup of espresso? In such circumstances, your usual options might include reaching out to a knowledgeable friend or relative for assistance. Alternatively, you might resort to scouring the internet, conducting a web search, posing questions on online forums, or seeking out relevant instructional videos. But what if there were another option? What if you could turn to an AI assistant, or copilot, for help?

AI in the real world

Our daily lives are filled with a wide range of tasks, both for work and leisure, spanning the digital and physical realms. We often find ourselves in need of guidance to learn and carry out these tasks effectively. Recent advances in AI, particularly in the areas of large language and multimodal models, have given rise to intelligent digital agents. However, when it comes to the physical world, where we perform a significant number of our tasks, AI systems have historically faced greater challenges. 

A longstanding aspiration within the AI community has been to develop an interactive AI assistant capable of perceiving, reasoning, and collaborating with people in the real world. Whether it’s scenarios like autonomous driving, robot navigation and manipulation, hazard detection in industrial settings, or support and guidance for mixed-reality tasks, progress in physical activities has been slower and more incremental compared with their fully digital counterparts.

Microsoft Research Podcast

Collaborators: Renewable energy storage with Bichlien Nguyen and David Kwabi

Dr. Bichlien Nguyen and Dr. David Kwabi explore their work in flow batteries and how machine learning can help more effectively search the vast organic chemistry space to identify compounds with properties just right for storing waterpower and other renewables.


The promise and challenge of interactive AI “copilots”

There is great potential for developing interactive AI copilots to assist people with real-world tasks, but there are also obstacles. The key challenge is that current state-of-the-art AI assistants lack firsthand experience in the physical world. Consequently, they cannot perceive the state of the real world and actively intervene when necessary. This limitation stems from a lack of training on the specific data required for perception, reasoning, and modeling in such scenarios. In terms of AI development, there’s a saying that “data is king.” This challenge is no exception. To advance interactive AI agents for physical tasks, we must thoroughly understand the problem domain and establish a gold standard for copilots’ capabilities.

A new multimodal interactive dataset

As a first step in this direction, we are excited to share our paper, “HoloAssist: an Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World (opens in new tab),” presented at ICCV 2023 (opens in new tab). HoloAssist is a large-scale egocentric, or first-person, human interaction dataset, where two people collaboratively execute physical manipulation tasks. A task performer executes a task while wearing a mixed-reality headset that captures seven synchronized data streams, as shown in Figure 1. Simultaneously, a task instructor observes the performer’s first-person video feed in real time and offers verbal instruction. 

An image illustrating the setup for the HoloAssist dataset, which features a two-person interactive assistive task-completion setting.  A task-performer is wearing a mixed reality headset while an instructor watches the first-person video feed and provides instructions.  Eight modalities are captured, RGB, eye gaze, hand pose, head pose, depth, IMU, audio, text transcription.
Figure 1: HoloAssist features a two-person interactive assistive task-completion setting.

HoloAssist contains a large collection of data, comprising 166 hours of recordings involving 222 diverse participants. These participants form 350 distinct instructor-performer pairs carrying out 20 object-centric manipulation tasks. Video 1 shows how tasks are recorded, while Figure 2 provides a task breakdown. The objects range from common electronic devices to rarer items found in factories and specialized labs. The tasks are generally quite demanding, often requiring instructor assistance for successful completion. To provide comprehensive insights, we’ve captured seven different raw sensor modalities: RGB, depth, head pose, 3D hand pose, eye gaze, audio, and IMU. These modalities help in understanding human intentions, estimating world states, predicting future actions, and more. inally, the eighth modality is an augmentation with third-person manual annotations, consisting of a text summary, intervention types, mistake annotations, and action segments, as illustrated in Figure 3.

Video 1: A sampling of task recordings showcasing color and depth, two of the eight modalities.
Data distribution captured in HoloAssist. On the left, the number of sessions per activity, and on the right, the total length of sessions in minutes. There are 20 tasks: GoPro, Nintendo Switch, DSLR, portable printer, computer, Nespresso machine, standalone printer, big coffee machine, IKEA furniture (stool, utility cart, tray table, nightstand), NavVis laser scanner, ATV motorcycle, wheel belt, and circuit breaker.  There are between 25 and 180 sessions per activity and sessions range from 47 to 1390 minutes.
Figure 2: Data distribution captured in HoloAssist. On the left, the number of sessions per activity. On the right, the total session length in minutes.
HoloAssist includes action and conversational annotations and provides summaries of videos indicating mistakes and interventions during tasks. Each action is tagged with a “mistake” or “correct” attribute, while spoken statements are labeled with intervention types.  The image shows examples of each of these.
Figure 3: HoloAssist includes action and conversational annotations, and it also provides summaries of videos indicating mistakes and interventions during tasks. Each action is tagged with a “mistake” or “correct” attribute, while spoken statements are labeled with intervention types.

Towards proactive AI assistants

Our work builds on previous advancements in egocentric vision and embodied AI. Unlike earlier datasets, such as those listed in Table 1, HoloAssist stands out due to its multi-person, interactive task-execution setting. Human interaction during task execution provides a valuable resource for designing AI assistants that are anticipatory and proactive that can provide precisely timed instructions that are grounded in the environment, in contrast with current “chat-based” AI assistants that wait for you to ask a question. This unique scenario is ideal for developing assistive AI agents and complements existing datasets, which contribute rich knowledge and representation.

The table shows a comparison of nine related datasets and simulation platforms and for each dataset the setting, whether it is collaborative and interactive, instructional and procedural, and the number of hours of video.  HoloAssist features a multi-person assistive setting which is a unique addition to existing first-person (egocentric) datasets.
Table 1: Comparison of related datasets and simulation platforms. HoloAssist features a multi-person assistive setting, which is a unique addition to existing egocentric (first-person) datasets.

Finally, we evaluated the dataset’s performance on action classification and anticipation tasks, providing empirical results that shed light on the role of different modalities in various tasks. With this dataset, we introduce new tasks and benchmarks focused on mistake detection, intervention type prediction, and 3D hand pose forecasting, all crucial elements for developing intelligent assistants.

Looking forward

This work represents an initial step in broader research that explores how intelligent agents can collaborate with humans in real-world tasks. We’re excited to share this work and our dataset with the community and, anticipate numerous future directions, such as annotating object poses, investigating object-centric models of affordance and manipulations in AI assistance, and AI-assisted planning and state tracking, among others. We believe HoloAssist, along with its associated benchmarks and tools, will benefit future research endeavors focused on building powerful AI assistants for real-world everyday tasks. You can access the HoloAssist dataset and code on GitHub (opens in new tab).

Contributors

Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bohus, Ashley Feniello, Bugra Tekin, Felipe Vieira Frujeri, Marc Pollefeys

The post HoloAssist: A multimodal dataset for next-gen AI copilots for the physical world appeared first on Microsoft Research.

Read More

Intern Insights: Dr. Madeleine Daepp with Jennifer Scurrell and Alejandro Cuevas

Intern Insights: Dr. Madeleine Daepp with Jennifer Scurrell and Alejandro Cuevas

photos of PhD students Jennifer Scurrell and Alejandro Cuevas along with Senior Researcher Dr. Madeleine Daepp for the Microsoft Research podcast

Every year, interns from academic institutions around the world apply and grow their knowledge as members of the research community at Microsoft. In this Microsoft Research Podcast series, these students join their internship supervisors to share their experience working alongside some of the leading researchers in their respective fields. 

In this episode, PhD students Jennifer Scurrell (opens in new tab) and Alejandro Cuevas (opens in new tab) talk to Senior Researcher Dr. Madeleine Daepp (opens in new tab). They discuss the internship culture at Microsoft Research, from opportunities to connect with researchers they admire over coffee to the teamwork they say helped make it possible for them to succeed in the fast-paced environment of industry, and the impact they hope to have with their work.

The post Intern Insights: Dr. Madeleine Daepp with Jennifer Scurrell and Alejandro Cuevas appeared first on Microsoft Research.

Read More

Accelerate Foundation Models Research: Supporting a global academic research ecosystem for AI

Accelerate Foundation Models Research: Supporting a global academic research ecosystem for AI

abstract colors

The latest advances in artificial intelligence have sparked broad public interest and excitement, and the sciences are no exception. Increasingly capable foundation models are fuelling a fundamental shift in computing research, natural sciences, social sciences, and even computing education itself. As industry-led advances in AI continue to reach new heights, Microsoft Research believes that a vibrant and diverse research ecosystem is essential to realizing the promise of AI. This means ensuring that the academic research community, and especially researchers working outside computer science, can tap into these capabilities. Their depth and breadth of expertise across disciplines, cultures and languages can contribute meaningfully to our ability to use AI to address some of the world’s greatest technical, scientific, and societal challenges.

To this end, Microsoft Research has established Accelerate Foundation Models Research (AFMR), a new initiative that brings together an interdisciplinary research community to pursue three goals:

  • Aligning AI with shared human goals, values, and preferences via research on models, which enhances safety, robustness, sustainability, responsibility, and transparency, while also exploring new evaluation methods to measure the rapidly growing capabilities of new models.
  • Improving human interactions via sociotechnical research, which enables AI to extend human ingenuity, creativity and productivity, while also working to reduce inequities of access and working to ensure positive benefits for people and societies worldwide.
  • Accelerating scientific discovery in natural sciences through proactive knowledge discovery, hypothesis generation, and multiscale multimodal data generation.

AFMR is a global research network and a resource platform that enables researchers in computer science and many other disciplines to engage with some of the greatest technical and societal challenges of our time. This includes a grant program that provides access to state-of-the-art foundation models hosted through Microsoft Azure AI.

Microsoft Research Podcast

Collaborators: Gov4git with Petar Maymounkov and Kasia Sitkiewicz

Gov4git is a governance tool for decentralized, open-source cooperation, and is helping to lay the foundation for a future in which everyone can collaborate more efficiently, transparently, and easily and in ways that meet the unique desires and needs of their respective communities.


The goal is to foster more collaborations across disciplines, institutions, and sectors, and to unleash the full potential of AI for a wide range of research questions, applications, and societal contexts.

Following a successful pilot program and initial call for proposals (CFP), details of which are provided below, we are committed to continuing this work and can expect to solicit additional proposals throughout the coming year. Visit the AFMR site to learn more about upcoming programs and events, read peer-reviewed work that has resulted from the program and find resources to accelerate research and collaborations. 

Inspiring research in the era of AI

When ChatGPT was released in the fall of 2022, it quickly became clear that this new technology and tool would play a central role in AI computing research and applications.

“As a natural language processing (NLP) researcher, I was excited at first by ChatGPT’s potential to stimulate an AI revolution,” said Evelyne Viegas, senior director of research engagement at Microsoft Research. “Soon, I became concerned about a potential lack of access to this resource outside of industry, which could delay important progress in academic settings.”

When Microsoft enabled access to OpenAI models (Embeddings series, GPT-3.5-Turbo series, and GPT-4 series) via the Azure AI services, it created an opportunity to engage with the academic community to learn about their needs and aspirations and start enabling them. A team at Microsoft Research conducted a pilot program offering model access to a small number of participants, and the success of this effort inspired a broader and more sustained program.

Research topics undertaken as part of the pilot reflect the ambitions of AI research at Microsoft in understanding general AI, driving model innovation, ensuring social benefit, transforming scientific discovery, and extending human capabilities across different domains (e.g., astronomy, education, health, law, society).

Although the research supported by this pilot is still underway, the examples below illustrate the possibilities of opening access to leading-edge models to a diverse group of researchers:

Integrating ChatGPT into English as a Foreign Language (EFL) Writing Education – Korea Advanced Institute of Science and Technology (KAIST)

This project explores how students can utilize generative AI for interactive revision in EFL writing. Because the majority of KAIST courses are given in English, the sooner non-English speakers can learn the language the better they will be able to participate in their classes. While earlier chatbots have been used for EFL, language learners found them unengaging. With Azure OpenAI Service, the KAIST team is gathering data to show how the unique capabilities of a GPT-4-based chatbot are accelerating learning while making the learner’s experience more engaging.

Lightweight Adaptation of LLMs for Healthcare Applications – Stanford University

This work focuses on accelerating the task of report summarization for radiologists to improve workflow and decrease the time needed to generate an accurate report. It uses domain adaptation via pretraining on biomedical text, or clinical text and discrete prompting or fine-tuning. Initial results are promising, showing the added value of using foundation models for some clinical tasks.

AI-Based Traffic Monitoring System using Physics-Informed Neural Networks and GPT Models – North Carolina A&T State University

Researchers are creating a traffic monitoring system using data collected from unmanned aerial vehicles (UAVs) to fine-tune foundation models for video analysis and traffic state estimation. This work can directly benefit transportation agencies and city planners, helping them understand traffic patterns, congestion, and safety hazards.

Forging New Horizons in Astronomy – Harvard University

This project seeks to enhance human interaction with astronomy literature utilizing the capabilities of the large language models (LLM), particularly GPT-4. This work employs in-context prompting techniques to expose the model to astronomy papers to build an astronomy-focused chat application to engage the broader community.

Expanding AFMR

Much experimentation remains to be done with foundation models. The AFMR CFP invited the community to develop proposals focused on the goals and questions below:

  • Aligning AI systems with human goals and preferences
  • Advancing beneficial applications of AI
  • Accelerating scientific discovery in the natural and life sciences

The response to the AFMR Fall CFP has been phenomenal, with close to 400 proposals from 170 universities across 33 countries.

“Research undertaken by the principal investigators brings the promise to advance research across a greater breadth of research pursuits, application domains, and societal contexts than we could have imagined,” Viegas said. “It covers a vast range of scientific and sociotechnical topics: creativity, culture, economy, education, finance, health, causality, evaluation, augmentation and adaptation, multimodal, responsible AI, robotics, scientific discovery, software and society. It is inspiring to see experts from different countries with different cultures, languages, institutions, and departments, including computer science, social science, natural sciences, humanities, medicine, music, all come together to work on democratizing AI and work on solving some of the greatest technical and societal challenges of tomorrow.”

The post Accelerate Foundation Models Research: Supporting a global academic research ecosystem for AI appeared first on Microsoft Research.

Read More

Research Focus: Week of September 25, 2023

Research Focus: Week of September 25, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Research Focus 25 | Week of September 25, 2023

NEW RESEARCH

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

Large Language Model (LLM) inference consists of two distinct phases – prefill phase, which processes the input prompt, and decode phase, which generates output tokens autoregressively. While the prefill phase effectively saturates graphics processing unit (GPU) compute at small batch sizes, the decode phase results in low compute utilization as it generates one token at a time per request. The varying prefill and decode times also lead to imbalance across micro-batches when using pipeline parallelism, resulting in further inefficiency due to bubbles.

In a new paper: SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills, researchers from Microsoft present a solution to these challenges that yields significant improvements in inference performance across models and hardware. SARATHI employs chunked-prefills, which splits a prefill request into equal sized chunks, and decode-maximal batching, which constructs a batch using a single prefill chunk and populates the remaining slots with decodes. Chunked-prefills allow constructing multiple decode-maximal batches from a single prefill request, maximizing coverage of decodes that can piggyback. Furthermore, the uniform compute design of these batches ameliorates the imbalance between micro-batches, significantly reducing pipeline bubbles.

Spotlight: On-demand video

AI Explainer: Foundation models ​and the next era of AI

Explore how the transformer architecture, larger models and more data, and in-context learning have helped advance AI from perception to creation.


NEW RESEARCH

DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory

Controllable video generation has gained significant attention in recent years. However, two main limitations persist: Firstly, most existing works focus on either text, image, or trajectory-based control, leading to an inability to achieve fine-grained control in videos. Secondly, trajectory control research is still in its early stages, with most experiments being conducted on simple datasets like Human3.6M (opens in new tab). This constraint limits the models’ capability to process open-domain images and effectively handle complex curved trajectories.  

In a new paper: DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory, researchers from Microsoft propose an open-domain diffusion-based video generation model. To tackle the issue of insufficient control granularity in existing works, DragNUWA simultaneously introduces text, image, and trajectory information to provide fine-grained control over video content from semantic, spatial, and temporal perspectives. To resolve the problem of limited open-domain trajectory control in current research, the researchers propose trajectory modeling with three aspects: a trajectory sampler (TS) to enable open-domain control of arbitrary trajectories, a multiscale fusion (MF) to control trajectories in different granularities, and an adaptive training (AT) strategy to generate consistent videos following trajectories. Their experiments demonstrate DragNUWA’s superior performance in fine-grained control in video generation.

DragNUWA is purely a research project and there are no current plans to incorporate DragNUWA into a product. Any further research will continue to follow Microsoft AI principles.

NEW RESEARCH

Seeing through the Brain: Image Reconstruction of Visual Perception from Human Brain Signals

Understanding cortical responses to human visual perception has emerged a research hotspot. Yet, the underlying mechanism of how human visual perceptions are intertwined with our cognitions is still a mystery. Thanks to recent advances in both neuroscience and artificial intelligence, researchers have been able to record the visually evoked brain activities and mimic the visual perception ability through computational approaches. 

In a new paper: Seeing through the Brain: Image Reconstruction of Visual Perception from Human Brain Signals, researchers from Microsoft reconstruct observed images based on portably accessible brain signals, i.e., electroencephalography (EEG) data. Since EEG signals are dynamic in the time-series format and are notoriously noisy, processing and extracting useful information requires more dedicated efforts. The researchers propose a comprehensive pipeline, named NeuroImagen, to incorporate a novel multi-level perceptual information decoding to draw multi-grained and heterogeneous outputs from the given EEG data. A pretrained latent diffusion model then leverages the extracted semantic information to reconstruct the high-resolution visual stimuli images. The experimental results illustrate the effectiveness of image reconstruction and superior quantitative performance of the proposed method.

The post Research Focus: Week of September 25, 2023 appeared first on Microsoft Research.

Read More

AutoGen: Enabling next-generation large language model applications

AutoGen: Enabling next-generation large language model applications

“Capabilities like AutoGen are poised to fundamentally transform and extend what large language models are capable of. This is one of the most exciting developments I have seen in AI recently.”

Doug Burger, Technical Fellow, Microsoft

Figure 1 shows three shaded boxes, each containing symbols that represent AutoGen agents and the large language models, tools, and humans that comprise them, and illustrates how AutoGen agents can converse to solve tasks.
Figure 1. AutoGen enables complex LLM-based workflows using multi-agent conversations. (Left) AutoGen agents are customizable and can be based on LLMs, tools, humans, and even a combination of them. (Top-right) Agents can converse to solve tasks. (Bottom-right) The framework supports many additional complex conversation patterns.

It requires a lot of effort and expertise to design, implement, and optimize a workflow that can leverage the full potential of large language models (LLMs). Automating these workflows has tremendous value. As developers begin to create increasingly complex LLM-based applications, workflows will inevitably grow more intricate. The potential design space for such workflows could be vast and complex, thereby heightening the challenge of orchestrating an optimal workflow with robust performance.

AutoGen is a framework for simplifying the orchestration, optimization, and automation of LLM workflows. It offers customizable and conversable agents that leverage the strongest capabilities of the most advanced LLMs, like GPT-4, while addressing their limitations by integrating with humans and tools and having conversations between multiple agents via automated chat.

SPOTLIGHT: AI focus area

AI and Microsoft Research

Learn more about the breadth of AI research at Microsoft


With AutoGen, building a complex multi-agent conversation system boils down to:

  • Defining a set of agents with specialized capabilities and roles.
  • Defining the interaction behavior between agents, i.e., what to reply when an agent receives messages from another agent.

Both steps are intuitive and modular, making these agents reusable and composable. For example, to build a system for code-based question answering, one can design the agents and their interactions as in Figure 2. Such a system is shown to reduce the number of manual interactions needed from 3x to 10x in applications like supply-chain optimization (opens in new tab). Using AutoGen leads to more than a 4x reduction in coding effort.

Figure 2 illustrates an example workflow with dotted-line relationships between three AutoGen agents—Commander, Writer, and Safeguard—and how the agents work together to answer code-based questions from users. (opens in new tab)
Figure 2. An example workflow to address code-based question answering in supply-chain optimization (opens in new tab). The Commander receives user questions and coordinates with the Writer and Safeguard. The Writer crafts the code and interpretation, the Safeguard ensures safety, and the Commander executes the code. If issues arise, the process can repeat until resolved. Shaded circles represent steps that may be repeated multiple times.

Capable, conversable, and customizable agents – integrating LLMs, humans, and tools

AutoGen agents have capabilities enabled by LLMs, humans, tools, or a mix of those elements. For example:

One straightforward way of using built-in agents from AutoGen is to invoke automated chat between an assistant agent and a user proxy agent. As an example (Figure 3), one can easily build an enhanced version of ChatGPT + Code Interpreter + plugins, with a customizable degree of automation, usable in a custom environment and embeddable in a bigger system. It is also easy to extend their behavior to support diverse application scenarios, such as adding personalization and adaptability based on past interactions (e.g., automated continual learning (opens in new tab), teach agents new skills (opens in new tab)).

Figure 3 shows the details of a chat between an assistant agent and a user proxy agent to illustrate how AutoGen automates such chats, while seamlessly engaging humans or using tools as needed to complete complex tasks.
Figure 3. A user proxy agent and assistant agent from AutoGen can be used to build an enhanced version of ChatGPT + Code Interpreter + plugins. The assistant agent plays the role of an AI assistant like Bing Chat. The user proxy agent plays the role of a user and simulates users’ behavior such as code execution. AutoGen automates the chat between the two agents, while allowing human feedback or intervention. The user proxy seamlessly engages humans and uses tools when appropriate.

The agent conversation-centric design has numerous benefits, including that it:

  • Naturally handles ambiguity, feedback, progress, and collaboration.
  • Enables effective coding-related tasks, like tool use with back-and-forth troubleshooting.
  • Allows users to seamlessly opt in or opt out via an agent in the chat.
  • Achieves a collective goal with the cooperation of multiple specialists.

AutoGen supports automated chat and diverse communication patterns, making it easy to orchestrate a complex, dynamic workflow and experiment with versatility. Figure 4 illustrates a new game, conversational chess (opens in new tab), enabled by AutoGen. Figure 5 illustrates how AutoGen supports group chats (opens in new tab) between multiple agents using another special agent called the “GroupChatManager”.

Figure 4 displays two small chessboards side-by-side, with black and white chess pieces in various positions on each board showing a game in progress, plus a chat between two users, to illustrate how AI, human, or hybrid users can play conversational chess. (opens in new tab)
Figure 4. An example of a new application enabled by AutoGen: conversational chess (opens in new tab). It can support various scenarios, as each player can be an LLM-empowered AI, a human, or a hybrid of the two. It allows players to express their moves creatively, such as using jokes, meme references, and character-playing, making chess games more entertaining to players as well as observers.
Figure 5 shows three shaded boxes, each containing symbols that represent various agents, to illustrate how AutoGen enables dynamic group chats. Each box represents a different step in the three-step process. (opens in new tab)
Figure 5. Overview of how AutoGen enables dynamic group chats (opens in new tab) to solve tasks: We use a special agent called the Manager that repeats the following three steps—select a single speaker (in this case Bob), ask the speaker to respond, and broadcast the selected speaker’s message to all the other agents.

(opens in new tab)Getting started

AutoGen (opens in new tab) (in preview) is freely available as a Python package. To install it, run

pip install pyautogen

You can quickly enable a powerful experience with just a few lines of code:

import autogen
assistant = autogen.AssistantAgent("assistant")
user_proxy = autogen.UserProxyAgent("user_proxy")
user_proxy.initiate_chat(assistant, message="Show me the YTD gain of 10 largest technology companies as of today.")
# This triggers automated chat to solve the task

Check examples for a wide variety of tasks: https://microsoft.github.io/autogen/docs/Examples/AutoGen-AgentChat (opens in new tab).

Next steps:

AutoGen is an open-source, community-driven project under active development (as a spinoff from FLAML (opens in new tab), a fast library for automated machine learning and tuning), which encourages contributions from individuals of all backgrounds. Many Microsoft Research collaborators have made great contributions to this project, including academic contributors like Pennsylvania State University and the University of Washington, and product teams like Microsoft Fabric and ML.NET. AutoGen aims to provide an effective and easy-to-use framework for developers to build next-generation applications, and already demonstrates promising opportunities to build creative applications and provide a large space for innovation.

Names of Microsoft contributors: 

Chi Wang, Gagan Bansal, Eric Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, Ahmed Awadallah, Ryen White, Doug Burger, Robin Moeur, Victor Dibia, Adam Fourney, Piali Choudhury, Saleema Amershi, Ricky Loynd, Hamed Khanpour

The post AutoGen: Enabling next-generation large language model applications appeared first on Microsoft Research.

Read More

Neural Graphical Models

Neural Graphical Models

This research paper was presented at the 17th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty (opens in new tab), a premier forum for advances in the theory and practice of reasoning under uncertainty.

ECSQARU Blog Hero:
Neural Graphical Models

In the field of reasoning under uncertainty, probabilistic graphical models (PGMs) stand out as a powerful tool for analyzing data. They can represent relationships between features and learn underlying distributions that model functional dependencies between them. Learning, inference, and sampling are operations that make graphical models useful for domain exploration.  

In a broad sense, learning involves fitting the distribution function parameters from data, and inference is the procedure of answering queries in the form of conditional distributions with one or more observed variables. Sampling entails the ability to extract samples from the underlying distribution as defined by the graphical model. A common challenge with graphical model representations lies in the high computational complexity of one or more of these operations.   

Various graphical models impose restrictions on the set of distributions or types of variables in the domain. Some graphical models work with continuous variables only (or categorical variables only) or place restrictions on the graph structure, for example, the constraint that continuous variables cannot be parents of categorical variables in a directed acyclic graph (DAG). Other restrictions affect the set of distributions the models can represent, for example, only multivariate Gaussian distributions.

Microsoft Research Podcast

Collaborators: Gov4git with Petar Maymounkov and Kasia Sitkiewicz

Gov4git is a governance tool for decentralized, open-source cooperation, and is helping to lay the foundation for a future in which everyone can collaborate more efficiently, transparently, and easily and in ways that meet the unique desires and needs of their respective communities.


In our paper, “Neural Graphical Models (opens in new tab),” presented at ECSQARU 2023 (opens in new tab), we propose Neural Graphical Models (NGMs), a new type of PGM that learns to represent the probability function over the domain using a deep neural network. The parameterization of such a network can be learned from data efficiently, with a loss function that jointly optimizes adherence to the dependency structure, given as input in the form of a directed or undirected graph, and fit to the data. Probability functions represented by NGMs are unrestricted by any of the common restrictions inherent in other PGMs. NGMs can handle various input types: categorical, continuous, images and embedding representations. They also support efficient inference and sampling.

Figure 1 - The image on the left shows an undirected network graph with five variables: x1, x2, x3, x4 and x5. The variable x3 is connected to all other variables, and x1 is directly connected to x3 and x4 only. The annotation next to the nodes indicates that the value of each variable is a function of the values of its neighbors. For example, the value of x1 is a function of x3 and x4, the value of x2 is a function of x3, and so on. On the right, we see a table representing the adjacency matrix for the same graph, with both rows and columns labeled with variables names from x1 to x5. The cells show either ones or zeros. The ones indicate a presence of an edge, for example in the cell on the intersection of the row labeled x1 and the column labeled x3.
Figure 1: Graphical view of NGMs: The input graph G (undirected) for given input data X. Each feature ( x_i=f_i(text{Nbrs}(x_i))) is a function of the neighboring features. For a DAG, the functions between features will be defined by the Markov Blanket relationship ( x_i=f_i(text{MB}(x_i))). On the right, the adjacency matrix represents the associated dependency structure S.
Figure 2 - The image shows a neural network. The input layer has five variables: x1, x2, …, x5, and the corresponding output layer has the same five variables. Between the input and output layers there is one hidden layer with six nodes. Some of the units in the input layer are connected to the units in the hidden layer, and some of the units in the hidden layer are connected to the units in the output layer. A careful examination shows that there is a path from a unit xi in the input layer to a unit xj in the output layer whenever there is an edge from the xi node to the xj node in the graph in Figure 1. Note that there are no self-paths, that is, paths from xi in the input layer to xi in the output layer. Some of the remaining neural network connections representing zeroed-out weights are shown in dashed black lines.
Figure 2: Neural view of NGMs: This is a neural network as a multitask learning architecture capturing nonlinear dependencies for the features of the undirected graph in Figure 1. The presence of a path from the input to the output features indicates a dependency between them. The dependency matrix between the input and output of the NN reduces to matrix product operation (S_{nn}=Pi_i|W_i|=|W_1|times|W_2|). Note that not all the zeroed-out weights of the MLP (in black-dashed lines) are shown for the sake of clarity.

Experimental validations for NGMs

In our paper (opens in new tab), we evaluate NGMs’ performance, inference accuracy, sensitivity to the input graph, and ability to recover the input dependency structure when trained on both real and synthetic data: Infant mortality data (opens in new tab) from the Centers for Disease Control and Prevention (CDC), synthetic Gaussian Graphical model data, and lung cancer data from Kaggle. 

The infant mortality dataset (opens in new tab) describes pregnancy and birth variables for all live births in the US and, in instances of infant death before the first birthday, the cause of death. We used the latest available data, which includes information about 3,988,733 live births in the US during 2015. It was particularly challenging to evaluate the inference accuracy of NGMs using this dataset due to the (thankfully) rare occurrence of infant deaths during the first year of life, making queries concerning such low probability events hard to accurately estimate.  

We used the CDC data to evaluate the NGMs’ inference accuracy. We compared their prediction for four variables of various types: gestational age (ordinal, expressed in weeks), birth weight (continuous, specified in grams), survival until the first birthday (binary) and the cause of death. We used the categories of “alive,” the 10 most common causes of death, or “other” for the less common causes. Here, “alive” was indicated for 99.48% of infants. We also compared the performance of logistic regression, Bayesian networks, Explainable Boosting Machines (EBM), and NGMs. In case of NGMs, we trained two models: one using the Bayesian network graph and one using the uGLAD graph.

Our results demonstrate that NGM are significantly more accurate than logistic regression, more accurate than Bayesian networks, and on par with EBM models for categorical and ordinal variables. They particularly shine when predicting very low probability categories for multi-valued variable cause of death, where, in contrast most models (such as both PGMs and classification models) typically struggle. Note that while we need to train a separate LR and EBM model for each outcome variable evaluated, all variables can be predicted within one trained NGM model. Interestingly, the two NGM models show similar accuracy results despite the differences in the two dependency structures used in training. 

We believe that NGMs are an interesting amalgam of the deep learning architectures’ expressivity, and PGMs’ representation capabilities and can be applied in many domains, given that they place no restrictions on input types and distributions. We encourage you to explore NGMs and take advantage of the ability to work with a wider range of distributions and inputs. You can access the code for Neural Graphical Models on GitHub (opens in new tab).

The post Neural Graphical Models appeared first on Microsoft Research.

Read More

Announcing the DeepSpeed4Science Initiative: Enabling large-scale scientific discovery through sophisticated AI system technologies

Announcing the DeepSpeed4Science Initiative: Enabling large-scale scientific discovery through sophisticated AI system technologies

DeepSpeed4Science Initiative - graphic with 6 icons

Introduction 

In the next decade, deep learning may revolutionize the natural sciences, enhancing our capacity to model and predict natural occurrences. This could herald a new era of scientific exploration, bringing significant advancements across sectors from drug development to renewable energy. In line with Microsoft’s mission to empower every person and every organization on the planet to achieve more, the DeepSpeed (opens in new tab) team at Microsoft is responding to this opportunity by launching a new initiative called DeepSpeed4Science (opens in new tab), aiming to build unique capabilities through AI system technology innovations to help domain experts to unlock today’s biggest science mysteries.

The DeepSpeed (opens in new tab) system is an industry leading open-source AI system framework, developed by Microsoft, that enables unprecedented scale and speed for deep learning training and inference on a wide range of AI hardware. Figure 1 demonstrates our basic approach to this new initiative. By leveraging DeepSpeed’s current technology pillars (training, inference and compression) as base technology enablers, DeepSpeed4Science will create a new set of AI system technologies tailored for accelerating scientific discoveries by addressing their unique complexity beyond the common technical approaches used for accelerating generic large language models (LLMs). We work closely with internal and external teams who own AI-driven science models that represent key science missions, to identify and address general domain-specific AI system challenges. This includes climate science, drug design, biological understanding, molecular dynamics simulation, cancer diagnosis and surveillance, catalyst/material discovery, and other domains.

Figure 1: It is a three-tier diagram. From bottom to top wise (vertically), it describes our basic approach for executing DeepSpeed4Science initative. Bottom section represents the current three pillars of
the DeepSpeed framework, including training, inference and compression. The middle layer, which is what this particular blog is about, is creating a new set of AI system technologies that are beyond generic large language model support, tailored for accelerating scientific discoveries and addressing their complexity. The very top layer represents gemera; AI-driven science models across different domains, which can be supported by DeepSpeed4Science software support.
Figure 1: DeepSpeed4Science approach: developing a new set of AI system technologies that are beyond generic large language model support, tailored for accelerating scientific discoveries and addressing their complexity.

Our long-term vision is to develop DeepSpeed4Science into a new platform and a unified repository for sharing advanced AI system technologies that support scientific discoveries. DeepSpeed4Science is designed to be inclusive, echoing Microsoft’s AI for Good commitment. That is reflected in the initiative’s support for a diverse group of signature science models, representing some of the most critical AI for science investments. In this blog, we showcase how DeepSpeed4Science helps address two of their critical system challenges in structural biology research: (1) eliminating memory explosion problems for scaling Evoformer-centric protein-structure prediction models, and (2) enabling very-long sequence support for better understanding the evolutionary landscape of pandemic-causing viruses.

Microsoft Research Podcast

Collaborators: Renewable energy storage with Bichlien Nguyen and David Kwabi

Dr. Bichlien Nguyen and Dr. David Kwabi explore their work in flow batteries and how machine learning can help more effectively search the vast organic chemistry space to identify compounds with properties just right for storing waterpower and other renewables.


Our launch and key collaborators 

The new system technologies enabled by DeepSpeed4Science can empower AI-driven scientific discoveries using signature models that represent a wide spectrum of efforts pushing the boundaries of science. Currently, DeepSpeed4Science is honored to support several key science models from Microsoft Research AI4Science (opens in new tab), Microsoft WebXT/Bing (opens in new tab) and U.S. DoE National Labs (opens in new tab).

Current Microsoft internal partnerships 

Scientific Foundation Model (SFM), Microsoft Research AI4Science

Figure 2: This figure contains two peices. The top piece represents the general methodology of buliding this scientific foundtaion model (SFM). The bottom section is a GIF that illustrates one important apporach that has been developed by Microsoft on protein structure prediction through Distributional Graphormer. Unlike the other protein prediction methods on the market, Distributional Graphormer claims that molecules are not rigid, rather they are dynamic that can adopt different structures with different probabilities at equilibrium. Distributional Graphormer is the first computational method that can predict equilibrium distribution of molecules by advanced generative AI technology.
Figure 2: This figure contains two peices. The top piece represents the general methodology of buliding this scientific foundtaion model (SFM). The bottom section is a GIF that illustrates one important apporach that has been developed by Microsoft on protein structure prediction through Distributional Graphormer. Unlike the other protein prediction methods on the market, Distributional Graphormer claims that molecules are not rigid, rather they are dynamic that can adopt different structures with different probabilities at equilibrium. Distributional Graphormer is the first computational method that can predict equilibrium distribution of molecules by advanced generative AI technology.
Figure 2: Scientific foundation model (SFM) and its current exploration: Distributional Graphormer.

Scientific foundation model (SFM) aims to create a unified large-scale foundation model to empower natural scientific discovery by supporting diverse inputs, multiple scientific domains (e.g., drugs, materials, biology, health, etc.) and computational tasks. The DeepSpeed4Science partnership will provide new training and inference technologies to empower the SFM team’s continuous research on projects like Microsoft’s new generative AI methods, such as Distributional Graphormer.

ClimaX, MSR AI4Science

Figure 3: The diagram of a foundation model for weather modeling is shown here. Our changing climate is producing more frequent extreme weather events. To mitigate the negative effects, it is increasingly important to predict where these events will occur. ClimaX is the first foundation model designed to perform a wide variety of weather and climate modeling tasks. It can absorb many different datasets with different variables and resolutions, potentially improving weather forecasting.
Figure 3: ClimaX is the first foundation model designed to perform a wide variety of weather and climate modeling tasks.

Our changing climate is producing more frequent extreme weather events. To mitigate the negative effects, it is increasingly important to predict where these events will occur. ClimaX is the first foundation model designed to perform a wide variety of weather and climate modeling tasks. It can absorb many different datasets with different variables and resolutions, potentially improving weather forecasting. DeepSpeed4Science is creating new system supports and acceleration strategies for ClimaX for efficiently pretraining/finetuning bigger foundation models while handling very large high-resolution image data (e.g., tens to hundreds of petabytes) with long sequences.

AI Powered Ab Initio Molecular Dynamics (AI2MD), MSR AI4Science

Figure 4:This animated figure illustrates one million steps of a molecular dynamics simulation, e.g., RBD-protein interacts with protein inhibitor. Simulations like this are efficient enough to generate trajectories long enough to observe chemically significant events.
Figure 4: One million steps of molecular dynamics simulation: RBD-protein interacts with protein inhibitor.

This project simulates the dynamics of large (million-atom) molecular systems with near ab initio accuracy using AI-powered force field models while maintaining the efficiency and scalability of classical molecular dynamics. The simulations are efficient enough to generate trajectories long enough to observe chemically significant events. Typically, millions or even billions of inference steps are required for this process. This poses a significant challenge in optimizing the inference speed of graph neural network (GNN)+ LLM models, for which DeepSpeed4Science will provide new acceleration strategies.

Weather from Microsoft Start, Microsoft WebXT/Bing

Figure 5: This figure shows Microsoft Start precipitation nowcast application on Bing, i.e., every 4 minutes for the next 4 hours. Weather from Microsoft Start provides precise weather information to help users make better decisions for their lifestyles, health, jobs and activities – including accurate 10-day global weather forecasts updated multiple times every hour.
Figure 5: Microsoft Start precipitation nowcast (every 4 minutes for the next 4 hours).

Weather from Microsoft Start (opens in new tab) provides precise weather information to help users make better decisions for their lifestyles, health, jobs and activities (opens in new tab) – including accurate 10-day global weather forecasts updated multiple times every hour.  Previously, Weather from Microsoft Start benefited from DeepSpeed technologies to accelerate their multi-GPU training environments. Currently, DeepSpeed4Science is working with the WebXT weather team to further enhance Microsoft Weather services with cutting-edge features and improvements.

Current external collaborators 

DeepSpeed4Science’s journey started with two pioneering LLM-based AI models for structural biology research: OpenFold (opens in new tab) from Columbia University, an open-sourced high-fidelity protein structure prediction model; and GenSLMs (opens in new tab) from Argonne National Laboratory (opens in new tab), an award-winning genome-scale language model (opens in new tab) for learning the evolutionary landscape of SARS-CoV-2 (COVID-19) genomes. As the featured showcases for this release, they represent two common AI system challenges facing today’s AI-driven structural biology research. We will discuss how DeepSpeed4Science empowered their scientific discovery in the next section.  

Additionally, DeepSpeed4Science has recently expanded its scope to support a more diverse range of science models. For example, in our work with Argonne on training a trillion-parameter science model on Aurora Exascale system (opens in new tab), DeepSpeed4Science technologies will help them reach the performance requirements and scalability needed for this critical mission. Furthermore, by collaborating with Oak Ridge National Lab (opens in new tab) and National Cancer Institute (NCI) (opens in new tab) on cancer surveillance, DeepSpeed4Science will help enable high-fidelity extraction and classification of information from unstructured clinical texts for the MOSSAIC project (opens in new tab).  DeepSpeed4Science technologies will also be adopted by Brookhaven National Laboratory (opens in new tab) to support development of a large digital twin model for clean energy research by using LLMs to produce more realistic simulation data. You can find more detailed information about our external colleagues and their science missions at DeepSpeed4Science (opens in new tab).

Partnership showcases 

Showcase (I): DeepSpeed4Science eliminates memory explosion problems for scaling Evoformer-centric structural biology models via DS4Sci_EvoformerAttention

Figure 6: The top figure illustrates the prediction demonstration from AlphaFold2 and OpenFold against the baseline experiemental result. OpenFold is a community reproduction of DeepMind’s AlphaFold2 that makes it possible to train or finetune AlphaFold2 on new datasets. Researchers have used it to retrain AlphaFold2 from scratch to produce new sets of model parameters, studied the early training phase of AlphaFold2 (shown as the bottom figure), and developed new protein folding systems. The bottom figure demonstrates OpenFold's predictions for PDB chain 7B3A_A as the model trains.
Figure 6: The top figure illustrates the prediction demonstration from AlphaFold2 and OpenFold against the baseline experiemental result. OpenFold is a community reproduction of DeepMind’s AlphaFold2 that makes it possible to train or finetune AlphaFold2 on new datasets. Researchers have used it to retrain AlphaFold2 from scratch to produce new sets of model parameters, studied the early training phase of AlphaFold2 (shown as the bottom figure), and developed new protein folding systems. The bottom figure demonstrates OpenFold's predictions for PDB chain 7B3A_A as the model trains.
Figure 6: OpenFold predictions for PDB chain 7B3A_A as the model trains.

OpenFold (opens in new tab) is a community reproduction of DeepMind’s AlphaFold2 (opens in new tab) that makes it possible to train or finetune AlphaFold2 on new datasets. Researchers have used it to retrain AlphaFold2 from scratch to produce new sets of model parameters, studied the early training phase of AlphaFold2 (Figure 6), and developed new protein folding systems.

Figure 7: It shows the peak memory requirement for training variants of the multiple sequence alignment (MSA) attention kernels (with bias) with the maximum possible training sample dimension in OpenFold. (Left) The original OpenFold implementation with EvoformerAttention used in AlphaFold2. The memory explosion problems in training/inference for these types of protein structure prediction models are common. Particularly, state-of-the-art FlashAttention cannot effectively support such science attention variants. (Right) A new solution from DeepSpeed4Science called DS4Sci_EvoformerAttention significantly reduces OpenFold’s peak memory requirement for training by 13X without accuracy loss.
Figure 7: Peak memory requirement for training variants of the multiple sequence alignment (MSA) attention kernels (with bias) with the maximum possible training sample dimension in OpenFold. (Left) The original OpenFold implementation with EvoformerAttention used in AlphaFold2. The memory explosion problems in training/inference for these types of protein structure prediction models are common. Particularly, state-of-the-art FlashAttention cannot effectively support such science attention variants. (Right) A new solution from DeepSpeed4Science called DS4Sci_EvoformerAttention significantly reduces OpenFold’s peak memory requirement for training by 13X without accuracy loss.

While OpenFold does include performance and memory optimizations using state-of-the-art system technologies, training AlphaFold2 from scratch is still computationally expensive. The model at the current stage is small in absolute terms, with just 93 million parameters, but it contains several custom attention variants that manifest unusually large activations. During the “finetuning” phase of a standard AlphaFold2 training run, the logit tensor produced in just one of these variants–one designed to attend over the deep protein MSAs fed to the model as input–is in excess of 12GB in half precision alone, dwarfing the peak memory requirements of comparably sized language models. Even with techniques like activation checkpointing and DeepSpeed ZeRO optimizations, this memory explosion problem heavily constrains the sequence lengths and MSA depths on which the model can be trained. Furthermore, approximation strategies can significantly affect the model accuracy and convergence, while still resulting in memory explosion, shown as the left bar (orange) in Figure 7.  

To address this common system challenge in structural biology research (e.g., protein structure prediction and equilibrium distribution prediction), DeepSpeed4Science is addressing this memory inefficiency problem by designing customized exact attention kernels for the attention variants (i.e., EvoformerAttention), which widely appear in this category of science models. Specifically, a set of highly memory-efficient DS4Sci_EvoformerAttention kernels enabled by sophisticated fusion/tiling strategies and on-the-fly memory reduction methods, are created for the broader community as high-quality machine learning primitives. Incorporated into OpenFold, they provide a substantial speedup during training and dramatically reduce the model’s peak memory requirement for training and inference. This allows OpenFold to be experimented with bigger and more complex models, and longer sequences, and trained on a wider spectrum of hardware. Detailed information about this technology can be found at DeepSpeed4Science (opens in new tab).

Showcase (II): DeepSpeed4Science enables very-long sequence support via both systematic and algorithmic approaches for genome-scale foundation models (e.g., GenSLMs)

Figure 8. The dynamic figure dipicts GenSLMs, 2022 ACM Gordon Bell Winning COVID Model (a 25B/33B dense model based on GPT-NeoX). It is used to learn the latent space that describes biologically meaningful properties for SARS-CoV-2 genomes. This GIF is visualizing an important protein family, malate dehydrogenase, and viewing a projection of the latent space colored by important features such as sequence length and GC content (the ratio of the content of the nucleic acids guanine and cytosine in comparison to adenine and thymine. It measures the ability of a DNA strand to withstand heat).
Figure 8: GenSLMs: 2022 ACM Gordon Bell Winning COVID Model (a 25B/33B dense model based on GPT-NeoX). It is used to learn the latent space that describes biologically meaningful properties for SARS-CoV-2 genomes. This GIF is visualizing an important protein family, malate dehydrogenase, and viewing a projection of the latent space colored by important features such as sequence length and GC content (the ratio of the content of the nucleic acids guanine and cytosine in comparison to adenine and thymine. It measures the ability of a DNA strand to withstand heat).

GenSLMs (opens in new tab), a 2022 ACM Gordon Bell award (opens in new tab) winning genome-scale language model from Argonne National Lab, can learn the evolutionary landscape of SARS-CoV-2 (COVID-19) genomes by adapting large language models (LLMs) for genomic data. It is designed to transform how new and emergent variants of pandemic-causing viruses, especially SARS-CoV-2, are identified and classified. GenSLM represents one of the first whole genome-scale foundation models which can generalize to other prediction tasks. A good understanding of the latent space can help GenSLMs tackle new domains beyond just viral sequences and expand their ability to model bacterial pathogens and even eukaryotic organisms, e.g., to understand things such as function, pathway membership, and evolutionary relationships. To achieve this scientific goal, GenSLMs and similar models require very long sequence support for both training and inference that is beyond generic LLMs’ long-sequence strategies like FlashAttention (opens in new tab). Through DeepSpeed4Science’s new designs, scientists can now build and train models with significantly longer context windows, allowing them to explore relationships that were previously inaccessible.

DeepSpeed - Figure 9. The two figures show the maximum sequence lengths of GenSLM models (25 billion parameters and 33 billion parameters) supported by different frameworks at different scales. The hardware profiled here are NVIDIA DGX nodes with eight 40G A100 GPUs per node.
Figure 9: Maximum sequence lengths of GenSLM models supported by different frameworks at different scales. The hardware profiled here are NVIDIA DGX nodes with eight 40G A100 GPUs per node.

Specifically, at system level, we release the newest Megatron-DeepSpeed (opens in new tab) framework for very-long sequence support along with other new optimizations (opens in new tab). Scientists can now train their large science models like GenSLMs with much longer sequences via a synergetic combination of our newly added memory optimization techniques on attention mask and position embedding, tensor parallelism, pipeline parallelism, sequence parallelism, ZeRO-style data parallelism and model state offloading. Figure 9 demonstrates that our new release enables the longest sequence length for GenSLMs’ 25B and 33B models by up to 12X and 14X, respectively, over the previous Megatron-DeepSpeed. In terms of supported sequence lengths, this new framework also significantly outperforms NVIDIA’s Megatron-LM by up to 9.8X and 9.1X for the 25B and 33B models, respectively. For example, GenSLMs’ 25B model can now be trained with a 512K sequence of nucleotides, compared to the Argonne team’s original 42K sequence length on 64 GPUs. This drastically improves model quality and scientific discovery scope with no accuracy loss. Additional support for domain scientists who prefer algorithmic strategies like relative position embedding techniques is also integrated in this new release (opens in new tab).

Summary and roadmap 

We are very proud and excited to announce the DeepSpeed4Science initiative along with several R&D highlights and achievements. Starting today, we will host our new initiative at DeepSpeed4Science (opens in new tab), including information about our external colleagues, and current and future DeepSpeed4Science technology releases. One of our high-level goals is to generalize AI system technologies that broadly address the major system pain points for large-scale scientific discoveries. We hope scientists around the world will enjoy the new capabilities unlocked by DeepSpeed4Science through open-sourced software. We are looking forward to better understanding the AI system design challenges that block your discovery progress. We sincerely welcome your participation to help us build a promising AI4Science future. Please email us at deepspeed-info@microsoft.com (opens in new tab). We encourage you to report issues, contribute PRs, and join discussions on our DeepSpeed GitHub (opens in new tab) page.

Acknowledgements 

Core DeepSpeed4Science Team:  

Shuaiwen Leon Song (DeepSpeed4Science lead), Minjia Zhang, Conglong Li, Shiyang Chen, Chengming Zhang, Xiaoxia (Shirley) Wu, Masahiro Tanaka, Martin Cai, Adam Graham, Charlie Zhou, Yuxiong He (DeepSpeed team lead)

Our Founding Collaborators (in alphabetical order):

Argonne National Lab team: Rick Stevens, Cristina Negri, Rao Kotamarthi, Venkatram Vishwanath, Arvind Ramanathan, Sam Foreman, Kyle Hippe, Troy Arcomano, Romit Maulik, Maxim Zvyagin, Alexander Brace, Yuntian Deng, Bin Zhang, Cindy Orozco Bohorquez, Austin Clyde, Bharat Kale, Danilo Perez-Rivera, Heng Ma, Carla M. Mann, Michael Irvin, J. Gregory Pauloski, Logan Ward, Valerie Hayot, Murali Emani, Zhen Xie, Diangen Lin, Maulik Shukla, Weili Nie, Josh Romero, Christian Dallago, Arash Vahdat, Chaowei Xiao, Thomas Gibbs, Ian Foster, James J. Davis, Michael E. Papka, Thomas Brettin, Anima Anandkumar

AMD: Ivo Bolsen, Micheal Schulte, Bo Begole, Angela Dalton, Steve Reinhart, Ashwin Aji, Jalal Mahmud, Mahesh Balashibramanian 

Brookhaven National Lab team: Adolfy Hoisie, Shinjae Yoo, Yihui Ren. 

Columbia University OpenFold team: Mohammed AlQuraishi, Gustaf Ahdritz 

Microsoft Research AI4Science team: Christopher Bishop, Bonnie Kruft, Max Welling, Tie-Yan Liu, Christian Bodnar, Johannes Brandsetter, Wessel Bruinsma, Chan Cao, Yuan-Jyue Chen, Peggy Dai, Patrick Garvan, Liang He, Elizabeth Heider, PiPi Hu, Peiran Jin, Fusong Ju, Yatao Li, Chang Liu, Renqian Luo, Qi Meng, Frank Noe, Tao Qin, Janwei Zhu, Bin Shao, Yu Shi, Wenlei Shi, Gregor Simm, Megan Stanley, Lixin Sun, Yue Wang, Tong Wang, Zun Wang, Lijun Wu, Yingce Xia, Leo Xia, Shufang Xie, Shuxin Zheng, Jianwei Zhu  

Oakridge National Lab team: Prassana Balaprakash, Georgia Tourass 

Princeton University: William Tang, Kyle Felker, Alexey Svyatkovskiy (Microsoft liaison) 

Rutgers University: Hang Liu

WebXT Weather team: Pete Luferenko, Divya Kumar, Jonathan Weyn, Ruixiong Zhang, Sylwester Klocek, Volodymyr Vragov 

The post Announcing the DeepSpeed4Science Initiative: Enabling large-scale scientific discovery through sophisticated AI system technologies appeared first on Microsoft Research.

Read More