Research Focus: Week of May 13, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Injecting New Knowledge into Large Language Models via Supervised Fine-Tuning 

Large language models (LLMs) have shown remarkable performance in generating text similar to that created by people, proving to be a valuable asset across various applications. However, adapting these models to incorporate new, out-of-domain knowledge remains a challenge, particularly for facts and events that occur after the model’s training knowledge cutoff date.

In a recent paper: Injecting New Knowledge into Large Language Models via Supervised Fine-Tuning, researchers from Microsoft investigate the effectiveness of supervised fine-tuning (SFT) as a method for knowledge injection in LLMs, specifically focusing on recent sporting events. They compare different dataset generation strategies—token-based and fact-based scaling—to create training data that helps the model learn new information. Their experiments on GPT-4 demonstrate that while token-based scaling can lead to improvements in Q&A accuracy, it may not provide uniform coverage of new knowledge. Fact-based scaling, on the other hand, offers a more systematic approach to ensure even coverage across all facts. The researchers present a novel dataset generation process that leads to more effective knowledge ingestion through SFT, and results show considerable performance improvements in Q&A tasks related to out-of-domain knowledge. 
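The contrast between the two dataset generation strategies can be illustrated with a minimal sketch. It is not the authors' pipeline: `make_qa` is a hypothetical stand-in for an LLM call, tokens are counted crudely by whitespace, and token-based scaling is modeled as sampling facts at random until a budget is reached, which is only one plausible reading of the summary above.

```python
import random
from collections import Counter


def make_qa(fact: str, variant: int) -> str:
    """Hypothetical stand-in for an LLM call that writes one Q&A pair about a fact."""
    return f"Q{variant}: What is known about '{fact}'? A: {fact}"


def token_based(facts, token_budget, seed=0):
    """Assumed reading of token-based scaling: keep generating pairs about
    randomly chosen facts until a total token budget is reached."""
    rng = random.Random(seed)
    data, used = [], 0
    while used < token_budget:
        fact = rng.choice(facts)
        pair = make_qa(fact, len(data))
        data.append((fact, pair))
        used += len(pair.split())  # crude whitespace token count
    return data


def fact_based(facts, pairs_per_fact):
    """Assumed reading of fact-based scaling: the same number of pairs per fact."""
    return [(f, make_qa(f, k)) for f in facts for k in range(pairs_per_fact)]


def coverage(data):
    """Count how many training examples mention each fact."""
    return Counter(fact for fact, _ in data)


facts = ["Team A won the final", "Player B scored twice", "The match went to overtime"]
print(coverage(token_based(facts, token_budget=120)))  # counts may differ per fact
print(coverage(fact_based(facts, pairs_per_fact=4)))   # every fact appears 4 times
```

Under these assumptions, the fact-based variant guarantees that every fact appears equally often, while the token-based variant only guarantees the overall dataset size.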


A Reflection on Human-Notebook Experiences in the Era of AI

Computational notebooks provide an interactive way to work with data. They have been widely used by data professionals to write code, explore data, and generate visualizations, all in one document. Previous research has revealed unique pain points around the user experience in computational notebooks. However, as AI tools like ChatGPT or Copilot have emerged, it is unclear whether these pain points have been reduced or changed, or whether new pain points have arisen. Due to the fast pace of advances in AI technology, most of the development of new AI tools has been primarily driven by technology and not by user experience.

In a recent paper: A Reflection on Human-Notebook Experiences in the Era of AI, researchers from Microsoft summarize literature on how new AI technology has impacted human-notebook interaction and human-computer interaction (HCI) paradigms, new challenges and user behavior around using AI assistants, and recent research on AI assistants in computational notebook scenarios. They outline gaps in existing literature and suggest a future focus on improving macro human-notebook experiences throughout a user’s workflow, measuring and quantifying the value of AI systems, and establishing a set of standards and best practices for AI tools.

Microsoft Research Podcast

Collaborators: Renewable energy storage with Bichlien Nguyen and David Kwabi

Dr. Bichlien Nguyen and Dr. David Kwabi explore their work in flow batteries and how machine learning can help more effectively search the vast organic chemistry space to identify compounds with properties just right for storing waterpower and other renewables.


Jacdac: Service-Based Prototyping of Embedded Systems

The traditional approach to programming embedded systems is monolithic: firmware on a microcontroller contains both application code and the drivers needed to communicate with sensors and actuators, using low-level protocols such as I2C, SPI, and RS232. In comparison, software development for the cloud has moved to a service-based development and operation paradigm: a service provides a discrete unit of functionality that can be accessed remotely by an application, or other service, but is independently managed and updated.

In a recent paper: Jacdac: Service-Based Prototyping of Embedded Systems, researchers from Microsoft propose, design, implement, and evaluate a service-based approach to prototyping embedded systems called Jacdac. Jacdac defines a service specification language, designed especially for embedded systems, along with a host of specifications for a variety of sensors and actuators. With Jacdac, each sensor/actuator in a system is paired with a low-cost microcontroller that advertises the services that represent the functionality of the underlying hardware over an efficient and low-cost single-wire bus protocol. A separate microcontroller executes the user’s application program, which is a client of the Jacdac services on the bus. 

Three Jacdac kits, comprising over twenty modules, have been produced by third-party manufacturers: KittenBot and Forward Education.


PARIKSHA: A Scalable, Democratic, Transparent Evaluation Platform for Assessing Indic Large Language Models

Evaluation of multilingual LLMs is challenging due to a variety of factors: the lack of benchmarks with sufficient linguistic diversity, contamination of LLM pre-training data with popular benchmarks, and the lack of local, cultural nuances in translated benchmarks. Hence, it is difficult to extensively evaluate LLMs in a multilingual setting, leading to a lack of fair comparisons between models and difficulties in replicating the evaluation setup used by some models. Recently, several Indic (Indian language) LLMs have been created to help build more locally and culturally relevant LLMs.

In a recent paper: PARIKSHA: A Scalable, Democratic, Transparent Evaluation Platform for Assessing Indic Large Language Models, researchers from Microsoft present an evaluation framework, which is the first comprehensive evaluation of Indic LLMs using a combination of human and LLM-based evaluation. The researchers conduct a total of 90,000 human evaluations and 50,000 LLM-based evaluations of 29 models to present leaderboards for 10 Indic languages. Pariksha provides inclusive evaluation by engaging a community of workers that represent India’s large and diverse workforce and also serves as a research platform for improving the process of evaluation. For transparency on the process, the evaluation artifacts will be released. Conducting Pariksha at regular intervals, the researchers aim to enable models to improve over time with insights and artifacts from their evaluations. 


Tinker, Tailor, Configure, Customize: The Articulation Work of Customizing AI Fairness Checklists

Many responsible AI resources, such as toolkits, playbooks, and checklists, have been developed to support AI practitioners in identifying, measuring, and mitigating potential fairness-related harms. These resources are often designed to be general purpose, in order to address a variety of use cases, domains, and deployment contexts. However, this can lead to decontextualization, where such resources lack the level of relevance or specificity needed to use them.

To understand how AI practitioners might contextualize one such resource, an AI fairness checklist, for their particular use cases, domains, and deployment contexts, researchers from Microsoft conducted a retrospective contextual inquiry with 13 AI practitioners from seven organizations. In a recent paper: Tinker, Tailor, Configure, Customize: The Articulation Work of Customizing AI Fairness Checklists, they identify how contextualizing this checklist introduces new forms of work for AI practitioners and other stakeholders, while opening up new sites for negotiation and contestation of values in AI. The researchers also identify how the contextualization process may help AI practitioners develop a shared language around AI fairness. They also identify dynamics related to ownership over this process that suggest larger issues of accountability in responsible AI work. 


MS MARCO Web Search: A Large-scale Information-rich Web Dataset with Millions of Real Click Labels

LLMs are becoming indispensable tools for many creative and information-related tasks, but they still come with limitations, including a tendency to fabricate content. State-of-the-art algorithms pair the LLM with an external, dynamically updated knowledge base to ground the LLM’s answers and provide up-to-date information. However, these techniques require large amounts of relevant, labeled training data that have not previously been publicly available. 

In a recent paper: MS MARCO Web Search: A Large-scale Information-rich Web Dataset with Millions of Real Click Labels, presented at the 2024 ACM Web Conference, researchers from Microsoft introduce a novel dataset that closely mimics real-world web document and query distribution. MS MARCO Web Search contains 10 million unique queries across 93 languages with millions of relevant labeled query-document pairs. It uses ClueWeb22’s 10 billion high-quality web pages as the document corpus and provides rich information for various kinds of downstream tasks. 

This dataset unlocks several new research directions that previous datasets cannot support well, including generic end-to-end neural indexer models, generic embedding models, and next-generation information access systems with LLMs. MS MARCO Web Search offers a retrieval benchmark with three web-scale retrieval challenge tasks, each with automatic evaluation and a leaderboard. These tasks demand innovation in both machine learning and information retrieval systems. The researchers intend for MS MARCO Web Search to lay the groundwork for future advancements in AI and systems research.


AI Case Studies for Natural Science Research with Bonnie Kruft

Among the stunning changes and disruptions driven by AI, one of the most significant is the impact on scientific discovery. In her presentation at EmTech Digital 2024, Bonnie Kruft, partner deputy director at Microsoft Research AI for Science, outlined some examples of how generative AI enables groundbreaking research in the natural sciences. Recent breakthroughs aided by AI include small molecular inhibitors for treating infectious disease, the discovery of new materials for energy storage, and new drug development. 

Catch a replay of the presentation, including a follow-up Q&A with the audience, and hear how researchers are reducing discovery times from years to months. The discussion explores safe and responsible AI practices, how large language models can work with science-based models, and what lies ahead for AI in science. 

Microsoft Research in the news


The tiny glass blocks that can preserve your data for centuries 

The Times UK | April 27, 2024

Microsoft’s Project Silica is an innovative form of long-term storage – potentially revolutionizing how important data can be preserved for future generations.


These Recyclable Circuit Boards Could Stem E-Waste 

IEEE Spectrum | May 2, 2024

New research from the University of Washington and Microsoft shows that vitrimer-based PCBs can be broken down into a gel for repeated reuse. The research stems from the Microsoft Research Climate Initiative.


Today’s AI models are impressive. Teams of them will be formidable 

The Economist | May 13, 2024

Teams of LLMs are more capable and intelligent than solitary agents because a single job can be split into many smaller, more specialized tasks, says Chi Wang, a principal researcher at Microsoft Research in Redmond, Washington.


You Only Cache Once: Decoder-Decoder Architectures for Language Models 

Microsoft Research LinkedIn | May 11, 2024

YOCO is a novel decoder-decoder architecture for LLMs, enhancing memory efficiency by caching key-value pairs only once. It slashes KV cache memory and prefilling time and makes 1M-length LLMs practical.


Peter Lee discusses new technologies that will drive the future of drug discovery 

AAPS | May 10, 2024

The president of Microsoft Research explores how new advances in technologies, such as AI and machine learning, are transforming biotechnology, in the closing plenary of the AAPS National Biotechnology Conference (NBC) on Thursday, May 16.


PKSHA develops advanced LLMs in collaboration with Microsoft Japan 

Business Wire | April 29, 2024

PKSHA Technology has developed one of the first Japanese-English LLMs in collaboration with Microsoft Japan. This development primarily focuses on boosting productivity within contact centers and corporate help desks.


BRAID fellowships include three collaborations with Microsoft Research 

Bridging Responsible AI Divides | May 2024

BRAID fellowships support individual researchers in partnership with public and private organizations to address challenges in the field of responsible AI. Among the latest fellowships are three supported by Microsoft Research.

The post Research Focus: Week of May 13, 2024 appeared first on Microsoft Research.

Microsoft at CHI 2024: Innovations in human-centered design

The ways people engage with technology, through its design and functionality, determine its utility and acceptance in everyday use, setting the stage for widespread adoption. When computing tools and services respect the diversity of people’s experiences and abilities, technology is not only functional but also universally accessible. Human-computer interaction (HCI) plays a crucial role in this process, examining how technology integrates into our daily lives and exploring ways digital tools can be shaped to meet individual needs and enhance our interactions with the world.

The ACM CHI Conference on Human Factors in Computing Systems is a premier forum that brings together researchers and experts in the field, and Microsoft is honored to support CHI 2024 as a returning sponsor. We’re pleased to announce that 33 papers by Microsoft researchers and their collaborators have been accepted this year, with four winning the Best Paper Award and seven receiving honorable mentions.

This research aims to redefine how people work, collaborate, and play using technology, with a focus on design innovation to create more personalized, engaging, and effective interactions. Several projects emphasize customizing the user experience to better meet individual needs, such as exploring the potential of large language models (LLMs) to help reduce procrastination. Others investigate ways to boost realism in virtual and mixed reality environments, using touch to create a more immersive experience. There are also studies that address the challenges of understanding how people interact with technology. These include applying psychology and cognitive science to examine the use of generative AI and social media, with the goal of using the insights to guide future research and design directions. This post highlights these projects.

Best Paper Award recipients

DynaVis: Dynamically Synthesized UI Widgets for Visualization Editing 
Priyan Vaithilingam, Elena L. Glassman, Jeevana Priya Inala, Chenglong Wang 
GUIs used for editing visualizations can overwhelm users or limit their interactions. To address this, the authors introduce DynaVis, which combines natural language interfaces with dynamically synthesized UI widgets, enabling people to initiate and refine edits using natural language.  

Generative Echo Chamber? Effects of LLM-Powered Search Systems on Diverse Information Seeking  
Nikhil Sharma, Q. Vera Liao, Ziang Xiao  
Conversational search systems powered by LLMs potentially improve on traditional search methods, yet their influence on increasing selective exposure and fostering echo chambers remains underexplored. This research suggests that LLM-driven conversational search may enhance biased information querying, particularly when the LLM’s outputs reinforce user views, emphasizing significant implications for the development and regulation of these technologies.  

Piet: Facilitating Color Authoring for Motion Graphics Video  
Xinyu Shi, Yinghou Wang, Yun Wang, Jian Zhao 
Motion graphic (MG) videos use animated visuals and color to effectively communicate complex ideas, yet existing color authoring tools are lacking. This work introduces Piet, a tool prototype that offers an interactive palette and support for quick theme changes and controlled focus, significantly streamlining the color design process.

The Metacognitive Demands and Opportunities of Generative AI 
Lev Tankelevitch, Viktor Kewenig, Auste Simkute, Ava Elizabeth Scott, Advait Sarkar, Abigail Sellen, Sean Rintel 
Generative AI systems offer unprecedented opportunities for transforming professional and personal work, yet they present challenges around prompting, evaluating and relying on outputs, and optimizing workflows. This paper shows that metacognition—the psychological ability to monitor and control one’s thoughts and behavior—offers a valuable lens through which to understand and design for these usability challenges.  


Honorable Mentions

Big or Small, It’s All in Your Head: Visuo-Haptic Illusion of Size-Change Using Finger-Repositioning
Myung Jin Kim, Eyal Ofek, Michel Pahud, Mike J. Sinclair, Andrea Bianchi 
This research introduces a fixed-sized VR controller that uses finger repositioning to create a visuo-haptic illusion of dynamic size changes in handheld virtual objects, allowing users to perceive virtual objects as significantly smaller or larger than the actual device. 

LLMR: Real-time Prompting of Interactive Worlds Using Large Language Models 
Fernanda De La Torre, Cathy Mengying Fang, Han Huang, Andrzej Banburski-Fahey, Judith Amores, Jaron Lanier 
Large Language Model for Mixed Reality (LLMR) is a framework for the real-time creation and modification of interactive mixed reality experiences using LLMs. It uses novel strategies to tackle difficult cases where ideal training data is scarce or where the design goal requires the synthesis of internal dynamics, intuitive analysis, or advanced interactivity. 

Observer Effect in Social Media Use 
Koustuv Saha, Pranshu Gupta, Gloria Mark, Emre Kiciman, Munmun De Choudhury 
This work investigates the observer effect in behavioral assessments on social media use. The observer effect is a phenomenon in which individuals alter their behavior due to awareness of being monitored. Conducted over an average of 82 months (about 7 years) retrospectively and five months prospectively using Facebook data, the study found that deviations in expected behavior and language post-enrollment in the study reflected individual psychological traits. The authors recommend ways to mitigate the observer effect in these scenarios.

Reading Between the Lines: Modeling User Behavior and Costs in AI-Assisted Programming 
Hussein Mozannar, Gagan Bansal, Adam Fourney, Eric Horvitz 
By investigating how developers use GitHub Copilot, the authors created CUPS, a taxonomy of programmer activities during system interaction. This approach not only elucidates interaction patterns and inefficiencies but can also drive more effective metrics and UI design for code-recommendation systems with the goal of improving programmer productivity. 

SharedNeRF: Leveraging Photorealistic and View-dependent Rendering for Real-time and Remote Collaboration 
Mose Sakashita, Bala Kumaravel, Nicolai Marquardt, Andrew D. Wilson 
SharedNeRF, a system for synchronous remote collaboration, utilizes neural radiance field (NeRF) technology to provide photorealistic, viewpoint-specific renderings that are seamlessly integrated with point clouds to capture dynamic movements and changes in a shared space. A preliminary study demonstrated its effectiveness, as participants used this high-fidelity, multi-perspective visualization to successfully complete a flower arrangement task. 

Understanding the Role of Large Language Models in Personalizing and Scaffolding Strategies to Combat Academic Procrastination 
Ananya Bhattacharjee, Yuchen Zeng, Sarah Yi Xu, Dana Kulzhabayeva, Minyi Ma, Rachel Kornfield, Syed Ishtiaque Ahmed, Alex Mariakakis, Mary P. Czerwinski, Anastasia Kuzminykh, Michael Liut, Joseph Jay Williams 
In this study, the authors explore the potential of LLMs for customizing academic procrastination interventions, employing a technology probe to generate personalized advice. Their findings emphasize the need for LLMs to offer structured, deadline-oriented advice and adaptive questioning techniques, providing key design insights for LLM-based tools while highlighting cautions against their use for therapeutic guidance.

Where Are We So Far? Understanding Data Storytelling Tools from the Perspective of Human-AI Collaboration 
Haotian Li, Yun Wang, Huamin Qu
This paper evaluates data storytelling tools using a dual framework to analyze the stages of the storytelling workflow—analysis, planning, implementation, communication—and the roles of humans and AI in each stage, such as creators, assistants, optimizers, and reviewers. The study identifies common collaboration patterns in existing tools, summarizes lessons from these patterns, and highlights future research opportunities for human-AI collaboration in data storytelling.


Learn more about our work and contributions to CHI 2024, including our full list of publications, on our conference webpage.

The post Microsoft at CHI 2024: Innovations in human-centered design appeared first on Microsoft Research.

RASCAL: Novel robotics for scalable and highly available automated storage and retrieval

This research paper was presented at the 41st IEEE International Conference on Robotics and Automation (ICRA 2024), the premier international forum for robotics research.

Over the past decade, robotics has revolutionized numerous industries that rely on storage systems, such as manufacturing and warehousing. In these contexts, robotics streamlines operations and increases efficiency, and automated storage and retrieval systems (ASRS) are at the heart of this technological shift, exemplifying the transition to smarter, computer-controlled logistics solutions. These systems quickly move items from storage to fulfillment stations, helping to increase speed and accuracy in the overall process. Yet despite these advances, current ASRS, whether rail-based, fixed, or free-roaming, continue to face challenges, often sacrificing scalability and availability for higher throughput capacity. For instance, the use of fixed robots in traditional tape storage libraries, typically used for archival storage, can lead to availability limitations, as the robots cannot pass each other, and a single robot failure can restrict access to a significant portion of the library.

Our paper, published at ICRA 2024, introduces RASCAL: A Scalable, High-redundancy Robot for Automated Storage and Retrieval Systems, which addresses these concerns. RASCAL is an untethered robot that improves the efficiency of vertical storage systems by operating across evenly spaced, parallel shelves and horizontal rails. Designed to maximize scalability and redundancy, it handles the storage and retrieval of small objects. RASCAL was inspired by the challenges of managing archival storage media in datacenters, and it’s the key component of Project Silica’s storage and retrieval system. However, RASCAL’s modularity enables it to be used in other scenarios as well. 

An innovative approach to archival storage

RASCAL’s design is based on four key principles:

  • Addressability: Any robot can access any item stored on the shelves. 
  • Scalability: The system can adjust retrieval capacity and storage space by adding or removing robots and shelving with negligible downtime.  
  • Availability: A single robot failure minimally impacts access to items and routing, and it does not obstruct the operation of other robots.  
  • Serviceability: Robots can easily be added or removed from the rails without the need for special training.   

RASCAL’s motion system supports horizontal and vertical movement along storage panels assembled from contiguous storage racks. The parallel rail system enables independent and flexible movement. These rails are designed to be passive—functioning without the need for active power or energy sources, relying instead on their physical structure and positioning to guide and support the robot’s movement along the storage panels. The robot can travel along and between these rails using various pathways to reach a given item. Video 1 shows how RASCAL operates multiple robots on a single storage panel.

Video 1. Multiple robots in action

RASCAL utilizes a special rail geometry, allowing the robot to passively latch onto the rails with opposing wheels mounted on each end, as illustrated in Figure 1. This design ensures that the robot is securely held in place by gravity alone. The passive nature of this latching mechanism simplifies the process of adding or removing robots from the rails, as it does not require any tools or power.

Figure 1. The RASCAL prototype in a Silica library.

The robot features two rotating assemblies known as wings, each equipped with wheels that allow it to move horizontally. The wings rotate in a choreographed sequence to enable ascent and descent. RASCAL climbs by unlatching one wing from its current rail while remaining attached to the other. It then rotates and secures its free wing to a new rail either two levels up or down. This is shown in Video 2.
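The climbing sequence can be modeled as simple bookkeeping over rail indices. The sketch below is an illustration only, not the robot's control software: the `Robot` class and its fields are hypothetical, and it assumes the two wings always sit on adjacent rails, so that the free wing swings past the latched wing and lands two levels away from where it started.

```python
from dataclasses import dataclass


@dataclass
class Robot:
    """Hypothetical model: each wing holds one rail, identified by its index."""
    lower: int  # rail index held by the lower wing
    upper: int  # rail index held by the upper wing

    def climb(self, direction: int, n_rails: int) -> None:
        """Perform one climbing step; direction is +1 (up) or -1 (down)."""
        if direction not in (1, -1):
            raise ValueError("direction must be +1 or -1")
        if direction == 1:
            # The lower wing unlatches, rotates over the latched upper wing,
            # and secures itself two rails above its previous position.
            target = self.lower + 2
            if target >= n_rails:
                raise ValueError("no rail available above")
            self.lower, self.upper = self.upper, target
        else:
            # The upper wing unlatches and lands two rails below its position.
            target = self.upper - 2
            if target < 0:
                raise ValueError("no rail available below")
            self.lower, self.upper = target, self.lower


robot = Robot(lower=0, upper=1)
robot.climb(+1, n_rails=6)
print(robot)  # Robot(lower=1, upper=2)
```

In this model, each step moves the robot as a whole by one rail while one wing stays latched throughout.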

Video 2. RASCAL’s novel climbing maneuver.
Video 3. RASCAL performing a pick operation.

Video 3 demonstrates RASCAL’s item-selection system, or picker interface, which is designed to handle various robotic tool attachments for precise pick-and-place operations. This interface can rotate in alternating directions during climbs, ensuring that the robotic tool attachment, or end effector, remains oriented towards the shelving while stationary, preventing the cables from tangling.

Advancing robotics and automation

As digital economies grow, the need for efficient storage and retrieval systems becomes increasingly urgent. Breakthroughs in robotics technology are poised to drive productivity, efficiency, and innovation across numerous industries. Developments like RASCAL, with its flexible design and advanced capabilities, are leading the way for the next generation of robotics and automation.

The post RASCAL: Novel robotics for scalable and highly available automated storage and retrieval appeared first on Microsoft Research.

Enhanced autoscaling with VASIM: Vertical Autoscaling Simulator Toolkit

This research was presented as a demonstration at the 40th IEEE International Conference on Data Engineering (ICDE 2024), one of the premier conferences on data and information engineering.

Since the inception of cloud computing, autoscaling has been an essential technique for optimizing resources and performance. By dynamically adjusting the number of computing resources allocated to a service based on current demand, autoscaling ensures that the service can handle the load efficiently while optimizing costs. However, developing and fine-tuning autoscaling algorithms, which govern this process, present significant challenges. The complexity and cost associated with testing these algorithms can lead to inefficient resource management and impede the development of more effective autoscaling strategies.

In our paper, “VASIM: Vertical Autoscaling Simulator Toolkit,” presented at ICDE 2024, we introduce a tool designed to address the complexities involved in assessing autoscaling algorithms. While existing simulation tools cover a range of capabilities, such as energy efficiency and fault tolerance, VASIM stands out by evaluating the critical recommender component within the algorithm and suggesting optimal resource scaling actions based on usage data, balancing performance and cost. This enables developers to iterate more rapidly, enhancing algorithmic performance, and improving resource efficiency and cost savings.

VASIM’s user-friendly interface simplifies the evaluation of autoscaling policies, as illustrated in Figure 1. The first steps are to upload historical data and define the autoscaling policies, including the algorithm and its parameters, shown in the left panel. The Simulation Run feature enables the modification of algorithm parameters, imported via a configuration file, and the execution of simulations based on the selected trace. A results screen displays the CPU limits determined by the selected policies as well as the actual CPU usage as constrained by these limits. Additionally, VASIM provides fundamental metrics like throttling incidents, number of scaling operations, and amount of unused capacity, or slack, for the current simulation.

Figure 1. The VASIM user interface comprises a run simulation pane on the left and a results pane on the right.
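To make the reported metrics concrete, here is a minimal sketch of how slack, throttling incidents, and scaling operations could be derived from a usage trace and the limits a policy produced. This is not VASIM's code: the function name, the dictionary keys, and the per-time-step representation (CPU cores per step) are assumptions made for illustration.

```python
def simulation_metrics(usage, limits):
    """Compute simple autoscaling metrics from per-step CPU usage and limits.

    usage[i] is the CPU demanded at step i and limits[i] is the limit in
    effect at step i (both in cores). The names here are illustrative.
    """
    if not usage or len(usage) != len(limits):
        raise ValueError("usage and limits must be non-empty and equally long")
    # Slack: capacity allocated but not used.
    slack = [max(l - u, 0.0) for u, l in zip(usage, limits)]
    # Insufficient CPU: demand that exceeded the limit.
    insufficient = [max(u - l, 0.0) for u, l in zip(usage, limits)]
    return {
        "sum_slack": sum(slack),
        "avg_slack": sum(slack) / len(slack),
        "sum_insufficient_cpu": sum(insufficient),
        # A throttling incident is any step where demand exceeded the limit.
        "throttling_incidents": sum(1 for x in insufficient if x > 0),
        # A scaling operation is any change in the limit between steps.
        "num_scalings": sum(1 for a, b in zip(limits, limits[1:]) if a != b),
    }


metrics = simulation_metrics(
    usage=[1.0, 2.5, 3.0, 1.5],
    limits=[2.0, 2.0, 4.0, 4.0],
)
print(metrics)
```

For this toy trace, the limit is raised once, demand exceeds the limit at one step, and the total slack is 4.5 cores summed over the four steps.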

VASIM achieves several important goals:

Resource efficiency and cost reduction. VASIM reduces costs by removing the need to test scaling operations in real time, which would be resource intensive. This enables developers to adjust algorithms iteratively in a controlled, cost-efficient environment, accelerating development cycles. Because the tool allows users to upload CPU performance history and algorithm parameters, it delivers the results of scaling operations across the entire workload in minutes rather than hours.

Multi-objective optimization. It’s challenging to develop an autoscaling method that handles conflicting parameters. VASIM makes this easier by applying Pareto optimization techniques, helping developers to find a balance among key metrics. Figure 2 depicts scatter plots for two metrics: average slack and average insufficient CPU. It also shows three optimization objectives: the optimal amount of slack, throttling, and number of scaling operations.

[On the left] A graph that plots the average slack on the Y axis and the average insufficient cpu on the X axis. It shows that the more average insufficient cpu decreases, the more average slack increases. There are six points in red that are pareto frontier points, all on the very edge of the graph but not too close to each other, showing some possible choices of configuration. 

[On the right] A 3D scatter plot displays the total slack on the X axis, cpu total throttle on the Y axis, and the amount of scalings in Z axis. It shows that as you aim to lower total slack and throttle, the amount of scalings increases.
Figure 2. The 2D diagram on the left shows a scatter plot of tuning with Pareto points. The 3D graph on the right shows a scatter plot with the three objectives.
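
To make the idea of Pareto points concrete, here is a minimal sketch, not VASIM's own implementation. It assumes each tuning run is summarized by a hypothetical pair (average insufficient CPU, average slack), with lower being better on both axes, and keeps only the runs that no other run beats on both metrics at once.

```python
from typing import List, Tuple

# Hypothetical summary of one tuning run:
# (avg_insufficient_cpu, avg_slack); both metrics are to be minimized.
Point = Tuple[float, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points that are not dominated by any other point."""
    front = []
    for p in points:
        # q dominates p if q is no worse on both metrics and is a different point.
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# Illustrative (made-up) results from six parameter settings.
runs = [(0.0, 9.0), (0.5, 6.0), (1.0, 6.5), (2.0, 3.0), (3.0, 3.5), (4.0, 1.0)]
print(pareto_front(runs))
```

Each point that survives represents a different tradeoff: moving along the front lowers one metric only by raising the other, which is why the choice among them is left to the developer.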

Recommender algorithm testing. VASIM simplifies the process of testing and evaluating recommendation algorithms across diverse workloads. With all tuning jobs running in parallel, computation occurs more quickly, allowing users to efficiently adjust their recommender parameters as necessary. To assess the algorithm’s generalizability, we ran VASIM against 11 available open cluster traces for benchmarking and internal product workload traces. This enabled us to evaluate the algorithms’ robustness across a variety of workload types, including cyclical, bursty, and monotonic variations, demonstrating their reliability across different scenarios.

Versatility and adaptability. VASIM provides users with the flexibility to modify components, experiment with recommendation strategies, and evaluate the impact of changes in a controlled and customizable environment. Figure 3 shows the results of a simulation run on the same algorithm and historical performance data but with different parameters. This versatility ensures that infrastructure engineers can tailor the system to meet their needs, enhancing the overall effectiveness of their autoscaling strategies.

Figure 3. These graphs show VASIM running an identical algorithm on the same historical data but with varying parameters, affecting slack, throttling, and the frequency of scaling events. The objective is to maintain a minimal gap between the peak and the lowest resource utilization levels—the top of the bottom line and the bottom of the top line, respectively. The goal is also to reduce the space between the response lag indicated by the trailing edges to the left of the lines. Simultaneously, it’s important to minimize the occurrence of scaling events to prevent disruptions in workload execution.

Optimizing scalability and costs in Kubernetes environments

Our research on vertically autoscaling monolithic applications with a container-as-a-service algorithm helped us to better understand the tradeoffs between cost and availability that different algorithm variations introduce. Because VASIM is similar to standard autoscaling architecture (as in the Kubernetes Vertical Pod Autoscaler [VPA]), it allows us to test autoscaling algorithms for pods, applications, and virtual machine (VM) capacity. This is possible because these systems share similar components, including resource updaters, controllers, and recommenders. Despite differences in specific systems, their underlying architectures are sufficiently similar, enabling VASIM to effectively mimic them, as shown in Figure 4.

 
The image depicts how VASIM works. It has a Simulation Controller in the middle, which asks Recommender for decisions using one of the algorithms, Simulation Scaler with a scale function, Cloud State Provider to get traces and use them for time simulation, Analyzer to get metrics after each run. Params Tuning Controller tells Simulation Controller to run for every tuning setting and calls Analyzer to get pareto front to find tradeoff between multiple goals after multiple configs were evaluated. Recommender also needs data from Cloud State Provider to access historical data.
Figure 4. VASIM architecture mimics the main components of general autoscaling architectures, allowing users to parametrize those modules to fit their specific needs.
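
The interplay of these components can be sketched in a few lines. The following is an illustrative toy loop, not VASIM's actual API: the function names, the `peak_plus_buffer` recommender, and the metric definitions (slack as unused headroom, insufficient CPU as usage above the limit) are all assumptions made for the example. The trace plays the role of the cloud state provider, the recommender is consulted every `lag` steps, and the final dictionary stands in for the analyzer.

```python
from typing import Callable, Dict, List, Sequence


def peak_plus_buffer(history: Sequence[float], buffer: float = 1.0) -> float:
    """Hypothetical recommender: recent peak CPU usage plus a fixed buffer."""
    return max(history) + buffer


def simulate(
    trace: List[float],
    recommender: Callable[[Sequence[float]], float],
    initial_limit: float,
    lag: int,
    window: int,
) -> Dict[str, float]:
    """Replay a CPU trace, rescaling the limit every `lag` steps."""
    limit = initial_limit
    limits: List[float] = []
    scalings = 0
    for t in range(len(trace)):
        # The recommender only sees samples that precede the current step.
        if t > 0 and t % lag == 0:
            new_limit = recommender(trace[max(0, t - window):t])
            if new_limit != limit:
                limit = new_limit  # the "scaler" applies the decision
                scalings += 1
        limits.append(limit)

    # "Analyzer": assumed metric definitions, for illustration only.
    slack = [max(l - u, 0.0) for u, l in zip(trace, limits)]
    insufficient = [max(u - l, 0.0) for u, l in zip(trace, limits)]
    return {
        "avg_slack": sum(slack) / len(trace),
        "avg_insufficient_cpu": sum(insufficient) / len(trace),
        "num_scalings": scalings,
        "num_insufficient": sum(1 for x in insufficient if x > 0),
    }


metrics = simulate([2.0, 2.0, 4.0, 4.0, 1.0, 1.0], peak_plus_buffer,
                   initial_limit=3.0, lag=2, window=2)
print(metrics)
```

Running the same trace with different `lag`, `window`, or buffer values yields different metric tuples, which is the raw material a tuning controller would feed into a Pareto analysis.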
 

Implications and looking ahead

Looking forward, we plan to broaden the scope of VASIM’s support beyond just CPUs to include a wide range of resources, such as memory, disk I/O, and network bandwidth. This expansion will provide future users with a comprehensive understanding of system performance and enable them to make more accurate decisions regarding system management and resource optimization. Additionally, a deeper understanding of system performance will help inform proactive optimization strategies focused on maximizing system efficiency and performance.

The post Enhanced autoscaling with VASIM: Vertical Autoscaling Simulator Toolkit appeared first on Microsoft Research.

MatterSim: A deep-learning model for materials under real-world conditions

The image features a complex network of interconnected nodes with a molecular structure, illuminated in blue against a dark background.

In the quest for groundbreaking materials crucial to nanoelectronics, energy storage, and healthcare, a critical challenge looms: predicting a material’s properties before it is even created. This is no small feat, with any combination of 118 elements in the periodic table, and the range of temperatures and pressures under which materials are synthesized and operated. These factors drastically affect atomic interactions within materials, making accurate property prediction and behavior simulation exceedingly demanding.

Here at Microsoft Research, we developed MatterSim, a deep-learning model for accurate and efficient materials simulation and property prediction over a broad range of elements, temperatures, and pressures to enable in silico materials design. MatterSim employs deep learning to understand atomic interactions from the very fundamental principles of quantum mechanics, across a comprehensive spectrum of elements and conditions: from 0 to 5,000 Kelvin (K), and from standard atmospheric pressure to 10,000,000 atmospheres. In our experiments, MatterSim efficiently handles simulations for a variety of materials, including metals, oxides, sulfides, halides, and their various states such as crystals, amorphous solids, and liquids. Additionally, it offers customization options for intricate prediction tasks by incorporating user-provided data.

Figure 1: There are two subfigures. On the left-hand side, atomic structures of 12 materials belonging to metals, oxides, sulfides, halides, and organic molecules are shown. On the right-hand side, the temperature and pressure ranges of materials' application and synthesis are plotted.
Figure 1. MatterSim can model materials properties and behaviors under realistic temperature and pressure conditions for wide ranges of applications.

Simulating materials under realistic conditions across the periodic table

MatterSim’s learning foundation is built on large-scale synthetic data, generated through a blend of active learning, generative models, and molecular dynamics simulations. This data generation strategy ensures extensive coverage of material space, enabling the model to predict energies, atomic forces, and stresses. It serves as a machine-learning force field with a level of accuracy compatible with first-principles predictions. Notably, MatterSim achieves a 10-fold increase in accuracy for material property predictions at finite temperatures and pressures when compared to previous state-of-the-art models. Our research demonstrates its proficiency in simulating a vast array of material properties, including thermal, mechanical, and transport properties, and shows that it can even predict phase diagrams.

Figure 2: There are three subfigures. The panel on the left shows a comparison of the highest phonon frequency predicted by MatterSim and by first-principles methods. The two values for each material are very close, leading to a nearly straight line in the parity plot. The middle panel depicts the same relation for the free energies of around 50 materials, comparing MatterSim and first-principles results. The right panel shows the phase diagram of MgO predicted using MatterSim. The x-axis denotes the temperature and the y-axis denotes the pressure. The pressure range of MgO’s B1 phase is below 500 GPa, and this range decreases as temperature increases. The blue lines show the prediction from MatterSim, which fits well with the shaded region representing experimental measurements.
Figure 2. MatterSim achieves high accuracy in predicting mechanical properties, vibrational properties, and phase diagrams of materials, comparable to quantum mechanics and experimental measurements. The figure shows the comparison between the predicted properties and the experimentally measured results.

Adapting to complex design tasks

While trained on broad synthetic datasets, MatterSim is also adaptable for specific design requirements by incorporating additional data. The model utilizes active learning and fine-tuning to customize predictions with high data efficiency. For example, simulating water properties, a task seemingly straightforward but computationally intensive, is significantly optimized with MatterSim’s adaptive capability. To match experimental accuracy, the model requires only 3% of the data needed by traditional methods; reaching the same accuracy would otherwise require 30 times more resources for a specialized model and exponentially more for first-principles methods.

Figure 3: There are two panels in this figure. The right panel shows the structure of Li2B12H12, a complex material system used for solid-state batteries. This system is used to benchmark the performance of MatterSim. The left panels compare the number of data points needed to train a model from scratch with the number needed to customize MatterSim to achieve the same accuracy. MatterSim requires 3% and 10% of the data for the two tasks compared with training from scratch.
Figure 3. MatterSim achieves high data efficiency, with 90%-97% data savings for complex simulation tasks.

Bridging the gap between atomistic models and real-world measurements

Translating material properties from atomic structures is a complex task, often too intricate for current methods based on statistics, such as molecular dynamics. MatterSim addresses this by mapping these relationships directly through machine learning. It incorporates custom adaptor modules that refine the model to predict material properties from structural data, eliminating the need for intricate simulations. Benchmarked against MatBench, a renowned material property prediction benchmark set, MatterSim demonstrates significant accuracy improvement and outperforms all specialized property-specific models, showcasing its robust capability in direct material property prediction from domain-specific data.

Looking ahead 

As MatterSim research advances, the emphasis is on experimental validation to reinforce its potential role in pivotal sectors, including the design of catalysts for sustainability, energy storage breakthroughs, and nanotechnology advancements. The planned integration of MatterSim with generative AI models and reinforcement learning heralds a new era in the systematic pursuit of novel materials. This synergy is expected to revolutionize the field, streamlining guided creation of materials tailored for diverse applications ranging from semiconductor technologies to biomedical engineering. Such progress promises to expedite material development and bolster sustainable industrial practices, thereby fostering technological advancements that will benefit society. 

The post MatterSim: A deep-learning model for materials under real-world conditions appeared first on Microsoft Research.

LLM profiling guides KV cache optimization

This research paper was presented at the 12th International Conference on Learning Representations (ICLR 2024), the premier conference dedicated to the advancement of deep learning.

White ICLR logo to the left of the first page of the accepted paper, “Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs” on a purple background.

Large language models (LLMs) rely on complex internal mechanisms that require more memory than what is typically available to operate on standard devices. One such mechanism is the key-value (KV) cache, which stores and retrieves previously computed data, helping the model generate responses quickly without needing to recalculate information it has already processed. This method uses a substantial amount of memory because it keeps a large amount of this data readily accessible to enhance the model’s speed and efficiency. Consequently, the KV cache can become prohibitively large as the complexity of the tasks increases, sometimes requiring up to 320 GB for a single operation. To address this, we developed FastGen, a novel method aimed at reducing the memory demands for LLMs.

In our paper, “Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs,” presented at ICLR 2024, we describe how FastGen optimizes the way LLMs store and access data, potentially cutting memory use by half while preserving their efficiency. This approach represents a significant step toward making sophisticated AI tools more accessible and affordable for broader applications. We are honored to share that this paper has been awarded an Honorable Mention for the Outstanding Paper Award.

Observations of the KV cache

The development of FastGen is underpinned by our observations of how the KV cache functions. We first observed that not all the data in the KV cache is needed for LLMs to complete their required tasks, as shown in Figure 1. By providing the KV cache with the mechanism to discard unnecessary data, it is possible to significantly cut memory use. For example, some LLM modules don’t require broad contexts to process input. For this, it is possible to construct a KV cache that removes data that contains less important long-range contexts, such as several sentences or paragraphs. Also, some LLM modules primarily attend only to special tokens, such as punctuation, for which it is possible to create a KV cache that retains only those tokens. Finally, some LLM modules broadly need all tokens, and for these we can employ the standard KV cache and store all words.  

Another key observation in our study is that attention modules in different layers and positions in the LLM behave differently and need different preferences for their KV cache, as shown on the right in Figure 1. 

Figure 1: These graphs depict the different structures of the KV cache. The graph on the left contains common structures. The circle graphs on the right contain compositions of three modules that are in the same layer, but the way they store data is different.

FastGen accounts for the diversity of KV cache structures

Because different KV caches have different structures, they need to be handled differently. We based the development of the FastGen algorithm on our observations, enabling it to categorize and optimize the data that is stored in a given KV cache. FastGen first analyzes the specific behaviors of different modules to understand their structures, a method called profiling. It then uses the results to adjust how data is stored in real-time, making the process more efficient. Our tests show that FastGen can reduce the amount of memory by 50% without sacrificing quality. Additional experiments, discussed in detail in our paper, confirm that the profiling process is crucial and significantly improves the efficiency of the KV cache.  
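
The profile-then-compress idea can be illustrated with a simplified sketch. This is not the FastGen implementation: the policy names, the threshold value, the window size, and the set of special tokens below are all assumptions chosen for the example. For each attention module, the sketch tries cache policies from cheapest to most expensive and picks the first one whose retained tokens account for at least a target share of that module's attention.

```python
from typing import FrozenSet, List, Set

# Hypothetical set of "special" tokens a module might attend to.
SPECIAL: FrozenSet[str] = frozenset({"<s>", ".", ","})

# Candidate policies, ordered from smallest to largest cache footprint.
POLICIES = ("special", "special+local", "full")


def kept_positions(policy: str, tokens: List[str], window: int) -> Set[int]:
    """Token positions whose keys/values a policy keeps in the cache."""
    n = len(tokens)
    special = {i for i, tok in enumerate(tokens) if tok in SPECIAL}
    if policy == "special":
        return special
    if policy == "special+local":
        return special | set(range(max(0, n - window), n))
    if policy == "full":
        return set(range(n))
    raise ValueError(f"unknown policy: {policy}")


def profile_module(attn: List[float], tokens: List[str],
                   threshold: float = 0.95, window: int = 2) -> str:
    """Pick the cheapest policy that retains `threshold` of the attention mass."""
    for policy in POLICIES:
        kept = kept_positions(policy, tokens, window)
        if sum(attn[i] for i in kept) >= threshold:
            return policy
    return "full"


tokens = ["<s>", "The", "cat", "sat", "."]
# Made-up attention weights of three modules over the same prompt.
print(profile_module([0.60, 0.01, 0.01, 0.01, 0.37], tokens))
print(profile_module([0.10, 0.01, 0.02, 0.47, 0.40], tokens))
print(profile_module([0.20, 0.20, 0.20, 0.20, 0.20], tokens))
```

Because the decision is made per module from observed attention, modules that only look at punctuation get a very small cache, while modules that attend broadly keep the standard one.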

The broader picture

Fueled by unprecedented advances in data handling and computational capabilities, LLM pretraining has emerged as a cornerstone of deep learning, transforming natural language processing tasks and continuously challenging our understanding of learning and cognition.

However, greater capabilities can bring challenges. As models scale larger, customizing them for specific tasks can become more resource-intensive. At Microsoft Research, we are exploring different approaches to more efficient model editing. A critical strategy involves targeted model profiling, which identifies essential components of a model that align with predefined goals. This profiling informs precise model modifications, optimizing resource use and effectiveness.

The two research projects we are presenting at ICLR 2024 support these goals. Both adopt the profile-then-edit paradigm to address different problems. FastGen reduces memory consumption. Our related work, Post-hoc Attention Steering for LLMs (PASTA), focuses on better controllability. These approaches are designed to be resource-efficient, as they do not require tuning or back propagation. Looking ahead, our goal is to further develop these techniques to improve the resource-efficiency of LLM applications, making them more accessible to a wider audience.  

The post LLM profiling guides KV cache optimization appeared first on Microsoft Research.

LoftQ: Reimagining LLM fine-tuning with smarter initialization

This research paper was presented at the 12th International Conference on Learning Representations (ICLR 2024), the premier conference dedicated to the advancement of deep learning.

Teal background with ICLR logo on the right (head and face) with LoftQ paper on the right.

Large language models (LLMs) use extensive datasets and advanced algorithms to generate nuanced, context-sensitive content. However, their development requires substantial computational resources. To address this, we developed LoftQ, an innovative technique that streamlines the fine-tuning process—which is used to adapt pre-trained language models to perform well in specialized applications, such as analyzing medical documents. During fine-tuning, the model undergoes additional training on a smaller, task-specific dataset. This results in improved performance, such as more accurate predictions, better understanding of domain-specific language, and more relevant responses in the context of the specialized area.

LoftQ’s strength lies in its ability to combine quantization and adaptive initialization during fine-tuning. Quantization reduces the precision of model parameters, lowering memory and computation needs. This not only accelerates processing but also reduces power consumption. Adaptive initialization closely aligns the model’s parameters to its optimal pre-trained state, preserving its capabilities while minimizing resource use. Our paper, “LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models,” presented at ICLR 2024, details how this method can help make AI technologies more efficient and sustainable. 

How LoftQ works 

LoftQ builds on the principles of LoRA and QLoRA. LoRA is a method that greatly reduces the number of parameters needed for training, decreasing the memory requirements for fine-tuning. QLoRA is a fine-tuning approach that uses 4-bit quantized, frozen weights and low-rank adapters, significantly reducing memory requirements while maintaining high performance. This is illustrated in Table 1, which shows the amount of memory needed for fine-tuning an LLM with 7 billion parameters as well as the memory requirements for LoRA and QLoRA. LoRA achieves a fourfold reduction in memory usage, and QLoRA further reduces it by twofold.

Table 1: This table shows the GPU memory usage for a 7-billion parameter LLM with the following configurations: full fine-tuning on the left, LoRA in the middle, and QLoRA on the right.

Unlike LoRA, QLoRA comes with a tradeoff, where some quality of the pretrained model is sacrificed due to the quantization of weights. LoftQ recognizes this and optimizes the initialization of quantization and low-rank adaptation matrices. That is, LoftQ seeks to identify a combination of a quantized matrix and a low rank matrix such that their sum closely approximates the original pretrained weight. This is done for every matrix that would be adapted in the model.

The LoftQ algorithm alternates between two primary steps. First it quantizes (simplifies) the difference between the pretrained weight and the current low-rank approximation, and then it finds the best low-rank factors that approximate the residual between the pretrained weight and the quantized weight. The process repeats for a few steps. This method enables the fine-tuning process to start from a more effective initial state, which preserves accuracy while using less computational power and much more simplified weights.
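
The alternating procedure can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the released LoftQ code: `uniform_quantize` is a simple stand-in quantizer chosen for the example, and `loftq_init`, its arguments, and the toy matrix sizes are hypothetical names and values. The low-rank step uses a truncated SVD, which gives the best rank-r approximation of a matrix in the Frobenius norm.

```python
import numpy as np


def uniform_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Stand-in quantizer: snap values to 2**bits evenly spaced levels
    and return the dequantized result (an assumption for illustration)."""
    lo, hi = w.min(), w.max()
    if hi == lo:
        return w.copy()
    scale = (hi - lo) / (2 ** bits - 1)
    return np.round((w - lo) / scale) * scale + lo


def loftq_init(w: np.ndarray, rank: int, bits: int, steps: int = 5):
    """Alternate quantization and low-rank fitting so that q + a @ b.T ~ w."""
    a = np.zeros((w.shape[0], rank))
    b = np.zeros((w.shape[1], rank))
    for _ in range(steps):
        # Step 1: quantize what the low-rank factors do not yet explain.
        q = uniform_quantize(w - a @ b.T, bits)
        # Step 2: best rank-`rank` fit of the residual, via truncated SVD.
        u, s, vt = np.linalg.svd(w - q, full_matrices=False)
        root = np.sqrt(s[:rank])
        a = u[:, :rank] * root
        b = vt[:rank].T * root
    return q, a, b


rng = np.random.default_rng(0)
w = rng.normal(size=(16, 12))          # toy "pretrained" weight matrix
q, a, b = loftq_init(w, rank=4, bits=2)
print("quantization only:", np.linalg.norm(w - uniform_quantize(w, 2)))
print("quantized + low-rank:", np.linalg.norm(w - q - a @ b.T))
```

Fine-tuning would then keep `q` frozen and train only `a` and `b`, starting from factors that already compensate for part of the quantization error rather than from zero.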

LoftQ requires a one-time setup to simplify and prepare these weights, allowing a fixed portion of the model’s parameters (e.g., 5 percent) to be adjusted. Once established, this configuration can be repeatedly applied as the model transitions between various tasks and settings.

Evaluating LoftQ 

Tests using various types of LLMs, including those with different combinations of encoding and decoding capabilities such as Llama-2, show that models initialized with LoftQ consistently achieve strong performance, often matching or surpassing those configured with QLoRA.

In practical terms, comparing the performance of LoftQ and QLoRA on different tasks using the Llama-2 model family yields distinct results, which are highlighted in Table 2. For the WikiText-2 dataset, which measures the model’s perplexity (lower is better), and the GSM8K dataset, which tests the model’s ability to solve basic math problems (higher is better), we demonstrate the effectiveness of varying degrees of weight simplification—averaging 3, 2.5, and 2.25 bits per weight. Our paper discusses the results in more detail. 

Table 2. This table compares LoftQ and QLoRA during the fine-tuning of two Llama-2 models on the Wikitext-2 and GSM8K datasets.

Implications and looking forward 

LoftQ promises to advance the field of AI by accelerating research and facilitating the creation of cutting-edge tools while supporting sustainable development. While initially focused on LLMs, LoftQ’s flexible design also supports fine-tuning in other types of models, such as those for vision and speech technologies. As our research progresses, we expect to make further enhancements that will boost performance on downstream tasks. We hope these improvements will lead to broader adoption across various AI applications. We’re excited about the breadth of this technology’s applicability and encourage the AI community to explore its benefits. LoftQ is available as open source through the Hugging Face PEFT library.

The post LoftQ: Reimagining LLM fine-tuning with smarter initialization appeared first on Microsoft Research.

Abstracts: May 6, 2024

Stylized microphone and sound waves illustration.

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Senior Principal Researcher Michel Galley joins host Gretchen Huizinga to discuss “MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts,” which was accepted at the 2024 International Conference on Learning Representations (ICLR). MathVista, an open-source benchmark, combines new and existing data to measure how good models are at solving a variety of math problems that involve processing images as well as text, helping to gain insight into their reasoning capabilities.

Transcript

[MUSIC]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

My guest today is Dr. Michel Galley, a senior principal researcher at Microsoft Research. Dr. Galley is the coauthor of a paper called “MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts.” Michel, thanks for joining us on Abstracts today!


MICHEL GALLEY: Thank you for having me.

HUIZINGA: So I like to start with a distillation or sort of an elevator pitch of your research. Tell us in just a couple sentences what problem or issue your paper addresses and why we should care about it.

GALLEY: So this paper is about evaluating large foundation models. So it’s a very important part of researching large language models because it’s a good way to evaluate, kind of, the capabilities—what these models are good at and not good at. And a part of the focus of MathVista is to evaluate these large foundation models in a multimodal setup, so when the input to the model is actually not just text but also text and images. And then, an example of a task that such a model would perform is, like, the input is maybe a mathematical question, and then there’s some visual support to that question, let’s say, of an image of a graph, and then the model has to respond to something related to that. And why this is important … there has been a lot of work, of course, on large foundation model. Especially when it comes to reasoning tasks, like mathematical reasoning, a lot has focused more on written form.

HUIZINGA: Yeah …

GALLEY: So MathVista is one of the very first datasets that has input that is both images and text.

HUIZINGA: Yeah, yeah. Well, reading your paper, it seems like this is an area that hasn’t been studied systematically. In fact, you actually say that! And say that the field is largely unexplored. But quickly tell us what has been done in this field, and then tell us how your research addresses the proverbial gap in the literature.

GALLEY: Well, there has been a lot of work on vision and language in other problems, like not just about reasoning. Maybe let me just mention why reasoning is important. So one reason I think it’s very interesting to evaluate these large language models in terms of reasoning skill is that we evaluate their capabilities beyond just memorization. So as many of your listeners probably know, these large foundation models are trained on large amounts of text that is public data from various sources. So when you ask a question to a large foundation model, it could be the case, in many cases, that it just memorizes things it has seen in the data.

HUIZINGA: Sure.

GALLEY: So what makes it interesting in terms of reasoning, the answer oftentimes is not there in the data. So it needs to develop this ability to connect the dots between various pieces of information to come up with a new answer. So the focus of our paper is really on mathematical reasoning, but it goes also a bit beyond that because what is also represented in the data is also science question and so on.

HUIZINGA: Yeah …

GALLEY: So this reasoning part has largely focused, until MathVista, on text-only modalities.

HUIZINGA: Yeah …

GALLEY: So it’s one of our very first ones that combines text and images in terms of evaluating these large foundation models. So you ask about what was done before. So, yes, there has been a lot of work, text only, on reasoning, for example, the mathematical question that’s just based on text. And there has been a different stream of work that was much more focused on vision. A lot of work has been on tasks such as visual question answering …

HUIZINGA: Yeah …

GALLEY: … where basically, you have an image and the question is about answer a question about this image. So, yes, we’re trying to fuse the two lines of research here.

HUIZINGA: Right …

GALLEY: And that’s one of the first works that does that.

HUIZINGA: Yeah. Well, let’s talk about your methodology for a minute. Tell us how you went about conducting this research, and what methods did you use?

GALLEY: Yes, sure. So that’s a bit different from a typical, kind of, machine learning paper because the focus on this work is really on benchmarking on the dataset. So the methodology is more about how we collect the data, process it. So they have two components to doing that. One was to look at existing data that already combines vision and text. And there are existing datasets that are actually already fairly big but that were not focused on reasoning. So we use those existing datasets and look for instances in the data that actually include some mathematical or science reasoning. And so that part is leveraging existing datasets, but the important part is, like, we really want to carve out what was interesting piece in terms of reasoning. And we had different stages of processing the data to identify the subset that was reasoning-based. So one first step was basically to apply some automatic filter to determine whether or not a given example, let’s say something that is visual and text, is actually … involves some mathematical reasoning. So we have different strategy. For example, if the answer is numerical, it’s likely that it might be something mathematically related. But that’s just the first stage. And the second stage, we actually had humans, annotators, just certify that the selected data is actually of high quality. So we do have an example of, “Oh, this is mathematical, and that’s either mathematical or scientific,” and so on. And that’s one part of the effort. The other part is that we realized while we collected the data, there are certain types of mathematical reasoning or related to mathematical reasoning that were not represented in the data. So we created three new datasets as part of MathVista. So when I said dataset, it’s more like, think of MathVista as like an aggregate of different types of data, and we added three of them, three new types of data. 
One is what you call PaperQA, which is basically data that is collected from scientific papers on arXiv, and that had questions asking about that paper and that included some visual components from the paper, typically a plot or a figure.

HUIZINGA: Yeah …

GALLEY: And then we had IQTest, which is basically, I mean, it’s vaguely related mathematically, but basically it also, kind of, tried to see maybe more abstractive thinking about maybe some input that is both text and visual. And the final is about FunctionQA, that is basically algebraic reasoning and function plots and so on.

HUIZINGA: OK …

GALLEY: The important part was actually to identify among vast amounts of data what is actually very interesting in terms of mathematical reasoning.

HUIZINGA: Yeah …

GALLEY: So that part, I think, was quite a big part of doing that work—finding existing data but also creating new data.

HUIZINGA: Yeah, yeah. Well, my favorite part of a research paper is where it says, “and what we found was … ,” so talk a little bit about your results. What did you find?

GALLEY: So we evaluated a wide variety of models, including GPT-4, Claude 2, GPT-4V, multimodal Bard, and LLaVA, and we categorized them into three categories. So one is text only. So, basically, you take a model that is by default just text, and we give it the text part of the question and ask it to answer the question. Of course, that’s, kind of, a bit of a, it’s a difficult task because oftentimes [LAUGHTER] we crucially build these questions so that you have to rely on the vision part. But that’s for, you know, scientific investigation to know how well they can do, and so that’s one category of model. A different category is still text only but that is given the detection from the image. So on the image, we do OCR. So we convert those words from images to text. It’s kind of an extension of the text-based model, except that what was images is translated into text, and then the input to the model is word only, and that’s a different category of model. And the third one is basically truly multimodal model. And what we found, I mean, not surprisingly, it’s, kind of, the one that was doing most poorly is the one that is text only. The second is text plus OCR. And then finally, the one that does best is the multimodal like GPT-4V. But while the ordering between these three categories makes sense, it was a bit surprising that maybe the gap between multimodal and text plus OCR was not bigger. Well, it’s big, but maybe not as big as we were expecting. So, for example, the best detection from the images model achieved like 35 percent accuracy while GPT-4V was 50 percent. So it’s a substantial gap but not huge.

HUIZINGA: Right. Just to clarify, you’re saying OCR. What does that stand for?

GALLEY: [Optical] character recognition.

HUIZINGA: Gotcha.

GALLEY: So, basically, it’s the task of taking text, sometimes typed, but sometimes written, and convert this into the actual text like you would have in a text file.

HUIZINGA: Right. Michel, does any of this have to do with the difficulty of the math problems that you present these models with? I mean, it seems to me, similar to humans, that the easier the problem, the easier it would be for the machine. So at what level of math are we talking for these tests?

GALLEY: What’s nice about MathVista is there’s continuum [of] different difficulties. So the spectrum is quite broad, going from elementary school to more advanced concepts such as calculus. So it’s quite broad. So in the paper, we do have this, kind of, broken down by level. So the number I gave you, like 50 percent, is an aggregate over all the difficulties. But …

HUIZINGA: Gotcha.

GALLEY: But the goal there was really, kind of, to compare different models, but we do have a fair amount of analysis in the appendix. Actually, we have 100 pages of appendices of plenty of analysis and so on. So if people, I mean …

HUIZINGA: I saw that. I saw the length of the paper, and I’m going, what? [LAUGHS] That’s a LONG paper! Well, research in the lab is one thing, I always like to say, but understanding real-world impact is important, too. So where’s this work going to make the most difference, and who does it help most at this point?

GALLEY: Well, I think perhaps that’s the main point of this kind of line of work in terms of reasoning is that when looking at this difficult problem that are mathematical, actually it’s a way to, kind of, abstract away maybe more complex capabilities, and I think while thinking just about mathematics might seem a bit narrow, I don’t think that really is. It’s more about seeing whether this model has the ability to do, kind of, multistep kind of processing of your input and think maybe somewhat intelligently about a given problem. So we focus mostly on math. There is some science, but we would be very interested, especially in future work, to, kind of, go beyond that.

HUIZINGA: OK, well, let me press in a little bit there because … just say I’m a regular person using a GPT model. Is your work more addressed upstream from that to the research community to say, how do we get these models to be better so that downstream people like me can be more confident of the models?

GALLEY: Yes, I would say at the moment, I mean, this line of work is perhaps more geared towards somewhat more research community, but I think it could be some seed for researchers to think about some applications perhaps that also requires some kind of step-by-step reasoning but perhaps not going beyond math.

HUIZINGA: Yeah. Michel, if there was one thing you wanted our listeners to take away from this research, kind of golden nugget, what would it be?

GALLEY: Well, I would say it’s the challenging part of these datasets. I think that’s what makes MathVista stand out compared to other datasets. By now, there are a few other vision and language datasets, and of course, many that are more text-based. And we’ve seen, for example, some recent papers showing that actually MathVista remains one of the most challenging ones. So I think it’s probably going to stay around for a while because of the difficulty it represents. So it’s open source of available datasets that everybody can use, and I very much encourage people to use it.

HUIZINGA: Is it on GitHub?

GALLEY: Yes, it’s on GitHub.

HUIZINGA: So what’s next on the research agenda for helping LLMs get better at math, Michel? What are the big challenges in the field yet? I mean, you’ve alluded to many of them already, sort of, but what’s next on your research agenda?

GALLEY: Well, I would say what we found so far is these models are very good at processing the textual part of problems it’s given, to the model, but you have the equivalent in images actually harder somehow. So I think a lot more work needs to be done in terms of vision capabilities, in terms of reasoning over images, because the capabilities you will see in text are actually quite advanced, whereas the equivalent in images doesn’t seem that good. I mean, a fair disclaimer: my background is more on the text side, [LAUGHTER] so some of my colleagues on the paper are more on the vision side, so maybe if a listener maybe run into some of our coauthors at the conference, they might want to talk to these vision people because that’s less of my background. [LAUGHS]

HUIZINGA: Well, and if you think about Venn diagrams, you know, you’ve got people that are doing text, people that are doing vision, and then the people that are trying to do both to see how the worlds collide.

[MUSIC]

Well, Michel Galley, thanks for joining us today. And to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/abstracts, or you can find it on arXiv. You can also read it on the website for the International Conference on Learning Representations, or ICLR. And if you happen to be at the ICLR conference this week, you can hear more about it there. See you next time on Abstracts!

[MUSIC FADES]

The post Abstracts: May 6, 2024 appeared first on Microsoft Research.

Research Focus: Week of April 29, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Can Large Language Models Transform Natural Language Intent into Formal Method Postconditions?

Informal natural language that describes code functionality, such as code comments or function documentation, may contain substantial information about a program’s intent. However, there is no guarantee that a program’s implementation aligns with its natural language documentation. In the case of a conflict, leveraging information in code-adjacent natural language has the potential to enhance fault localization, debugging, and code trustworthiness. However, this information is often underutilized, due to the inherent ambiguity of natural language which makes natural language intent challenging to check programmatically. The “emergent abilities” of large language models (LLMs) have the potential to facilitate the translation of natural language intent to programmatically checkable assertions. However, due to a lack of benchmarks and evaluation metrics, it is unclear if LLMs can correctly translate informal natural language specifications into formal specifications that match programmer intent—and whether such translation could be useful in practice.

In a new paper: Can Large Language Models Transform Natural Language Intent into Formal Method Postconditions?, researchers from Microsoft describe nl2postcond, the problem of leveraging LLMs for transforming informal natural language to formal method postconditions, expressed as program assertions. The paper, to be presented at the upcoming ACM International Conference on the Foundations of Software Engineering, introduces and validates metrics to measure and compare different nl2postcond approaches, using the correctness and discriminative power of generated postconditions. The researchers show that nl2postcond via LLMs has the potential to be helpful in practice by demonstrating that LLM-generated specifications can be used to discover historical bugs in real-world projects. 
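
To make the two metrics concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's benchmark or tooling: the function `dedupe_sorted`, its buggy variant, both postconditions, and the test inputs are all hypothetical. A postcondition is treated as correct if it holds on the reference implementation, and as discriminative if it also rejects a faulty implementation.

```python
def dedupe_sorted(xs):
    """Return the distinct elements of xs in ascending order."""
    return sorted(set(xs))


def dedupe_sorted_buggy(xs):
    # Hypothetical faulty variant: sorts but forgets to remove duplicates.
    return sorted(xs)


def weak_postcondition(xs, result):
    # Only checks ordering, so it also accepts the buggy variant.
    return all(a <= b for a, b in zip(result, result[1:]))


def strong_postcondition(xs, result):
    # Strictly ascending (no duplicates) and same set of elements as the input.
    strictly_ascending = all(a < b for a, b in zip(result, result[1:]))
    return strictly_ascending and set(result) == set(xs)


def holds_on_all(post, impl, inputs):
    # True if the postcondition holds for impl on every input.
    return all(post(xs, impl(list(xs))) for xs in inputs)


# Hypothetical test inputs, chosen to include duplicates.
TEST_INPUTS = [[], [3, 1, 2], [2, 2, 1], [5, 5, 5]]

for name, post in [("weak", weak_postcondition), ("strong", strong_postcondition)]:
    correct = holds_on_all(post, dedupe_sorted, TEST_INPUTS)
    rejects_bug = not holds_on_all(post, dedupe_sorted_buggy, TEST_INPUTS)
    print(name, "| correct:", correct, "| rejects buggy variant:", rejects_bug)
```

Both postconditions are correct in this toy setting, but only the stronger one would help surface the bug, which is the intuition behind measuring discriminative power alongside correctness.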


Semantically Aligned Question and Code Generation for Automated Insight Generation

People who work with data, like engineers, analysts, and data scientists, often must manually look through data to find valuable insights or write complex scripts to automate exploration of the data. Automated insight generation provides these workers the opportunity to immediately glean insights about their data and identify valuable starting places for writing their exploration scripts. Unfortunately, automated insight generation with LLMs can sometimes produce code that does not correctly correspond (or align) to the insight. In a recent paper: Semantically Aligned Question and Code Generation for Automated Insight Generation, researchers from Microsoft leverage the semantic knowledge of LLMs to generate targeted and insightful questions about data and the corresponding code to answer those questions. Through an empirical study on data from Open-WikiTable, they then show that embeddings can be effectively used for filtering out semantically unaligned pairs of question and code. The research also shows that generating questions and code together yields more interesting and diverse insights about data. 
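
As a rough illustration of embedding-based filtering, the sketch below keeps only question/code pairs whose embeddings are close in cosine similarity. Everything specific here is an assumption for the example: the vectors are hand-written stand-ins for real embeddings, the threshold is arbitrary, and the paper may compute and compare embeddings differently.

```python
import math


def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical question/code pairs with made-up 3-dimensional embeddings.
# In practice the vectors would come from an embedding model.
pairs = [
    {
        "question": "Which country won the most medals?",
        "code": "df.groupby('country')['medals'].sum().idxmax()",
        "q_vec": [0.9, 0.1, 0.2],
        "c_vec": [0.8, 0.2, 0.1],
    },
    {
        "question": "What is the average athlete age?",
        "code": "df['medals'].max()",
        "q_vec": [0.1, 0.9, 0.1],
        "c_vec": [0.9, 0.1, 0.3],
    },
]

THRESHOLD = 0.8  # hypothetical cutoff, not a value from the paper

aligned = [p for p in pairs if cosine(p["q_vec"], p["c_vec"]) >= THRESHOLD]

for p in aligned:
    print("kept:", p["question"], "->", p["code"])
```

With these stand-in vectors, the second pair is dropped because its code answers a different question than the one asked.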


Explaining CLIP’s performance disparities on data from blind/low vision users

AI-based applications hold the potential to assist people who are blind or low vision (BLV) with everyday visual tasks. However, human assistance is often required, due to the wide variety of assistance needed and varying quality of images available. Recent advances in large multi-modal models (LMMs) could potentially address these challenges, enabling a new era of automated visual assistance. Yet, little work has been done to evaluate how well LMMs perform on data from BLV users.

In a recent paper: Explaining CLIP’s performance disparities on data from blind/low vision users, researchers from Microsoft and the World Bank address this issue by assessing CLIP, a widely used LMM with potential to underpin many assistive technologies. Testing 25 CLIP variants in a zero-shot classification task, their results show that disability objects, like guide canes and Braille displays, are recognized significantly less accurately than common objects, like TV remote controls and coffee mugs, in some cases by up to 28 percentage points. 

The researchers perform an analysis of the captions in three large-scale datasets that are commonly used to train models like CLIP and show that BLV-related content (such as guide canes) is rarely mentioned. This is a potential reason for the large performance gaps. The researchers show that a few-shot learning approach with as few as five example images of a disability object can improve CLIP’s ability to recognize that object, holding the potential to mitigate CLIP’s performance disparities for BLV users. They then discuss other possible mitigations. 


Closed-Form Bounds for DP-SGD against Record-level Inference 

Privacy of training data is a central consideration when deploying machine learning (ML) models. Models trained with guarantees of differential privacy (DP) provably resist a wide range of attacks. Although it is possible to derive bounds, or safe limits, for specific privacy threats solely from DP guarantees, meaningful bounds require impractically small privacy budgets, which results in a large loss in utility.
 
In a recent paper: Closed-Form Bounds for DP-SGD against Record-level Inference, researchers from Microsoft present a new approach to quantify the privacy of ML models against membership inference (inferring whether a data record is in the training data) and attribute inference (reconstructing partial information about a record) without the indirection through DP. They focus on the popular DP-SGD algorithm, which they model as an information theoretic channel whose inputs are the secrets that an attacker wants to infer (e.g., membership of a data record) and whose outputs are the intermediate model parameters produced by iterative optimization. They obtain closed-form bounds for membership inference that match state-of-the-art techniques but are orders of magnitude faster to compute. They also present the first algorithm to produce data-dependent bounds against attribute inference. Compared to bounds computed indirectly through numerical DP budget accountants, these bounds provide a tighter characterization of the privacy risk of deploying an ML model trained on a specific dataset. This research provides a direct, interpretable, and practical way to evaluate the privacy of trained models against inference threats without sacrificing utility.
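
For readers unfamiliar with DP-SGD, the following is a minimal sketch of one update step as the algorithm is commonly described: clip each per-example gradient, sum, add Gaussian noise, then average and update. It does not implement the paper's bounds. The function names and the toy values are assumptions made for illustration.

```python
import math
import random


def clip(grad, max_norm):
    # Scale the gradient down so its L2 norm is at most max_norm.
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    # 1. Clip each example's gradient to bound its individual influence.
    clipped = [clip(g, max_norm) for g in per_example_grads]
    # 2. Sum the clipped gradients coordinate by coordinate.
    summed = [sum(column) for column in zip(*clipped)]
    # 3. Add Gaussian noise scaled to the clipping bound.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    # 4. Average over the batch and take a gradient step.
    batch_size = len(per_example_grads)
    return [p - lr * g / batch_size for p, g in zip(params, noisy)]


# Toy usage with hypothetical values.
rng = random.Random(0)
params = [0.0, 0.0]
grads = [[3.0, 4.0], [0.3, 0.4]]  # the first gradient has norm 5 and gets clipped
new_params = dp_sgd_step(params, grads, lr=1.0, max_norm=1.0,
                         noise_multiplier=1.1, rng=rng)
print(new_params)
```

The sequence of intermediate parameters produced by repeated steps like this one is what the paper treats as the output of the channel, with the secret (for example, whether a record was in the batch) as its input.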

Microsoft Research in the news


TIME100 Most Influential People in Health 

TIME | May 2, 2024

Microsoft Research president Peter Lee is included as an innovator on the 2024 TIME100 Health list, TIME’s inaugural list of 100 individuals who most influenced global health this year.


Sanctuary AI Announces Microsoft Collaboration to Accelerate AI Development for General Purpose Robots 

Sanctuary AI | May 1, 2024

Sanctuary AI and Microsoft are collaborating on the development of AI models for general purpose humanoid robots. Sanctuary AI will leverage Microsoft’s Azure cloud resources for their AI workloads.


Tiny but mighty: The Phi-3 small language models with big potential 

Microsoft Source | April 23, 2024

LLMs create exciting opportunities for AI to boost productivity and creativity. But they require significant computing resources. Phi-3 models, which perform better than models twice their size, are now publicly available from Microsoft.


AI Is Unearthing New Drug Candidates, But It Still Needs Human Oversight 

Drug Discovery Online | April 11, 2024

Drug Discovery Online published a contributed article from Junaid Bajwa discussing how recent advancements in AI offer the potential to streamline and optimize drug development in unprecedented ways.


How AI is helping create sustainable farms of the future 

The Grocer | April 16, 2024

Ranveer Chandra authored an essay on how AI is helping create sustainable farms of the future for UK-based trade outlet, The Grocer.


The Future of AI and Mental Health 

Psychiatry Online | April 16, 2024

Psychiatric News published an article featuring Q&A with Jina Suh, highlighting the important considerations for the use of AI technologies among psychiatrists and mental health professionals.


MatterGen’s Breakthroughs: How AI Shapes the Future of Materials Science 

Turing Post | April 19, 2024

Turing Post covered MatterGen in an interview with Tian Xie. Learn more about this impactful generative model for inorganic materials design.


Machine Learning Street Talk interview with Chris Bishop 

Machine Learning Street Talk | April 10, 2024

Chris Bishop joined Dr. Tim Scarfe for a wide-ranging interview on advances in deep learning and AI for science.

The post Research Focus: Week of April 29, 2024 appeared first on Microsoft Research.

Microsoft at ASPLOS 2024: Advancing hardware and software for high-scale, secure, and efficient modern applications

Modern computer systems and applications, with unprecedented scale, complexity, and security needs, require careful co-design and co-evolution of hardware and software. The ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) is the main forum where researchers bridge the gap between architecture, programming languages, and operating systems to advance the state of the art.

ASPLOS 2024 is taking place in San Diego between April 27 and May 1, and Microsoft researchers and collaborators have a strong presence, with members of our team taking on key roles in organizing the event. This includes participation in the program and external review committees and leadership as the program co-chair.

We are pleased to share that eight papers from Microsoft researchers and their collaborators have been accepted to the conference, spanning a broad spectrum of topics. In the field of AI and deep learning, subjects include power and frequency management for GPUs and LLMs, the use of Process-in-Memory for deep learning, and instrumentation frameworks. Regarding infrastructure, topics include memory safety with CHERI, I/O prefetching in modern storage, and smart oversubscription of burstable virtual machines. This post highlights some of this work.

Paper highlights

Characterizing Power Management Opportunities for LLMs in the Cloud

The rising popularity of LLMs and generative AI has led to an unprecedented demand for GPUs. However, the availability of power is a key limiting factor in expanding a GPU fleet. This paper characterizes the power usage in LLM clusters, examines the power consumption patterns across multiple LLMs, and identifies the differences between inference and training power consumption patterns. This investigation reveals that the average and peak power consumption in inference clusters is not very high, and that there is substantial headroom for power oversubscription. Consequently, the authors propose POLCA: a framework for power oversubscription that is robust, reliable, and readily deployable for GPU clusters. It can deploy 30% more servers in the same GPU clusters for inference tasks, with minimal performance degradation.
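
The headroom argument comes down to simple arithmetic, sketched below. All numbers are hypothetical and were picked so that the result lines up with the 30% figure quoted above; they are not measurements from the paper, and POLCA itself involves more than this calculation.

```python
def servers_supported(power_budget_watts, provisioned_watts_per_server):
    # Number of whole servers that fit within a fixed power budget.
    return int(power_budget_watts // provisioned_watts_per_server)


# Hypothetical values for illustration only.
ROW_BUDGET_W = 650_000     # power available to a row of GPU servers
RATED_PEAK_W = 6_500       # per-server power if provisioned at rated peak
OBSERVED_PEAK_W = 5_000    # per-server peak actually observed during inference

baseline = servers_supported(ROW_BUDGET_W, RATED_PEAK_W)
oversubscribed = servers_supported(ROW_BUDGET_W, OBSERVED_PEAK_W)
extra_fraction = (oversubscribed - baseline) / baseline

print(baseline, oversubscribed, extra_fraction)  # 100 130 0.3
```

Provisioning against observed rather than rated peaks is only safe if something reacts when real draw approaches the budget, which is why the paragraph above stresses that the framework must be robust and reliable.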

PIM-DL: Expanding the Applicability of Commodity DRAM-PIMs for Deep Learning via Algorithm-System Co-Optimization

PIM-DL is the first deep learning framework specifically designed for off-the-shelf processing-in-memory (PIM) systems, capable of offloading most computations in neural networks. Its goal is to surmount the computational limitations of PIM hardware by replacing traditional compute-heavy matrix multiplication operations with Lookup Tables (LUTs). PIM-DL first enables neural networks to operate efficiently on PIM architectures, significantly reducing the need for complex arithmetic operations. PIM-DL demonstrates significant speed improvements, achieving up to ~37x faster performance than traditional GEMM-based systems and showing competitive speedups against CPUs and GPUs.
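
One generic way to replace a matrix multiplication with table lookups is sketched below. It is a product-quantization-style approximation offered as an assumption about the general idea, not PIM-DL's actual algorithm: the input is split into sub-vectors, each is snapped to its nearest centroid, and the centroid's precomputed partial product is read from a table. The weights, centroids, and function names are all hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def build_tables(centroids, weights, sub_dim):
    # weights has shape (in_dim, out_dim), stored as a list of rows.
    # tables[s][k] holds the partial output for centroid k of sub-vector s.
    tables = []
    for s in range(len(weights) // sub_dim):
        rows = weights[s * sub_dim:(s + 1) * sub_dim]
        columns = list(zip(*rows))
        tables.append([[dot(c, col) for col in columns] for c in centroids[s]])
    return tables


def nearest(vec, candidates):
    # Index of the candidate closest to vec in squared Euclidean distance.
    return min(range(len(candidates)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, candidates[i])))


def lut_matvec(x, centroids, tables, sub_dim):
    # Approximate x @ weights using only nearest-centroid search and additions.
    out = [0.0] * len(tables[0][0])
    for s, table in enumerate(tables):
        idx = nearest(x[s * sub_dim:(s + 1) * sub_dim], centroids[s])
        out = [o + t for o, t in zip(out, table[idx])]
    return out


# Hypothetical 4x2 weight matrix and two centroids per 2-dimensional sub-vector.
WEIGHTS = [[1, 0], [0, 1], [2, 1], [1, 2]]
CENTROIDS = [[[1, 0], [0, 1]], [[1, 1], [2, 0]]]
TABLES = build_tables(CENTROIDS, WEIGHTS, sub_dim=2)

x = [1, 0, 1, 1]  # each sub-vector coincides with a centroid, so the result is exact
print(lut_matvec(x, CENTROIDS, TABLES, sub_dim=2))  # [4.0, 3.0]
```

The tables are built once from the weights, so the per-input work is reduced to lookups and additions, which is the kind of workload the paragraph above describes as a better fit for PIM hardware. For inputs that do not coincide with centroids the result is an approximation.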

Cornucopia Reloaded: Load Barriers for CHERI Heap Temporal Safety

Memory safety bugs have persistently plagued software for over 50 years and underpin some 70% of common vulnerabilities and exposures (CVEs) every year. The CHERI capability architecture is an emerging technology (especially through Arm’s Morello and Microsoft’s CHERIoT platforms) for spatial memory safety and software compartmentalization. In this paper, the authors demonstrate the viability of object-granularity heap temporal safety built atop CHERI with considerably lower overheads than prior work.

AUDIBLE: A Convolution-Based Resource Allocator for Oversubscribing Burstable Virtual Machines

Burstable virtual machines (BVMs) are a type of virtual machine in the cloud that allows temporary increases in resource allocation. This paper shows how to oversubscribe BVMs. It first studies the characteristics of BVMs on Microsoft Azure and explains why traditional approaches based on using a fixed oversubscription ratio or based on the Central Limit Theorem do not work well for BVMs: they lead to either low utilization or high server capacity violation rates. Based on the lessons learned from the workload study, the authors developed a new approach, called AUDIBLE, using a nonparametric statistical model. This makes the approach lightweight and workload independent. This study shows that AUDIBLE achieves high system utilization while enforcing stringent requirements on server capacity violations.
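
The paper's title points to convolution, and the general idea can be sketched as follows: if each VM's demand is described by an empirical histogram, the distribution of total demand on a server is the convolution of those histograms, from which a capacity violation probability can be read off. This is a generic illustration that assumes VM demands are independent; it is not AUDIBLE's actual model, and the usage histogram and capacity are hypothetical.

```python
def convolve(p, q):
    # Distribution of the sum of two independent discrete random variables.
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def aggregate_pmf(pmfs):
    # Distribution of total demand across all VMs placed on one server.
    total = [1.0]
    for pmf in pmfs:
        total = convolve(total, pmf)
    return total


def violation_probability(pmfs, capacity):
    # Probability that total demand exceeds the server's capacity.
    total = aggregate_pmf(pmfs)
    return sum(total[capacity + 1:])


# Hypothetical histogram: index = cores in use.
# The VM sits at 1 core 80% of the time and bursts to 4 cores 20% of the time.
BURSTABLE_VM = [0.0, 0.8, 0.0, 0.0, 0.2]

# Three such VMs on a server with 8 cores.
print(violation_probability([BURSTABLE_VM] * 3, capacity=8))
```

Provisioning for the worst case would need 12 cores for these three VMs. Working from the aggregate distribution instead lets an operator pick the smallest capacity whose violation probability stays under a chosen target.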

Complete list of accepted publications by Microsoft researchers

Amanda: Unified Instrumentation Framework for Deep Neural Networks
Yue Guan, Yuxian Qiu, and Jingwen Leng; Fan Yang, Microsoft Research; Shuo Yu, Shanghai Jiao Tong University; Yunxin Liu, Tsinghua University; Yu Feng and Yuhao Zhu, University of Rochester; Lidong Zhou, Microsoft Research; Yun Liang, Peking University; Chen Zhang, Chao Li, and Minyi Guo, Shanghai Jiao Tong University

AUDIBLE: A Convolution-Based Resource Allocator for Oversubscribing Burstable Virtual Machines
Seyedali Jokar Jandaghi and Kaveh Mahdaviani, University of Toronto; Amirhossein Mirhosseini, University of Michigan; Sameh Elnikety, Microsoft Research; Cristiana Amza and Bianca Schroeder, University of Toronto

Characterizing Power Management Opportunities for LLMs in the Cloud
Pratyush Patel, Microsoft Azure and University of Washington; Esha Choukse, Chaojie Zhang, and Íñigo Goiri, Azure Research; Brijesh Warrier, Nithish Mahalingam, Ricardo Bianchini, Microsoft Azure Research

Cornucopia Reloaded: Load Barriers for CHERI Heap Temporal Safety
Nathaniel Wesley Filardo, University of Cambridge and Microsoft Research; Brett F. Gutstein, Jonathan Woodruff, Jessica Clarke, and Peter Rugg, University of Cambridge; Brooks Davis, SRI International; Mark Johnston, University of Cambridge; Robert Norton, Microsoft Research; David Chisnall, SCI Semiconductor; Simon W. Moore, University of Cambridge; Peter G. Neumann, SRI International; Robert N. M. Watson, University of Cambridge

CrossPrefetch: Accelerating I/O Prefetching for Modern Storage
Shaleen Garg and Jian Zhang, Rutgers University; Rekha Pitchumani, Samsung; Manish Parashar, University of Utah; Bing Xie, Microsoft; Sudarsun Kannan, Rutgers University

Kimbap: A Node-Property Map System for Distributed Graph Analytics
Hochan Lee, University of Texas at Austin; Roshan Dathathri, Microsoft Research; Keshav Pingali, University of Texas at Austin

PIM-DL: Expanding the Applicability of Commodity DRAM-PIMs for Deep Learning via Algorithm-System Co-Optimization
Cong Li and Zhe Zhou, Peking University; Yang Wang, Microsoft Research; Fan Yang, Nankai University; Ting Cao and Mao Yang, Microsoft Research; Yun Liang and Guangyu Sun, Peking University

Predict; Don’t React for Enabling Efficient Fine-Grain DVFS in GPUs
Srikant Bharadwaj, Microsoft Research; Shomit Das, Qualcomm; Kaushik Mazumdar and Bradford M. Beckmann, AMD; Stephen Kosonocky, Uhnder

Conference organizers from Microsoft

Program Co-Chair

Madan Musuvathi

Submission Chairs

Jubi Taneja
Olli Saarikivi

Program Committee

Abhinav Jangda
Aditya Kanade
Ashish Panwar
Jacob Nelson
Jay Lorch
Jilong Xue
Paolo Costa
Rodrigo Fonseca
Shan Lu
Suman Nath
Tim Harris

External Review Committee

Rujia Wang

Career opportunities

Microsoft welcomes talented individuals across various roles at Microsoft Research, Azure Research, and other departments. We are always pushing the boundaries of computer systems to improve the scale, efficiency, and security of all our offerings. You can review our open research-related positions here.

The post Microsoft at ASPLOS 2024: Advancing hardware and software for high-scale, secure, and efficient modern applications appeared first on Microsoft Research.
