Enhanced autoscaling with VASIM: Vertical Autoscaling Simulator Toolkit


This research was presented as a demonstration at the 40th IEEE International Conference on Data Engineering (ICDE 2024), one of the premier conferences on data and information engineering.


Since the inception of cloud computing, autoscaling has been an essential technique for optimizing resources and performance. By dynamically adjusting the computing resources allocated to a service based on current demand, autoscaling ensures that the service can handle its load efficiently while keeping costs in check. However, developing and fine-tuning the autoscaling algorithms that govern this process presents significant challenges. The complexity and cost of testing these algorithms can lead to inefficient resource management and impede the development of more effective autoscaling strategies.

In our paper, “VASIM: Vertical Autoscaling Simulator Toolkit,” presented at ICDE 2024, we introduce a tool designed to address the complexities involved in assessing autoscaling algorithms. While existing simulation tools cover a range of capabilities, such as energy efficiency and fault tolerance, VASIM stands out by evaluating the recommender component at the core of the algorithm, which suggests resource scaling actions based on usage data while balancing performance and cost. This enables developers to iterate more rapidly, improving algorithmic performance, resource efficiency, and cost savings.

VASIM’s user-friendly interface simplifies the evaluation of autoscaling policies, as illustrated in Figure 1. The first steps entail uploading historical data and defining autoscaling policies, including the algorithm and its parameters, shown in the left panel. The Simulation Run feature enables the modification of algorithm parameters, imported via a configuration file, and the execution of simulations on the selected trace. A results screen displays the CPU limits determined by the selected policies as well as the actual CPU usage under those limits. Additionally, VASIM reports fundamental metrics for the current simulation, such as throttling incidents, the number of scaling operations, and the amount of unused capacity, or slack.

The left panel of the interface offers Simulation Run, Simulation Tuning, and Simulation Tuning History views. In the Simulation Run view shown, the user has loaded a workload trace from a CSV file, selected a recommender algorithm, and imported a metadata configuration file; clicking “Visualize workload” plots the loaded trace. The right panel exposes run parameters such as the recommendation lag (how often the recommender issues a decision) and the initial core count, along with the editable algorithm parameters from the configuration file. After a run, the results view plots actual CPU usage in blue against the limits computed by the selected algorithm in red, and reports metrics such as average and total slack, average and total insufficient CPU, the number of scaling operations, and the number of throttling events.
Figure 1. The VASIM user interface comprises a run simulation pane on the left and a results pane on the right.
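To make the reported metrics concrete, here is a minimal Python sketch that computes slack, insufficient CPU, throttled intervals, and scaling counts from a usage trace and a limit series. The synthetic trace and fixed limit are purely illustrative and do not reflect VASIM’s actual API.

```python
import numpy as np

# Synthetic one-day CPU usage trace (cores used per minute), standing in for an uploaded CSV.
rng = np.random.default_rng(0)
usage = 2.0 + 1.5 * np.sin(np.linspace(0, 12 * np.pi, 1440)) + rng.normal(0, 0.2, 1440)
usage = np.clip(usage, 0.1, None)

# A CPU-limit series such as a recommender might produce (here simply a fixed limit).
limit = np.full_like(usage, 4.0)

served = np.minimum(usage, limit)            # demand above the limit is throttled
slack = limit - served                       # provisioned but unused capacity
insufficient = np.maximum(usage - limit, 0)  # demand that could not be served

print(f"average slack:            {slack.mean():.2f} cores")
print(f"average insufficient CPU: {insufficient.mean():.2f} cores")
print(f"throttled intervals:      {(insufficient > 0).sum()} of {len(usage)}")
print(f"scaling operations:       {np.count_nonzero(np.diff(limit))}")
```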

VASIM achieves several important goals:

Resource efficiency and cost reduction. VASIM reduces costs by removing the need to test scaling operations in real time, which would be resource intensive. This enables developers to adjust algorithms iteratively in a controlled, cost-efficient environment, accelerating development cycles. Because the tool lets users upload CPU performance history and algorithm parameters, it delivers the results of scaling operations across the entire workload in minutes rather than hours.

Multi-objective optimization. It’s challenging to develop an autoscaling method that balances conflicting objectives. VASIM makes this easier by applying Pareto optimization techniques, helping developers find a balance among key metrics. Figure 2 depicts scatter plots for two metrics, average slack and average insufficient CPU, along with a three-objective view covering slack, throttling, and the number of scaling operations.

The left plot shows average slack against average insufficient CPU: as average insufficient CPU decreases, average slack increases, and the Pareto-frontier configurations are highlighted in red along the edge of the trade-off curve. The right plot is a 3D scatter of total slack, total CPU throttle, and the number of scaling operations, showing that driving slack and throttling down tends to increase the number of scaling operations.
Figure 2. The 2D diagram on the left shows a scatter plot of tuning with Pareto points. The 3D graph on the right shows a scatter plot with the three objectives.
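As a rough illustration of how such a frontier is computed, the sketch below finds the Pareto-optimal configurations among a handful of hypothetical tuning results; the metric values are invented, and VASIM’s own tuning machinery is more elaborate.

```python
import numpy as np

def pareto_front(points: np.ndarray) -> np.ndarray:
    """Boolean mask of non-dominated rows; every objective is minimized."""
    optimal = np.ones(len(points), dtype=bool)
    for i, p in enumerate(points):
        for j, q in enumerate(points):
            if i != j and np.all(q <= p) and np.any(q < p):
                optimal[i] = False
                break
    return optimal

# Hypothetical tuning results: (average slack, average insufficient CPU, number of scalings).
results = np.array([
    [1.8, 0.02, 40],
    [1.2, 0.05, 55],
    [0.7, 0.15, 90],
    [1.5, 0.05, 60],   # dominated by the second row
    [0.4, 0.40, 120],
])
print(results[pareto_front(results)])
```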

Recommender algorithm testing. VASIM simplifies the process of testing and evaluating recommendation algorithms across diverse workloads. With all tuning jobs running in parallel, computation completes more quickly, allowing users to efficiently adjust their recommender parameters as necessary. To assess the algorithms’ generalizability, we ran VASIM against 11 publicly available cluster traces used for benchmarking, as well as internal product workload traces. This enabled us to evaluate the algorithms’ robustness across a variety of workload types, including cyclical, bursty, and monotonic variations, demonstrating their reliability across different scenarios.
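A parallel parameter sweep of this kind can be sketched as follows; the parameter names and the stand-in metric values are hypothetical and only illustrate the shape of the tuning loop, not VASIM’s implementation.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product

def simulate(params):
    """Stand-in for one simulation run with a given parameter setting."""
    window_minutes, headroom = params
    # ... in a real sweep, run the recommender over the trace here ...
    metrics = {"avg_slack": round(4 * headroom, 2),                       # stand-in numbers
               "avg_insufficient_cpu": round(max(0.3 - headroom, 0.0), 2)}
    return params, metrics

grid = list(product([30, 60, 120], [0.1, 0.2, 0.4]))  # recommendation window x headroom buffer

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:               # tuning jobs run in parallel
        for params, metrics in pool.map(simulate, grid):
            print(params, metrics)
```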

Versatility and adaptability. VASIM provides users with the flexibility to modify components, experiment with recommendation strategies, and evaluate the impact of changes in a controlled and customizable environment. Figure 3 shows the results of a simulation run on the same algorithm and historical performance data but with different parameters. This versatility ensures that infrastructure engineers can tailor the system to meet their needs, enhancing the overall effectiveness of their autoscaling strategies.

Figure 3. These graphs show VASIM running an identical algorithm on the same historical data but with varying parameters, affecting slack, throttling, and the frequency of scaling events. The objective is to maintain a minimal gap between the peak and the lowest resource utilization levels—the top of the bottom line and the bottom of the top line, respectively. The goal is also to reduce the space between the response lag indicated by the trailing edges to the left of the lines. Simultaneously, it’s important to minimize the occurrence of scaling events to prevent disruptions in workload execution.

Optimizing scalability and costs in Kubernetes environments

Our research on vertically autoscaling monolithic applications with a container-as-a-service algorithm helped us better understand the tradeoffs between cost and availability that different algorithm variations introduce. Because VASIM mirrors the standard autoscaling architecture (as in the Kubernetes Vertical Pod Autoscaler [VPA]), it allows us to test autoscaling algorithms for pods, applications, and virtual machine (VM) capacity. This is possible because these systems share similar components, including resource updaters, controllers, and recommenders. Despite differences between specific systems, their underlying architectures are sufficiently similar for VASIM to mimic them effectively, as shown in Figure 4.

 
In the architecture diagram, a Simulation Controller coordinates each run: it queries the Recommender, which reads historical data from the Cloud State Provider, for scaling decisions; applies them through the Simulation Scaler; and passes the results to the Analyzer, which computes metrics after each run. A Params Tuning Controller drives the Simulation Controller across tuning configurations and asks the Analyzer for the Pareto front to expose tradeoffs among multiple goals once all configurations have been evaluated.
Figure 4. VASIM architecture mimics the main components of general autoscaling architectures, allowing users to parametrize those modules to fit their specific needs.
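The following sketch mirrors that shared structure, a controller that periodically asks a recommender for a decision, applies it through a scaler, and scores the outcome; the toy peak-plus-buffer policy and the class names are illustrative rather than VASIM’s.

```python
class Recommender:
    """Toy policy: target the recent peak usage plus a headroom buffer."""
    def __init__(self, window: int = 30, buffer: float = 0.2):
        self.window, self.buffer = window, buffer

    def recommend(self, history: list) -> float:
        return max(history[-self.window:]) * (1 + self.buffer)

class Scaler:
    """In a live system this would resize a pod or VM; in simulation it just returns the new limit."""
    def apply(self, new_limit: float) -> float:
        return new_limit

def simulate(trace: list, lag: int = 10, initial_limit: float = 4.0):
    recommender, scaler = Recommender(), Scaler()
    limit, limits = initial_limit, []
    for t, usage in enumerate(trace):
        if t > 0 and t % lag == 0:      # the controller consults the recommender every `lag` ticks
            limit = scaler.apply(recommender.recommend(trace[:t]))
        limits.append(limit)
    avg_slack = sum(max(l - u, 0) for l, u in zip(limits, trace)) / len(trace)
    return limits, avg_slack

trace = [2.0, 2.5, 3.0, 5.0, 4.5] * 20
_, avg_slack = simulate(trace)
print(f"average slack: {avg_slack:.2f} cores")
```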
 

Implications and looking ahead

Looking forward, we plan to broaden the scope of VASIM’s support beyond just CPUs to include a wide range of resources, such as memory, disk I/O, and network bandwidth. This expansion will provide future users with a comprehensive understanding of system performance and enable them to make more accurate decisions regarding system management and resource optimization. Additionally, a deeper understanding of system performance will help inform proactive optimization strategies focused on maximizing system efficiency and performance.



MatterSim: A deep-learning model for materials under real-world conditions



In the quest for groundbreaking materials crucial to nanoelectronics, energy storage, and healthcare, a critical challenge looms: predicting a material’s properties before it is even created. This is no small feat: candidate materials can combine any of the 118 elements in the periodic table and are synthesized and operated across wide ranges of temperature and pressure. These factors drastically affect atomic interactions within materials, making accurate property prediction and behavior simulation exceedingly demanding.

At Microsoft Research, we developed MatterSim, a deep-learning model for accurate and efficient materials simulation and property prediction across a broad range of elements, temperatures, and pressures, enabling in silico materials design. MatterSim employs deep learning to capture atomic interactions from the fundamental principles of quantum mechanics, across a comprehensive spectrum of elements and conditions: from 0 to 5,000 Kelvin (K), and from standard atmospheric pressure to 10,000,000 atmospheres. In our experiments, MatterSim efficiently handles simulations for a variety of materials, including metals, oxides, sulfides, and halides, in states ranging from crystals to amorphous solids and liquids. Additionally, it offers customization options for intricate prediction tasks by incorporating user-provided data.

The figure’s left panel shows atomic structures of 12 materials spanning metals, oxides, sulfides, halides, and organic molecules; the right panel maps the temperature and pressure ranges over which materials are synthesized and applied.
Figure 1. MatterSim can model materials properties and behaviors under realistic temperature and pressure conditions for wide ranges of applications.

Simulating materials under realistic conditions across the periodic table

MatterSim’s learning foundation is built on large-scale synthetic data, generated through a blend of active learning, generative models, and molecular dynamics simulations. This data generation strategy ensures extensive coverage of the material space, enabling the model to predict energies, atomic forces, and stresses. It serves as a machine-learning force field with accuracy comparable to first-principles predictions. Notably, MatterSim achieves a 10-fold increase in accuracy for material property predictions at finite temperatures and pressures compared with previous state-of-the-art models. Our research demonstrates its proficiency in simulating a vast array of material properties, including thermal, mechanical, and transport properties, and it can even predict phase diagrams.
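For readers who want a feel for how such a machine-learning force field is used, here is a minimal sketch built on the ASE library, with ASE’s built-in EMT potential standing in for the learned model; in practice MatterSim would be plugged in as the calculator, and the exact calculator class is not shown in this post.

```python
from ase import units
from ase.build import bulk
from ase.calculators.emt import EMT
from ase.md.langevin import Langevin

atoms = bulk("Cu", "fcc", a=3.6).repeat((3, 3, 3))   # a small copper crystal
atoms.calc = EMT()   # stand-in potential; MatterSim would supply the learned calculator here

print("energy (eV):", atoms.get_potential_energy())
print("max force (eV/A):", abs(atoms.get_forces()).max())

# A short molecular-dynamics run at finite temperature, the regime MatterSim is designed for.
dyn = Langevin(atoms, timestep=2 * units.fs, temperature_K=500, friction=0.02)
dyn.run(100)
print("energy after MD (eV):", atoms.get_potential_energy())
```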

The figure compares MatterSim with first-principles results for the highest phonon frequency (left) and the free energies of around 50 materials (middle), both showing near-perfect agreement, and shows the predicted phase diagram of MgO across temperature and pressure (right), which closely matches experimental measurements.
Figure 2. MatterSim achieves high accuracy in predicting mechanical properties, vibrational properties, and phase diagrams of materials, comparable to quantum mechanics and experimental measurements. The figure shows the comparison between the predicted properties and the experimentally measured results.

Adapting to complex design tasks

While trained on broad synthetic datasets, MatterSim can also be adapted to specific design requirements by incorporating additional data. The model uses active learning and fine-tuning to customize predictions with high data efficiency. For example, simulating the properties of water, a seemingly straightforward but computationally intensive task, becomes far easier with MatterSim’s adaptive capability: the model matches experimental accuracy with only about 3% of the data that a specialized model trained from scratch would need (roughly 30 times less), and orders of magnitude fewer resources than first-principles methods require.

The figure shows Li2B12H12, a complex material used in solid-state batteries that serves here as a benchmark, and compares the number of training data points needed to reach the same accuracy when training a model from scratch versus customizing MatterSim; the customized model needs only 3% and 10% of the data for the two tasks.
Figure 3. MatterSim achieves high data efficiency, reducing the data required for complex simulation tasks by 90% to 97%.



Bridging the gap between atomistic models and real-world measurements

Translating material properties from atomic structures is a complex task, often too intricate for current statistical methods such as molecular dynamics. MatterSim addresses this by mapping these relationships directly through machine learning. It incorporates custom adaptor modules that refine the model to predict material properties from structural data, eliminating the need for intricate simulations. Benchmarked against MatBench, a renowned material property prediction benchmark, MatterSim demonstrates significant accuracy improvements and outperforms all specialized property-specific models, showcasing its robust capability for direct material property prediction from domain-specific data.

Looking ahead 

As MatterSim research advances, the emphasis is on experimental validation to reinforce its potential role in pivotal sectors, including the design of catalysts for sustainability, energy storage breakthroughs, and nanotechnology advancements. The planned integration of MatterSim with generative AI models and reinforcement learning heralds a new era in the systematic pursuit of novel materials. This synergy is expected to revolutionize the field, streamlining guided creation of materials tailored for diverse applications ranging from semiconductor technologies to biomedical engineering. Such progress promises to expedite material development and bolster sustainable industrial practices, thereby fostering technological advancements that will benefit society. 



LLM profiling guides KV cache optimization


This research paper was presented at the 12th International Conference on Learning Representations (ICLR 2024), the premier conference dedicated to the advancement of deep learning.


Large language models (LLMs) rely on complex internal mechanisms that require more memory than is typically available on standard devices. One such mechanism is the key-value (KV) cache, which stores previously computed data so the model can generate responses quickly without recalculating information it has already processed. This method uses a substantial amount of memory because it keeps a large amount of data readily accessible to improve the model’s speed and efficiency. Consequently, the KV cache can become prohibitively large as task complexity increases, sometimes requiring up to 320 GB for a single operation. To address this, we developed FastGen, a novel method aimed at reducing the memory demands of LLMs.
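A back-of-the-envelope calculation shows why the cache grows so quickly; the model configuration below is illustrative rather than taken from the paper.

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, batch, bytes_per_value=2):
    # Keys and values (hence the factor of 2) are cached per layer, head, token, and sequence;
    # fp16 storage is 2 bytes per value.
    return 2 * layers * heads * head_dim * seq_len * batch * bytes_per_value

# Illustrative settings: a 65B-class decoder serving 32 sequences of 4,096 tokens each.
size = kv_cache_bytes(layers=80, heads=64, head_dim=128, seq_len=4096, batch=32)
print(f"{size / 2**30:.0f} GiB")   # about 320 GiB, the order of magnitude mentioned above
```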

In our paper, “Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs,” presented at ICLR 2024, we describe how FastGen optimizes the way LLMs store and access data, potentially cutting memory use by half while preserving their efficiency. This approach represents a significant step toward making sophisticated AI tools more accessible and affordable for broader applications. We are honored to share that this paper received an Honorable Mention for the Outstanding Paper Award.

Observations of the KV cache

The development of FastGen is underpinned by our observations of how the KV cache functions. We first observed that not all the data in the KV cache is needed for LLMs to complete their tasks, as shown in Figure 1. By giving the KV cache a mechanism to discard unnecessary data, it is possible to significantly cut memory use. For example, some LLM modules don’t require broad context to process input; for these, it is possible to construct a KV cache that evicts less important long-range context, such as earlier sentences or paragraphs. Other LLM modules primarily attend to special tokens, such as punctuation, for which it is possible to create a KV cache that retains only those tokens. Finally, some LLM modules genuinely need all tokens, and for these we can employ the standard KV cache and store everything.
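The sketch below illustrates these three cache policies on a single attention head; it is a simplified illustration of the idea, not FastGen’s implementation.

```python
import numpy as np

def compress_kv(keys, values, token_ids, policy, special_ids=(0, 1, 2), window=128):
    """Drop cache entries for one head according to a policy (keys/values: [seq_len, head_dim])."""
    seq_len = keys.shape[0]
    if policy == "local":        # heads that ignore long-range context keep only recent tokens
        keep = np.arange(seq_len) >= seq_len - window
    elif policy == "special":    # heads that attend mostly to special/punctuation tokens keep only those
        keep = np.isin(token_ids, special_ids)
    else:                        # "full": heads that genuinely need everything keep the whole cache
        keep = np.ones(seq_len, dtype=bool)
    return keys[keep], values[keep]

keys = np.random.randn(4096, 128).astype(np.float32)
values = np.random.randn(4096, 128).astype(np.float32)
token_ids = np.random.randint(0, 32000, size=4096)
for policy in ("local", "special", "full"):
    kept_keys, _ = compress_kv(keys, values, token_ids, policy)
    print(f"{policy:8s} keeps {kept_keys.shape[0] / 4096:.1%} of the cache")
```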

Another key observation in our study is that attention modules in different layers and positions in the LLM behave differently and need different preferences for their KV cache, as shown on the right in Figure 1. 



Figure 1: These graphs depict the different structures of the KV cache. The graph on the left contains common structures. The circle graphs on the right contain compositions of three modules that are in the same layer, but the way they store data is different.

FastGen accounts for the diversity of KV cache structures

Because different KV caches have different structures, they need to be handled differently. We based the FastGen algorithm on these observations, enabling it to categorize and optimize the data stored in a given KV cache. FastGen first analyzes the specific behaviors of different modules to understand their structures, a step called profiling. It then uses the results to adjust how data is stored in real time, making the process more efficient. Our tests show that FastGen can reduce KV cache memory use by 50% without sacrificing quality. Additional experiments, discussed in detail in our paper, confirm that the profiling step is crucial and significantly improves the efficiency of the KV cache.
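In the same spirit, the following sketch profiles one attention head on a prompt and picks the cheapest policy that still captures most of its attention mass; FastGen’s actual profiling is more refined, so treat this as an approximation of the idea.

```python
import numpy as np

def choose_policy(attn, token_ids, special_ids=(0, 1, 2), window=128, coverage=0.95):
    """Pick the cheapest cache policy whose kept positions still capture `coverage` of this head's
    attention mass; attn is a [num_queries, num_keys] matrix recorded while encoding the prompt."""
    total = attn.sum()
    special = np.isin(token_ids, special_ids)
    recent = np.arange(attn.shape[1]) >= attn.shape[1] - window
    if attn[:, special].sum() / total >= coverage:
        return "special"
    if attn[:, special | recent].sum() / total >= coverage:
        return "local"
    return "full"

rng = np.random.default_rng(0)
attn = rng.random((64, 1024))
attn /= attn.sum(axis=1, keepdims=True)      # normalize rows like softmax outputs
token_ids = rng.integers(0, 32000, size=1024)
print(choose_policy(attn, token_ids))        # random attention spreads out, so this prints "full"
```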

The broader picture

Fueled by unprecedented advances in data handling and computational capabilities, LLM pretraining has emerged as a cornerstone of deep learning, transforming natural language processing tasks and continuously challenging our understanding of learning and cognition.

However, greater capabilities can bring challenges. As models scale larger, customizing them for specific tasks can become more resource-intensive. At Microsoft Research, we are exploring different approaches to more efficient model editing. A critical strategy involves targeted model profiling, which identifies essential components of a model that align with predefined goals. This profiling informs precise model modifications, optimizing resource use and effectiveness.

The two research projects we are presenting at ICLR 2024 support these goals. Both adopt the profile-then-edit paradigm to address different problems. FastGen reduces memory consumption. Our related work, Post-hoc Attention Steering for LLMs (PASTA), focuses on better controllability. These approaches are designed to be resource-efficient, as they do not require tuning or back propagation. Looking ahead, our goal is to further develop these techniques to improve the resource-efficiency of LLM applications, making them more accessible to a wider audience.  



LoftQ: Reimagining LLM fine-tuning with smarter initialization


This research paper was presented at the 12th International Conference on Learning Representations (ICLR 2024), the premier conference dedicated to the advancement of deep learning.


Large language models (LLMs) use extensive datasets and advanced algorithms to generate nuanced, context-sensitive content. However, their development requires substantial computational resources. To address this, we developed LoftQ, an innovative technique that streamlines the fine-tuning process—which is used to adapt pre-trained language models to perform well in specialized applications, such as analyzing medical documents. During fine-tuning, the model undergoes additional training on a smaller, task-specific dataset. This results in improved performance, such as more accurate predictions, better understanding of domain-specific language, and more relevant responses in the context of the specialized area.

LoftQ’s strength lies in its ability to combine quantization and adaptive initialization during fine-tuning. Quantization reduces the precision of model parameters, lowering memory and computation needs. This not only accelerates processing but also reduces power consumption. Adaptive initialization closely aligns the model’s parameters to its optimal pre-trained state, preserving its capabilities while minimizing resource use. Our paper, “LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models,” presented at ICLR 2024, details how this method can help make AI technologies more efficient and sustainable. 

How LoftQ works 

LoftQ builds on the principles of LoRA and QLoRA. LoRA is a method that greatly reduces the number of trainable parameters, decreasing the memory required for fine-tuning. QLoRA is a fine-tuning approach that uses 4-bit quantized, frozen weights and low-rank adapters, significantly reducing memory requirements while maintaining high performance. This is illustrated in Table 1, which shows the memory needed to fine-tune an LLM with 7 billion parameters, along with the requirements for LoRA and QLoRA. LoRA achieves a fourfold reduction in memory usage, and QLoRA reduces it by a further factor of two.

Table 1: This table shows the GPU memory usage for a 7-billion parameter LLM with the following configurations: full fine-tuning on the left, LoRA in the middle, and QLoRA on the right.
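The arithmetic behind these savings is easy to sketch. In the snippet below, the adapter rank and the number of adapted matrices are hypothetical, and the estimate covers only weight storage and trainable-parameter counts, so it will not match Table 1 exactly.

```python
params = 7e9                                  # a 7-billion-parameter base model
rank, d_model, n_adapted = 16, 4096, 224      # hypothetical LoRA setup: rank-16 adapters on 224 weight matrices
lora_params = n_adapted * 2 * rank * d_model  # each adapted matrix gets an A and a B factor

print(f"base weights, fp16:    {params * 2 / 2**30:.1f} GiB")
print(f"base weights, 4-bit:   {params * 0.5 / 2**30:.1f} GiB  (QLoRA-style quantized, frozen)")
print(f"trainable LoRA params: {lora_params / 1e6:.0f}M ({lora_params / params:.2%} of the model)")
```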

Unlike LoRA, QLoRA comes with a tradeoff: some quality of the pretrained model is sacrificed due to the quantization of weights. LoftQ recognizes this and optimizes the initialization of the quantized and low-rank adaptation matrices. That is, LoftQ seeks a combination of a quantized matrix and a low-rank matrix whose sum closely approximates the original pretrained weight. This is done for every matrix that will be adapted in the model.

The LoftQ algorithm alternates between two primary steps. First it quantizes (simplifies) the weights, and then it finds the best low-rank factors that approximate the remaining difference between the pretrained weight and the quantized weight. The process repeats for a few iterations. This enables fine-tuning to start from a more effective initial state, preserving accuracy while using lower-precision weights and less computation.
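A minimal NumPy sketch of this alternating initialization is shown below; the uniform quantizer is a stand-in for the NF4 quantization used in practice.

```python
import numpy as np

def quantize_dequantize(w, bits=4):
    """Uniform symmetric fake-quantization; a simplified stand-in for NF4."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.clip(np.round(w / scale), -2 ** (bits - 1), 2 ** (bits - 1) - 1) * scale

def loftq_init(w, rank=16, steps=5, bits=4):
    """Alternate between quantizing the residual and refitting low-rank factors so Q + A @ B ~ W."""
    a = np.zeros((w.shape[0], rank))
    b = np.zeros((rank, w.shape[1]))
    for _ in range(steps):
        q = quantize_dequantize(w - a @ b, bits)             # quantize what the adapters do not yet capture
        u, s, vt = np.linalg.svd(w - q, full_matrices=False)
        a = u[:, :rank] * np.sqrt(s[:rank])                  # rank-r factors of the quantization error
        b = np.sqrt(s[:rank])[:, None] * vt[:rank]
    return q, a, b

w = np.random.default_rng(0).standard_normal((512, 512)) / np.sqrt(512)
q, a, b = loftq_init(w)
print("relative approximation error:", np.linalg.norm(w - (q + a @ b)) / np.linalg.norm(w))
```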

LoftQ requires a one-time setup to simplify and prepare these weights, allowing a fixed portion of the model’s parameters (e.g., 5 percent) to be adjusted. Once established, this configuration can be repeatedly applied as the model transitions between various tasks and settings.

Evaluating LoftQ 

Tests using various types of LLMs, including models with different combinations of encoding and decoding capabilities such as Llama-2, show that models initialized with LoftQ consistently achieve strong performance, often matching or surpassing those configured with QLoRA.

In practical terms, comparing the performance of LoftQ and QLoRA on different tasks using the Llama-2 model family yields distinct results, which are highlighted in Table 2. For the WikiText-2 dataset, which measures the model’s perplexity (lower is better), and the GSM8K dataset, which tests the model’s ability to solve basic math problems (higher is better), we demonstrate the effectiveness of varying degrees of weight simplification—averaging 3, 2.5, and 2.25 bits per weight. Our paper discusses the results in more detail. 

Table 2. This table compares LoftQ and QLoRA during the fine-tuning of two Llama-2 models on the Wikitext-2 and GSM8K datasets.



Implications and looking forward 

LoftQ promises to advance the field of AI by accelerating research and facilitating the creation of cutting-edge tools while supporting sustainable development. While initially focused on LLMs, LoftQ’s flexible design also supports fine-tuning in other types of models, such as those for vision and speech. As our research progresses, we expect further enhancements that will boost performance on downstream tasks. We hope these improvements will lead to broader adoption across various AI applications. We’re excited about the breadth of this technology’s applicability and encourage the AI community to explore its benefits. LoftQ is available as open source through the Hugging Face PEFT library.



Abstracts: May 6, 2024



Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Senior Principal Researcher Michel Galley joins host Gretchen Huizinga to discuss “MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts,” which was accepted at the 2024 International Conference on Learning Representations (ICLR). MathVista, an open-source benchmark, combines new and existing data to measure how good models are at solving a variety of math problems that involve processing images as well as text, helping to gain insight into their reasoning capabilities.

Transcript

[MUSIC]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

My guest today is Dr. Michel Galley, a senior principal researcher at Microsoft Research. Dr. Galley is the coauthor of a paper called “MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts.” Michel, thanks for joining us on Abstracts today!


MICHEL GALLEY: Thank you for having me.

HUIZINGA: So I like to start with a distillation or sort of an elevator pitch of your research. Tell us in just a couple sentences what problem or issue your paper addresses and why we should care about it.

GALLEY: So this paper is about evaluating large foundation models. So it’s a very important part of researching large language models because it’s a good way to evaluate, kind of, the capabilities—what these models are good at and not good at. And a part of the focus of MathVista is to evaluate these large foundation models in a multimodal setup, so when the input to the model is actually not just text but also text and images. And then, an example of a task that such a model would perform is, like, the input is maybe a mathematical question, and then there’s some visual support to that question, let’s say, of an image of a graph, and then the model has to respond to something related to that. And why this is important … there has been a lot of work, of course, on large foundation model. Especially when it comes to reasoning tasks, like mathematical reasoning, a lot has focused more on written form.

HUIZINGA: Yeah …

GALLEY: So MathVista is one of the very first datasets that has input that is both images and text.

HUIZINGA: Yeah, yeah. Well, reading your paper, it seems like this is an area that hasn’t been studied systematically. In fact, you actually say that! And say that the field is largely unexplored. But quickly tell us what has been done in this field, and then tell us how your research addresses the proverbial gap in the literature.

GALLEY: Well, there has been a lot of work on vision and language in other problems, like not just about reasoning. Maybe let me just mention why reasoning is important. So one reason I think it’s very interesting to evaluate these large language models in terms of reasoning skill is that we evaluate their capabilities beyond just memorization. So as many of your listeners probably know, these large foundation models are trained on large amounts of text that is public data from various sources. So when you ask a question to a large foundation model, it could be the case, in many cases, that it just memorizes things it has seen in the data.

HUIZINGA: Sure.

GALLEY: So what makes it interesting in terms of reasoning, the answer oftentimes is not there in the data. So it needs to develop this ability to connect the dots between various pieces of information to come up with a new answer. So the focus of our paper is really on mathematical reasoning, but it goes also a bit beyond that because what is also represented in the data is also science question and so on.

HUIZINGA: Yeah …

GALLEY: So this reasoning part has largely focused, until MathVista, on text-only modalities.

HUIZINGA: Yeah …

GALLEY: So it’s one of our very first ones that combines text and images in terms of evaluating these large foundation models. So you ask about what was done before. So, yes, there has been a lot of work, text only, on reasoning, for example, the mathematical question that’s just based on text. And there has been a different stream of work that was much more focused on vision. A lot of work has been on tasks such as visual question answering …

HUIZINGA: Yeah …

GALLEY: … where basically, you have an image and the question is about answer a question about this image. So, yes, we’re trying to fuse the two lines of research here.

HUIZINGA: Right …

GALLEY: And that’s one of the first works that does that.

HUIZINGA: Yeah. Well, let’s talk about your methodology for a minute. Tell us how you went about conducting this research, and what methods did you use?

GALLEY: Yes, sure. So that’s a bit different from a typical, kind of, machine learning paper because the focus on this work is really on benchmarking on the dataset. So the methodology is more about how we collect the data, process it. So they have two components to doing that. One was to look at existing data that already combines vision and text. And there are existing datasets that are actually already fairly big but that were not focused on reasoning. So we use those existing datasets and look for instances in the data that actually include some mathematical or science reasoning. And so that part is leveraging existing datasets, but the important part is, like, we really want to carve out what was interesting piece in terms of reasoning. And we had different stages of processing the data to identify the subset that was reasoning-based. So one first step was basically to apply some automatic filter to determine whether or not a given example, let’s say something that is visual and text, is actually … involves some mathematical reasoning. So we have different strategy. For example, if the answer is numerical, it’s likely that it might be something mathematically related. But that’s just the first stage. And the second stage, we actually had humans, annotators, just certify that the selected data is actually of high quality. So we do have an example of, “Oh, this is mathematical, and that’s either mathematical or scientific,” and so on. And that’s one part of the effort. The other part is that we realized while we collected the data, there are certain types of mathematical reasoning or related to mathematical reasoning that were not represented in the data. So we created three new datasets as part of MathVista. So when I said dataset, it’s more like, think of MathVista as like an aggregate of different types of data, and we added three of them, three new types of data. One is what you call PaperQA, which is basically data that is collected from scientific papers on arXiv, and that had questions asking about that paper and that included some visual components from the paper, typically a plot or a figure.

HUIZINGA: Yeah …

GALLEY: And then we had IQTest, which is basically, I mean, it’s vaguely related mathematically, but basically it also, kind of, tried to see maybe more abstractive thinking about maybe some input that is both text and visual. And the final is about FunctionQA, that is basically algebraic reasoning and function plots and so on.

HUIZINGA: OK …

GALLEY: The important part was actually to identify among vast amounts of data what is actually very interesting in terms of mathematical reasoning.

HUIZINGA: Yeah …

GALLEY: So that part, I think, was quite a big part of doing that work—finding existing data but also creating new data.

HUIZINGA: Yeah, yeah. Well, my favorite part of a research paper is where it says, “and what we found was … ,” so talk a little bit about your results. What did you find?

GALLEY: So we evaluated a wide variety of models, including GPT-4, Claude 2, GPT-4V, multimodal Bard, and LLaVA, and we categorized them into three categories. So one is text only. So, basically, you take a model that is by default just text, and we give it the text part of the question and ask it to answer the question. Of course, that’s, kind of, a bit of a, it’s a difficult task because oftentimes [LAUGHTER] we crucially build these questions so that you have to rely on the vision part. But that’s for, you know, scientific investigation to know how well they can do, and so that’s one category of model. A different category is still text only but that is given the detection from the image. So on the image, we do OCR. So we convert those words from images to text. It’s kind of an extension of the text-based model, except that what was images is translated into text, and then the input to the model is word only, and that’s a different category of model. And the third one is basically truly multimodal model. And what we found, I mean, not surprisingly, it’s, kind of, the one that was doing most poorly is the one that is text only. The second is text plus OCR. And then finally, the one that does best is the multimodal like GPT-4V. But while the ordering between these three categories makes sense, it was a bit surprising that maybe the gap between multimodal and text plus OCR was not bigger. Well, it’s big, but maybe not as big as we were expecting. So, for example, the best detection from the images model achieved like 35 percent accuracy while GPT-4V was 50 percent. So it’s a substantial gap but not huge.

HUIZINGA: Right. Just to clarify, you’re saying OCR. What does that stand for?

GALLEY: [Optical] character recognition.

HUIZINGA: Gotcha.

GALLEY: So, basically, it’s the task of taking text, sometimes typed, but sometimes written, and convert this into the actual text like you would have in a text file.

HUIZINGA: Right. Michel, does any of this have to do with the difficulty of the math problems that you present these models with? I mean, it seems to me, similar to humans, that the easier the problem, the easier it would be for the machine. So at what level of math are we talking for these tests?

GALLEY: What’s nice about MathVista is there’s continuum [of] different difficulties. So the spectrum is quite broad, going from elementary school to more advanced concepts such as calculus. So it’s quite broad. So in the paper, we do have this, kind of, broken down by level. So the number I gave you, like 50 percent, is an aggregate over all the difficulties. But …

HUIZINGA: Gotcha.

GALLEY: But the goal there was really, kind of, to compare different models, but we do have a fair amount of analysis in the appendix. Actually, we have 100 pages of appendices of plenty of analysis and so on. So if people, I mean …

HUIZINGA: I saw that. I saw the length of the paper, and I’m going, what? [LAUGHS] That’s a LONG paper! Well, research in the lab is one thing, I always like to say, but understanding real-world impact is important, too. So where’s this work going to make the most difference, and who does it help most at this point?

GALLEY: Well, I think perhaps that’s the main point of this kind of line of work in terms of reasoning is that when looking at this difficult problem that are mathematical, actually it’s a way to, kind of, abstract away maybe more complex capabilities, and I think while thinking just about mathematics might seem a bit narrow, I don’t think that really is. It’s more about seeing whether this model has the ability to do, kind of, multistep kind of processing of your input and think maybe somewhat intelligently about a given problem. So we focus mostly on math. There is some science, but we would be very interested, especially in future work, to, kind of, go beyond that.

HUIZINGA: OK, well, let me press in a little bit there because … just say I’m a regular person using a GPT model. Is your work more addressed upstream from that to the research community to say, how do we get these models to be better so that downstream people like me can be more confident of the models?

GALLEY: Yes, I would say at the moment, I mean, this line of work is perhaps more geared towards somewhat more research community, but I think it could be some seed for researchers to think about some applications perhaps that also requires some kind of step-by-step reasoning but perhaps not going beyond math.

HUIZINGA: Yeah. Michel, if there was one thing you wanted our listeners to take away from this research, kind of golden nugget, what would it be?

GALLEY: Well, I would say it’s the challenging part of these datasets. I think that’s what makes MathVista stand out compared to other datasets. By now, there are a few other vision and language datasets, and of course, many that are more text-based. And we’ve seen, for example, some recent papers showing that actually MathVista remains one of the most challenging ones. So I think it’s probably going to stay around for a while because of the difficulty it represents. So it’s open source of available datasets that everybody can use, and I very much encourage people to use it.

HUIZINGA: Is it on GitHub?

GALLEY: Yes, it’s on GitHub.

HUIZINGA: So what’s next on the research agenda for helping LLMs get better at math, Michel? What are the big challenges in the field yet? I mean, you’ve alluded to many of them already, sort of, but what’s next on your research agenda?

GALLEY: Well, I would say what we found so far is these models are very good at processing the textual part of problems it’s given, to the model, but you have the equivalent in images actually harder somehow. So I think a lot more work needs to be done in terms of vision capabilities, in terms of reasoning over images, because the capabilities you will see in text are actually quite advanced, whereas the equivalent in images doesn’t seem that good. I mean, a fair disclaimer: my background is more on the text side, [LAUGHTER] so some of my colleagues on the paper are more on the vision side, so maybe if a listener maybe run into some of our coauthors at the conference, they might want to talk to these vision people because that’s less of my background. [LAUGHS]

HUIZINGA: Well, and if you think about Venn diagrams, you know, you’ve got people that are doing text, people that are doing vision, and then the people that are trying to do both to see how the worlds collide.

[MUSIC]

Well, Michel Galley, thanks for joining us today. And to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/abstracts (opens in new tab), or you can find it on arXiv. You can also read it on the website for the International Conference on Learning Representations, or ICLR. And if you happen to be at the ICLR conference this week, you can hear more about it there. See you next time on Abstracts!

[MUSIC FADES]



Research Focus: Week of April 29, 2024


Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.


Can Large Language Models Transform Natural Language Intent into Formal Method Postconditions?

Informal natural language that describes code functionality, such as code comments or function documentation, may contain substantial information about a program’s intent. However, there is no guarantee that a program’s implementation aligns with its natural language documentation. In the case of a conflict, leveraging information in code-adjacent natural language has the potential to enhance fault localization, debugging, and code trustworthiness. Yet this information is often underutilized because the inherent ambiguity of natural language makes such intent challenging to check programmatically. The “emergent abilities” of large language models (LLMs) have the potential to facilitate the translation of natural language intent into programmatically checkable assertions. However, due to a lack of benchmarks and evaluation metrics, it is unclear whether LLMs can correctly translate informal natural language specifications into formal specifications that match programmer intent, and whether such translation could be useful in practice.

In a new paper, “Can Large Language Models Transform Natural Language Intent into Formal Method Postconditions?”, researchers from Microsoft describe nl2postcond, the problem of leveraging LLMs to transform informal natural language into formal method postconditions, expressed as program assertions. The paper, to be presented at the upcoming ACM International Conference on the Foundations of Software Engineering, introduces and validates metrics to measure and compare different nl2postcond approaches, using the correctness and discriminative power of generated postconditions. The researchers show that nl2postcond via LLMs has the potential to be helpful in practice by demonstrating that LLM-generated specifications can be used to discover historical bugs in real-world projects.
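To make the task concrete, here is the kind of translation involved; the function and the generated postcondition are illustrative examples, not drawn from the paper’s benchmark.

```python
def remove_duplicates(xs: list) -> list:
    """Return the input list with duplicates removed, preserving first-occurrence order."""
    seen, out = set(), []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def postcondition(xs: list, result: list) -> bool:
    # A formal postcondition an LLM might generate from the docstring above.
    return (
        set(result) == set(xs)                       # exactly the input's elements survive
        and len(result) == len(set(result))          # no duplicates remain
        and result == sorted(result, key=xs.index)   # first-occurrence order is preserved
    )

xs = [3, 1, 3, 2, 1]
assert postcondition(xs, remove_duplicates(xs))
```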


Semantically Aligned Question and Code Generation for Automated Insight Generation

People who work with data, like engineers, analysts, and data scientists, often must manually look through data to find valuable insights or write complex scripts to automate exploration of the data. Automated insight generation provides these workers the opportunity to immediately glean insights about their data and identify valuable starting places for writing their exploration scripts. Unfortunately, automated insights produced by LLMs can sometimes generate code that does not correctly correspond (or align) to the insight. In a recent paper: Semantically Aligned Question and Code Generation for Automated Insight Generation, researchers from Microsoft leverage the semantic knowledge of LLMs to generate targeted and insightful questions about data and the corresponding code to answer those questions. Through an empirical study on data from Open-WikiTable, they then show that embeddings can be effectively used for filtering out semantically unaligned pairs of question and code. The research also shows that generating questions and code together yields more interesting and diverse insights about data. 


Explaining CLIP’s performance disparities on data from blind/low vision users

AI-based applications hold the potential to assist people who are blind or low vision (BLV) with everyday visual tasks. However, human assistance is often required, due to the wide variety of assistance needed and varying quality of images available. Recent advances in large multi-modal models (LMMs) could potentially address these challenges, enabling a new era of automated visual assistance. Yet, little work has been done to evaluate how well LMMs perform on data from BLV users.

In a recent paper, “Explaining CLIP’s performance disparities on data from blind/low vision users,” researchers from Microsoft and the World Bank address this issue by assessing CLIP, a widely used LMM with the potential to underpin many assistive technologies. Testing 25 CLIP variants in a zero-shot classification task, they find that disability objects, like guide canes and Braille displays, are recognized significantly less accurately than common objects, like TV remote controls and coffee mugs, in some cases by up to 28 percentage points.
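Zero-shot classification with one such CLIP variant looks roughly like the sketch below, using the Hugging Face transformers API; the image path and the label prompts are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a white cane", "a photo of a TV remote control", "a photo of a coffee mug"]
image = Image.open("photo.jpg")   # placeholder path; BLV-submitted photos are often blurry or oddly framed

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
for label, p in zip(labels, probs):
    print(f"{p:.2f}  {label}")
```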

The researchers analyze the captions in three large-scale datasets commonly used to train models like CLIP and show that BLV-related content (such as guide canes) is rarely mentioned, a likely reason for the large performance gaps. They also show that a few-shot learning approach with as few as five example images of a disability object can improve recognition of that object, potentially mitigating CLIP’s performance disparities for BLV users. They then discuss other possible mitigations.



Closed-Form Bounds for DP-SGD against Record-level Inference 

Privacy of training data is a central consideration when deploying machine learning (ML) models. Models trained with guarantees of differential privacy (DP) provably resist a wide range of attacks. Although it is possible to derive bounds, or safe limits, for specific privacy threats solely from DP guarantees, meaningful bounds require impractically small privacy budgets, which results in a large loss in utility.
 
In a recent paper: Closed-Form Bounds for DP-SGD against Record-level Inference, researchers from Microsoft present a new approach to quantify the privacy of ML models against membership inference (inferring whether a data record is in the training data) and attribute inference (reconstructing partial information about a record) without the indirection through DP. They focus on the popular DP-SGD algorithm, which they model as an information theoretic channel whose inputs are the secrets that an attacker wants to infer (e.g., membership of a data record) and whose outputs are the intermediate model parameters produced by iterative optimization. They obtain closed-form bounds for membership inference that match state-of-the-art techniques but are orders of magnitude faster to compute. They also present the first algorithm to produce data-dependent bounds against attribute inference. Compared to bounds computed indirectly through numerical DP budget accountants, these bounds provide a tighter characterization of the privacy risk of deploying an ML model trained on a specific dataset. This research provides a direct, interpretable, and practical way to evaluate the privacy of trained models against inference threats without sacrificing utility.

Microsoft Research in the news


TIME100 Most Influential People in Health 

TIME | May 2, 2024

Microsoft Research president Peter Lee is included as an innovator on the 2024 TIME100 Health list, TIME’s inaugural list of 100 individuals who most influenced global health this year.


Sanctuary AI Announces Microsoft Collaboration to Accelerate AI Development for General Purpose Robots 

Sanctuary AI | May 1, 2024

Sanctuary AI and Microsoft are collaborating on the development of AI models for general purpose humanoid robots. Sanctuary AI will leverage Microsoft’s Azure cloud resources for their AI workloads.


Tiny but mighty: The Phi-3 small language models with big potential 

Microsoft Source | April 23, 2024

LLMs create exciting opportunities for AI to boost productivity and creativity. But they require significant computing resources. Phi-3 models, which perform better than models twice their size, are now publicly available from Microsoft.


AI Is Unearthing New Drug Candidates, But It Still Needs Human Oversight 

Drug Discovery Online | April 11, 2024

Drug Discovery Online published a contributed article from Junaid Bajwa discussing how recent advancements in AI offer the potential to streamline and optimize drug development in unprecedented ways.


How AI is helping create sustainable farms of the future 

The Grocer | April 16, 2024

Ranveer Chandra authored an essay on how AI is helping create sustainable farms of the future for UK-based trade outlet, The Grocer.


The Future of AI and Mental Health 

Psychiatry Online | April 16, 2024

Psychiatric News published an article featuring Q&A with Jina Suh, highlighting the important considerations for the use of AI technologies among psychiatrists and mental health professionals.


MatterGen’s Breakthroughs: How AI Shapes the Future of Materials Science 

Turing Post | April 19, 2024

Turing Post covered MatterGen in an interview with Tian Xie. Learn more about this impactful generative model for inorganic materials design.


Machine Learning Street Talk interview with Chris Bishop 

Machine Learning Street Talk | April 10, 2024

Chris Bishop joined Dr. Tim Scarfe for a wide-ranging interview on advances in deep learning and AI for science.



Microsoft at ASPLOS 2024: Advancing hardware and software for high-scale, secure, and efficient modern applications



Modern computer systems and applications, with unprecedented scale, complexity, and security needs, require careful co-design and co-evolution of hardware and software. The ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) is the main forum where researchers bridge the gap between architecture, programming languages, and operating systems to advance the state of the art.

ASPLOS 2024 is taking place in San Diego between April 27 and May 1, and Microsoft researchers and collaborators have a strong presence, with members of our team taking on key roles in organizing the event. This includes participation in the program and external review committees and leadership as the program co-chair.

We are pleased to share that eight papers from Microsoft researchers and their collaborators have been accepted to the conference, spanning a broad spectrum of topics. In the field of AI and deep learning, subjects include power and frequency management for GPUs and LLMs, the use of processing-in-memory for deep learning, and instrumentation frameworks. On the infrastructure side, topics include memory safety with CHERI, I/O prefetching in modern storage, and smart oversubscription of burstable virtual machines. This post highlights some of this work.



Paper highlights

Characterizing Power Management Opportunities for LLMs in the Cloud

The rising popularity of LLMs and generative AI has led to an unprecedented demand for GPUs. However, the availability of power is a key limiting factor in expanding a GPU fleet. This paper characterizes the power usage in LLM clusters, examines the power consumption patterns across multiple LLMs, and identifies the differences between inference and training power consumption patterns. This investigation reveals that the average and peak power consumption in inference clusters is not very high, and that there is substantial headroom for power oversubscription. Consequently, the authors propose POLCA: a framework for power oversubscription that is robust, reliable, and readily deployable for GPU clusters. It can deploy 30% more servers in the same GPU clusters for inference tasks, with minimal performance degradation.

PIM-DL: Expanding the Applicability of Commodity DRAM-PIMs for Deep Learning via Algorithm-System Co-Optimization

PIM-DL is the first deep learning framework specifically designed for off-the-shelf processing-in-memory (PIM) systems, capable of offloading most computations in neural networks. Its goal is to overcome the computational limitations of PIM hardware by replacing traditional compute-heavy matrix multiplication operations with lookup tables (LUTs). This substitution enables neural networks to run efficiently on PIM architectures, significantly reducing the need for complex arithmetic operations. PIM-DL demonstrates substantial speed improvements, achieving up to ~37x faster performance than traditional GEMM-based systems and competitive speedups against CPUs and GPUs.
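To make the lookup-table idea concrete, the sketch below shows one generic way to replace a matrix multiplication with table lookups: split the input into sub-vectors, quantize each sub-vector to its nearest centroid, and precompute centroid-versus-weight dot products offline so that inference reduces to indexing and summing. This is a conceptual illustration under simplifying assumptions (random centroids, a single matrix-vector product), not PIM-DL's actual algorithm or API.

```python
import numpy as np

# Generic sketch of LUT-based matrix multiplication (illustrative, not PIM-DL).
rng = np.random.default_rng(0)
d, n_out, n_sub, K = 64, 32, 8, 16          # input dim, output dim, sub-vectors, codebook size
sub_d = d // n_sub

W = rng.standard_normal((d, n_out))
centroids = rng.standard_normal((n_sub, K, sub_d))    # learned offline in practice

# Offline: one K x n_out lookup table per sub-vector (centroid . weight-slice).
luts = np.einsum('skd,sdo->sko', centroids, W.reshape(n_sub, sub_d, n_out))

def lut_matvec(x: np.ndarray) -> np.ndarray:
    """Approximate x @ W using only nearest-centroid lookups and additions."""
    out = np.zeros(n_out)
    for s in range(n_sub):
        chunk = x[s * sub_d:(s + 1) * sub_d]
        code = np.argmin(((centroids[s] - chunk) ** 2).sum(axis=1))  # quantize chunk
        out += luts[s, code]                                         # table lookup + add
    return out

x = rng.standard_normal(d)
approx, exact = lut_matvec(x), x @ W
print(float(np.linalg.norm(approx - exact) / np.linalg.norm(exact)))  # relative error
```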

Cornucopia Reloaded: Load Barriers for CHERI Heap Temporal Safety

Memory safety bugs have persistently plagued software for over 50 years and underpin some 70% of common vulnerabilities and exposures (CVEs) every year. The CHERI capability architecture (opens in new tab) is an emerging technology (opens in new tab) (especially through Arm’s Morello (opens in new tab) and Microsoft’s CHERIoT (opens in new tab) platforms) for spatial memory safety and software compartmentalization. In this paper, the authors demonstrate the viability of object-granularity heap temporal safety built atop CHERI with considerably lower overheads than prior work.

AUDIBLE: A Convolution-Based Resource Allocator for Oversubscribing Burstable Virtual Machines

Burstable virtual machines (BVMs) are a type of virtual machine in the cloud that allows temporary increases in resource allocation. This paper shows how to oversubscribe BVMs. It first studies the characteristics of BVMs on Microsoft Azure and explains why traditional approaches based on using a fixed oversubscription ratio or based on the Central Limit Theorem do not work well for BVMs: they lead to either low utilization or high server capacity violation rates. Based on the lessons learned from the workload study, the authors developed a new approach, called AUDIBLE, using a nonparametric statistical model. This makes the approach lightweight and workload independent. This study shows that AUDIBLE achieves high system utilization while enforcing stringent requirements on server capacity violations.
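The statistical idea can be illustrated with a small sketch: treat each BVM's CPU demand as an empirical (nonparametric) distribution over discrete usage levels, convolve those distributions to obtain the distribution of aggregate demand on a host, and read off the probability that demand exceeds capacity. The code below is a conceptual illustration with made-up numbers; it is not the AUDIBLE implementation.

```python
import numpy as np

# Conceptual sketch of a convolution-based oversubscription check (illustrative only).
# Each VM's CPU demand is a probability vector over integer core counts; convolving
# the vectors gives the distribution of total demand on a host.

def violation_probability(vm_histograms, host_cores: int) -> float:
    """Probability that the summed demand of all VMs exceeds host capacity."""
    total = np.array([1.0])                              # sum starts at 0 cores
    for hist in vm_histograms:
        total = np.convolve(total, np.asarray(hist, dtype=float))
    return float(total[host_cores + 1:].sum())

# Hypothetical example: ten bursty VMs that mostly idle but occasionally
# burst to 2 or 4 cores, packed onto a 16-core host.
vm = [0.6, 0.0, 0.3, 0.0, 0.1]                           # P(demand = 0..4 cores)
print(violation_probability([vm] * 10, host_cores=16))
```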

Complete list of accepted publications by Microsoft researchers

Amanda: Unified Instrumentation Framework for Deep Neural Networks
Yue Guan, Yuxian Qiu, and Jingwen Leng; Fan Yang, Microsoft Research; Shuo Yu, Shanghai Jiao Tong University; Yunxin Liu, Tsinghua University; Yu Feng and Yuhao Zhu, University of Rochester; Lidong Zhou, Microsoft Research; Yun Liang, Peking University; Chen Zhang, Chao Li, and Minyi Guo, Shanghai Jiao Tong University

AUDIBLE: A Convolution-Based Resource Allocator for Oversubscribing Burstable Virtual Machines
Seyedali Jokar Jandaghi and Kaveh Mahdaviani, University of Toronto; Amirhossein Mirhosseini, University of Michigan; Sameh Elnikety, Microsoft Research; Cristiana Amza and Bianca Schroeder, University of Toronto

Characterizing Power Management Opportunities for LLMs in the Cloud
Pratyush Patel, Microsoft Azure and University of Washington; Esha Choukse (opens in new tab), Chaojie Zhang (opens in new tab), and Íñigo Goiri (opens in new tab), Azure Research; Brijesh Warrier (opens in new tab), Nithish Mahalingam, and Ricardo Bianchini (opens in new tab), Microsoft Azure Research

Cornucopia Reloaded: Load Barriers for CHERI Heap Temporal Safety
Nathaniel Wesley Filardo, University of Cambridge and Microsoft Research; Brett F. Gutstein, Jonathan Woodruff, Jessica Clarke, and Peter Rugg, University of Cambridge; Brooks Davis, SRI International; Mark Johnston, University of Cambridge; Robert Norton, Microsoft Research; David Chisnall, SCI Semiconductor; Simon W. Moore, University of Cambridge; Peter G. Neumann, SRI International; Robert N. M. Watson, University of Cambridge

CrossPrefetch: Accelerating I/O Prefetching for Modern Storage
Shaleen Garg and Jian Zhang, Rutgers University; Rekha Pitchumani, Samsung; Manish Parashar, University of Utah; Bing Xie, Microsoft; Sudarsun Kannan, Rutgers University

Kimbap: A Node-Property Map System for Distributed Graph Analytics
Hochan Lee, University of Texas at Austin; Roshan Dathathri, Microsoft Research; Keshav Pingali, University of Texas at Austin

PIM-DL: Expanding the Applicability of Commodity DRAM-PIMs for Deep Learning via Algorithm-System Co-Optimization
Cong Li and Zhe Zhou, Peking University; Yang Wang, Microsoft Research; Fan Yang, Nankai University; Ting Cao and Mao Yang, Microsoft Research; Yun Liang and Guangyu Sun, Peking University

Predict; Don’t React for Enabling Efficient Fine-Grain DVFS in GPUs
Srikant Bharadwaj, Microsoft Research; Shomit Das, Qualcomm; Kaushik Mazumdar and Bradford M. Beckmann, AMD; Stephen Kosonocky, Uhnder

Conference organizers from Microsoft

Program Co-Chair

Madan Musuvathi

Submission Chairs

Jubi Taneja
Olli Saarikivi

Program Committee

Abhinav Jangda (opens in new tab)
Aditya Kanade (opens in new tab)
Ashish Panwar (opens in new tab)
Jacob Nelson (opens in new tab)
Jay Lorch (opens in new tab)
Jilong Xue (opens in new tab)
Paolo Costa (opens in new tab)
Rodrigo Fonseca (opens in new tab)
Shan Lu (opens in new tab)
Suman Nath (opens in new tab)
Tim Harris (opens in new tab)

External Review Committee

Rujia Wang

Career opportunities

Microsoft welcomes talented individuals across various roles at Microsoft Research, Azure Research, and other departments. We are always pushing the boundaries of computer systems to improve the scale, efficiency, and security of all our offerings. You can review our open research-related positions here.

The post Microsoft at ASPLOS 2024: Advancing hardware and software for high-scale, secure, and efficient modern applications appeared first on Microsoft Research.


SIGMA: An open-source mixed-reality system for research on physical task assistance


Blue, purple, pink gradient background with three images: a five item checklist on the left, a sound wave in the middle, and goggles on the right.

Imagine if every time you needed to complete a complex physical task, like building a bicycle, fixing a broken water heater, or cooking risotto for the first time, you had a world-class expert standing over your shoulder and guiding you through the process. In addition to telling you the steps to follow, this expert would also tune the instructions to your skill set, deliver them with the right timing, and adapt to any mistakes, confusions, or distractions that might arise along the way. 

What would it take to build an interactive AI system that could assist you with any task in the physical world, just as a real-time expert would? To begin exploring the core competencies that such a system would require, we developed and released the Situated Interactive Guidance, Monitoring, and Assistance (SIGMA) system, an open-source research platform and testbed prototype (opens in new tab) for studying mixed-reality task assistance. SIGMA provides a basis for researchers to explore, understand, and develop the capabilities required to enable in-stream task assistance in the physical world. 

Left: Stock photo of a man with glasses fixing a bicycle. Middle: Stock photo of a man cooking a meal in a kitchen. Right: Stock photo of a woman fixing the plumbing of a kitchen sink with a wrench while lying on the floor with other tools scattered around.

Recent advances in generative AI and large language, vision, and multimodal models can provide a foundation of open-domain knowledge, inference, and generation capabilities to help enable such open-ended task assistance scenarios. However, building AI systems that collaborate with people in the physical world—including not just mixed-reality task assistants but also interactive robots, smart factory floors, autonomous vehicles, and so on—requires going beyond the ability to generate relevant instructions and content. To be effective, these systems also require physical and social intelligence. 

Physical and social intelligence

For AI systems to fluidly collaborate with people in the physical world, they must continuously perceive and reason multimodally, in stream, about their surrounding environment. This requirement goes beyond just detecting and tracking objects. Effective collaboration in the physical world necessitates an understanding of which objects are relevant for the task at hand, what their possible uses may be, how they relate to each other, what spatial constraints are in play, and how all these aspects evolve over time. 

Just as important as reasoning about the physical environment, these systems also need to reason about people. This reasoning should include not only lower-level inferences about body pose, speech and actions, but also higher-level inferences about cognitive states and the social norms of real-time collaborative behavior. For example, the AI assistant envisioned above would need to consider questions such as: Is the user confused or frustrated? Are they about to make a mistake? What’s their level of expertise? Are they still pursuing the current task, or have they started doing something else in parallel? Is it a good time to interrupt them or provide the next instruction? And so forth.

Situated Interactive Guidance, Monitoring, and Assistance

We developed SIGMA as a platform to investigate these challenges and evaluate progress in developing new solutions.

Left: A person using SIGMA running on a HoloLens 2 to perform a procedural task. Middle: First-person view showing SIGMA’s task-guidance panel and task-specific holograms. Right: 3D visualization of the system’s scene understanding showing the egocentric camera view, depth map, detected objects, gaze, hand and head pose.

SIGMA is an interactive application that currently runs on a HoloLens 2 device and combines a variety of mixed-reality and AI technologies, including large language and vision models, to guide a user through procedural tasks. Tasks are structured as a sequence of steps, which can either be predefined manually in a task library or generated on the fly using a large language model like GPT-4. Throughout the interaction, SIGMA can leverage large language models to answer open-ended questions that a user might have along the way. Additionally, SIGMA can use vision models like Detic and SEEM to detect and track task-relevant objects in the environment and point them out to the user as appropriate. This video (opens in new tab) provides a first-person view of someone using SIGMA to perform a couple of example procedural tasks.

Enabling research at the intersection of AI and mixed reality

SIGMA was designed to serve as a research platform. Our goal in open-sourcing the system is to help other researchers leapfrog the basic engineering challenges of putting together a full-stack interactive application and allow them to directly focus on the interesting research challenges ahead.

Several design choices support these research goals. For example, the system is implemented as a client-server architecture: a lightweight client application runs on the HoloLens 2 device (configured in Research Mode (opens in new tab)), which captures and sends a variety of multimodal data streams—including RGB (red-green-blue), depth, audio, head, hand, and gaze tracking information—live to a more powerful desktop server. The desktop server implements the core functionality of the application and streams back information and commands that tell the client app what to render on the device. This architecture enables researchers to bypass current compute limitations on the headset and creates opportunities for porting the application to other mixed-reality devices. 
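As a rough illustration of this split, the sketch below shows a server loop that receives sensor frames from a lightweight client and replies with rendering commands. It is a generic, hypothetical example in Python: SIGMA itself is built on the .NET-based Platform for Situated Intelligence (psi), and the message format and field names here are invented for illustration.

```python
import json
import socket

# Hypothetical sketch of the client/server split: a headset client streams one
# JSON-encoded sensor frame per line; the server does the heavy reasoning and
# replies with a rendering command per frame. Not SIGMA's actual protocol.

def serve(host: str = "0.0.0.0", port: int = 9000) -> None:
    with socket.create_server((host, port)) as srv:
        conn, _ = srv.accept()
        with conn, conn.makefile("rw") as stream:
            for line in stream:
                frame = json.loads(line)          # e.g. {"step": 2, "gaze": [...], ...}
                # Perception, task tracking, and LLM calls would run here on the server.
                command = {"render": "task_panel", "step": frame.get("step", 0)}
                stream.write(json.dumps(command) + "\n")
                stream.flush()
```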

SIGMA is built on top of Platform for Situated Intelligence (opens in new tab) (also known as psi), an open-source framework that provides the fabric, tools, and components for developing and researching multimodal integrative-AI systems. The underlying psi framework enables fast prototyping and provides a performant streaming and logging infrastructure. The framework provides infrastructure for data replay, enabling data-driven development and tuning at the application level. Finally, Platform for Situated Intelligence Studio provides extensive support for visualization, debugging, tuning and maintenance. 

An animated gif depicting the Platform for Situated Intelligence Studio visualization tool. Various 2D, 3D, and timeline streams are shown over a 10-second clip of a user interacting with SIGMA, such as the egocentric camera view, depth map, head pose, audio, speech recognition results, etc.
Platform for Situated Intelligence Studio is a tool that enables researchers to visualize various data streams collected and debug the application.

SIGMA’s current functionality is relatively simple, but the system provides an important starting point for discovering and exploring research challenges at the intersection of mixed reality and AI. From computer vision to speech recognition, many research problems, especially in perception, can be, and have been, investigated using collected datasets. The recent surge of interest in egocentric data and its associated challenges provides important fuel for advancing the state of the art. Yet numerous problems involving interaction and real-time collaboration only surface in real-time end-to-end systems and are best studied and understood in an interactive context with actual users.

As a testament to Microsoft’s continued commitment to the space, SIGMA provides a research platform and reflects just one part of the company’s work to explore new AI and mixed-reality technologies. Microsoft also offers an enterprise-ready, mixed-reality solution for frontline workers: Dynamics 365 Guides. With Copilot in Dynamics 365 Guides, which is currently being used by customers in private preview, AI and mixed reality together empower frontline workers with step-by-step procedural guidance and relevant information in the flow of work. Dynamics 365 Guides is a richly featured product for enterprise customers, geared toward frontline workers who perform complex tasks. In comparison, SIGMA is an open-source testbed for exploratory research purposes only. 

We hope that SIGMA can provide a solid foundation for researchers to build on. Although the system targets the specific scenario of mixed-reality task assistance, it can help illuminate the challenges of developing social and physical intelligence that arise for any computing systems that are designed to operate in the physical world and interact with people, from virtual agents to physical robots and devices.

If you are interested in learning more and using SIGMA in your own research, check it out at https://aka.ms/psi-sigma (opens in new tab). We are excited to collaborate with and work alongside the open-source research community to make faster progress in this exciting and challenging space. 

Acknowledgements / Contributors

Ishani Chakraborty, Neel Joshi, Ann Paradiso, Mahdi Rad, Nick Saw, Vibhav Vineet, Xin Wang.

Responsible AI considerations

SIGMA was designed as an experimental prototype for research purposes only and is not intended for use in developing commercial applications. The primary use case is as a research tool that enables academic and industry researchers to push the state of the art in procedural task assistance at the intersection of mixed reality and AI. As such, the system has been open-sourced under a research-only license (opens in new tab). Researchers who wish to use SIGMA in their own work should first familiarize themselves with the system, its limitations, and the risks involved in using it in a user-study context, and should undergo a full IRB or ethics board review as appropriate for their institution. Limitations, risks, and additional considerations for using the system are described in a Transparency Note (opens in new tab) available in SIGMA’s open-source repository (opens in new tab).

The post SIGMA: An open-source mixed-reality system for research on physical task assistance appeared first on Microsoft Research.


Ideas: Exploring AI frontiers with Rafah Hosn


Microsoft Research Podcast: Ideas - Rafah Hosn

Behind every emerging technology is a great idea propelling it forward. In the new Microsoft Research Podcast series, Ideas, members of the research community at Microsoft discuss the beliefs that animate their research, the experiences and thinkers that inform it, and the positive human impact it targets. 

In this episode, host Gretchen Huizinga talks with Rafah Hosn, partner, group product manager for AI Frontiers at Microsoft Research. Hosn’s professional experience spans the gamut—from research to product to engineering to research again, the discipline’s uniquely high levels of creativity, curiosity, and intellect drawing her back in. Energized by past technical disruptions she’s experienced, Hosn is on what she describes as her “most exciting adventure” yet, helping to drive scientific advancement in AI and to answer a big question: how far can we push machine intelligence while still delivering technologies people can derive value from? 

Transcript 

[TEASER] 

[MUSIC PLAYS UNDER DIALOGUE] 

RAFAH HOSN: What has changed is that in the old days, we had the luxury of creating something, going and piloting for three months until we know whether it works or not, and then taking one year to productize! That … that, that doesn’t work anymore! Because guess what? In three months, this innovation is, like, topped by four other innovations, be it at Microsoft or elsewhere. So that speed is really shifting the mindset and the spirit of people. 

[TEASER ENDS] 

GRETCHEN HUIZINGA: You’re listening to Ideas, a Microsoft Research Podcast that dives deep into the world of technology research and the profound questions behind the code. I’m Dr. Gretchen Huizinga. In this series, we’ll explore the technologies that are shaping our future and the big ideas that propel them forward. 


[MUSIC FADES] 

My guest today is Rafah Hosn. She’s a partner, group product manager for AI Frontiers at Microsoft Research. I’d call Rafah a sort of organizational conductor, working both with leaders to drive clarity around the mission and with program managers to make sure they have solid operational strategies to execute on it. Rafah has mad skills in bringing research ideas from lab to life, and I’m thrilled to talk to her today. Rafah Hosn, welcome to Ideas. 

RAFAH HOSN: Thank you, Gretchen. Oh, my goodness, I have to live up to this introduction now! [LAUGHTER] 

HUIZINGA: Well, before we talk about research ideas, let’s talk about you and your own sort of “reason for being” in the research world. How would you describe your motivation for working in research and—assuming there was one—what was the “big idea” or animating “what if?” behind what you’re doing today? 

HOSN: Yeah, you know, I don’t know. There are so many big ideas, to be honest! Every day, I wake up and I often tell my husband how lucky, like so totally lucky and humbled, I am to be where I am right now in this moment, like right now when society as we know it is being totally disrupted by this huge leap in AI. And why research? Well, I’ve tried it all, Gretchen! I’ve been in research, I went to product, I did engineering, and I did full circle and came back to research. Because, you know, for me personally, there’s no other environment that I know of, for me, that has this amount of creativity and just infinite curiosity and intellect. So working with people that are asking “what next?” and trying to imagine the next world beyond where AI is today is just … this is the big idea. This is why I’m here. This is why I’m excited to come to work every day. 

HUIZINGA: Yeah. Well … and I want to drill in a little bit just, sort of, personally because sometimes there’s a story, an origin story, if you will, of some pivotal aha moment that you say, oh, that’s fascinating, that’s cool, that’s what I want to do. Anything that piqued your interest way back when you were a kid or, sort of, a pivotal moment in your educational years? 

HOSN: Yeah, you know, so many different things that inspire you along the journey, right. It’s not just one thing, Gretchen. My dad was a doctor. He was my biggest inspiration growing up. And the reason is because he had a lot of depth of knowledge in his domain. And I wanted that. I wanted to have depth of knowledge in a domain. So I went engineering against his advice. He really wanted me to be a doctor. [LAUGHTER] So he was not too happy. But, you know, throughout my education, you know, I was there when smartphones came about, when the internet was a thing. And now, like with generative AI, I feel like I’ve lived through so many disruptions, and every one of those was, “Oh my gosh! Like, I am exactly where I want to be!” So multiple inspirations, and every day, I wake up and there’s new news and I’m saying to myself, “OK, that’s great.” I love it! 

HUIZINGA: What a time to be alive! 

HOSN: It is amazing!

HUIZINGA: Yeah. Well, you recently took on this new role in AI Frontiers at Microsoft Research. And that very word “frontiers” evokes images of unexplored, uncharted territories like the Wild West or for Trekkies, maybe “space: the final frontier.” So what does it mean to you to be working at the frontier of artificial intelligence, and what’s the big idea behind AI Frontiers? 

HOSN: You know, it’s my biggest and most exciting adventure so far! Working under Ece Kamar’s leadership in this AI Frontiers is really trying to push ourselves to think, what’s beyond what there is right now in artificial intelligence? Where can we push more, from a scientific perspective? How do we translate these scientific discoveries into capabilities that people can actually use and derive value from? It’s a big responsibility, as well, because we just don’t want to push the boundaries of AI for the sake of pushing. We want to push it in a safe and responsible way. So it is a big responsibility. 

HUIZINGA: Yeah … 

HOSN: And fundamentally, you know, the unifying big idea in this team is to explore, you know, how far can we push intelligence further into models and encapsulations of those models so that we can, you know, have not just sort of an assistant but really a personal assistant, an agent that can, kind of, do tasks for us, with us, seamlessly across multiple domains? So this is what we’re trying to push for. 

HUIZINGA: Mmm. Rafah, do you feel like you’re at the frontier of artificial intelligence? I mean, what are the emotions that crop up when you are dealing with these things—that you and your teams basically know about but the rest of us don’t?

HOSN: For most days, it’s excitement. Sometimes it’s [LAUGHTER] … it ranges, to be honest. I would say there’s a spectrum of emotions. The dominating one is really just excitement. There’s so much that has happened with GenAI, but I feel like it has opened up so many different paths, as well, for us to explore, and that’s the excitement. And then every time the world accomplishes something, you’re like in astonishment. You’re like, wow, wow. 

HUIZINGA: Yeah … 

HOSN: And then, and then, oh my gosh, what’s next? And so, it’s a range of emotions … 

HUIZINGA: Right … 

HOSN: … but I would say the dominating one is enthusiasm.

HUIZINGA: Yeah. Well, I’ve heard other people on your teams use words like surprise, sometimes even shock … 

HOSN: Yeah, yeah, there are a lot of “wow” factors. Every day, every day, I wake up, I read like my three favorite AI tweets or things like that, and I’m like, “Oh my gosh. I wouldn’t have imagined that this model could do this thing,” so [LAUGHS] … um, but it’s exciting. 

HUIZINGA: We may have to get those accounts in the show notes so that we can follow along with your surprise and amazement in the mornings! 

HOSN: [LAUGHS] Yes! 

HUIZINGA: Well, listen, when we talk about measuring the success of an AI system, we often use the common convention of what we call benchmarks. But I want to zoom out from AI systems for a minute and ask how you might measure the success of an AI lab, which is what you’re working in. What are your benchmarks or key performance indicators—we call them KPIs—for the work going on at AI Frontiers? 

HOSN: Yeah, so I’m going to start by something that may sound surprising maybe to some, but I think it’s the culture first. It’s the culture of endless curiosity, of enthusiasm coupled with a bit of skepticism, to be honest, to ask the questions, the right questions, and this drive to push further. So I would say one KPI of success for me, personally, is, you know, can we maintain this culture of enthusiasm coupled with skepticism so we can ask hard questions and an envelope of enthusiasm and drive for everyone? So that’s one. I would say the other three are … one is around how much can we push scientifically as a community, right? This is a team of people that are getting together with a mission to push the boundaries of our understanding of artificial intelligence. So are we pushing that scientific boundaries? Are we creating insights, not just for the scientific community, but also for Microsoft and the world, so that we know how to derive value from these discoveries, right? At the end of the day, it is awesome to push scientifically. It’s even more awesome if you take this and translate it into something a human being can use … 

HUIZINGA: Yeah … 

HOSN: … or an enterprise can use. And I think … that’s kind of my KPIs of success. Culture first, pushing on the scientific boundaries, creating insights for the scientific community as well as for Microsoft so we can derive value for us as a society, right. 

HUIZINGA: Yeah. Well, continuing on this idea of success, and you’ve alluded to this already in terms of characteristics of curiosity and so on, part of your job, as you put it, was “enabling brilliant minds to find success.” So talk a little bit about the personal qualities of these brilliant minds and how you help them find success.

HOSN: Yeah, you know, everybody I work with brings different aspects of brilliance to the table—every day. So in our community of engineers, PMs, researchers, everybody is present with their ideas and their strengths. And they’re pulling together to push harder and faster on our key priorities. And I find folks working in AI these days, you know, to have a renewed fire. It’s really amazing to see. And I talk a lot about curiosity, but, you know, I cannot emphasize how much this is driving a lot of our community to explore new paths that they hadn’t thought about prior to this GenAI coming along. And so everybody is showing up, present, asking these questions and trying to solve new scenarios, new problems that are emerging. And from my perspective, you know, as you mentioned, I just try to unblock, basically. My team and I are here to [LAUGHTER] … well, two things I would say. First is bring the outside-in perspective. That’s so important because science is amazing, but unless you can derive value from it, it remains an awesome paper and an awesome equation, right. So asking, who can use this? What are the scenarios it could, you know, light up? How can we derive value? So those are the questions that my team and I can contribute to, and we are trying to participate from ideation all the way to basically delivering on key milestones. And that last mile is so important. Like, once you know what you want to do, how do you structure? How do you have an operational strategy that is amenable to these times, which is fast, fast, fast, and faster? So that’s, kind of, what we’re trying to do here. 

HUIZINGA: Yeah, yeah. Well, two things came to my mind in terms of what kinds of people would end up working in this area. And one would be agility, or agile. And that would, to me, represent in a researcher that the person would be able to spin or pivot if something didn’t work out. And the other one is sort of a risk-reward mentality. It’s like, where are you willing to push to get that reward versus what might keep you from even trying? 

HOSN: Yeah, so definitely in this AI Frontiers community, I’m finding a lot of adaptability. So people willing to try, failing fast when they fail, and pivoting. And you have to, nowadays, in this atmosphere that we are living in. And because we have the privilege of working in research—and it’s really an honor and a privilege, and I’m not saying it just lightly—but it is the place where you can take risks, Gretchen. It is the place where failing is totally fine because you’re learning and you’re pivoting in a way that allows you to progress on the next thing you tackle. So I feel like most of the people I work with in this community, AI Frontiers, we are risk takers. We want to push, and it’s OK to fail, and it’s OK to adapt. So, I think, as an aggregate, that’s kind of the spirit I’m seeing. 

HUIZINGA: In the past, Rafah, you’ve stressed the importance of both teams and timing. And so we’ve been talking about the teams and the minds and the kinds of qualities in those people. But what about the “when” of research? How does timing impact what gets done in your world?

HOSN: Well, in this new era, Gretchen, everything is yesterday! [LAUGHS] I mean, it is true. AI research is moving at such speeds that I feel like we need to get accustomed to a timing of now. And if it’s not now, it’s yesterday. So the timing is important, but the leeway has shrunk so significantly that I feel like we have to really just be present in the moment and just move as fast as we can because everybody else is moving at the highest speed. So timing is “now,” is what I would say. 

HUIZINGA: On that note, with so many innovations in AI coming out every day, every minute, what you’ve just expressed is that research horizons are shorter than ever. But as one of your team members noted in a recent panel, it still takes a lot of time to translate a research artifact, maybe a noteworthy finding or a published paper or an equation, an algorithm, into a useful product for humans. So how are you then dealing with these newly compressed timelines of “it needs to be done yesterday to keep up,” and how has the traditional research-to-product pipeline changed? 

HOSN: Yeah, it’s an awesome question. It is so true that going from research to a production-quality algorithm or capability takes time. But what I’m seeing is that the research-to-capabilities is accelerating, meaning if you look at the world today in generative AI and its surrounding, folks even in research are creating assets as they are creating their research. And so they are thinking as well, how do I showcase this? And of course, these assets are not production ready. But here’s the kicker. I think that the product teams are also adapting to this generative AI era, and they are changing to meet this disruptive moment. They are changing the way they think, and they are accelerating the way they productize and look at hardening and securing the assets so that they can put them in the hands of even a limited set of users just to get a feel of what it means to have them in the hands of end users and quickly iterating so that they can further harden and further improve the design until it’s production ready. And I feel like our product partners are meeting the moments, meaning they also are really adapting their processes such that they can get these assets and put them in the hands of users and test them out before they actually release them. 

HUIZINGA: Right. Let’s drill in a little bit more on that and talk about the traditional research-to-product pipeline, where you would have a researcher working on something and then an RSDE. What does RSDE stand for? 

HOSN: A research software development engineer. It’s a mouthful. 

HUIZINGA: Right. And then to the PM, or program manager, and then to the engineer. And you’ve said this provocative statement: now everyone is a PM! 

HOSN: Everyone is a PM! [LAUGHTER] 

HUIZINGA: What do you mean by that?

HOSN: I just, I just feel like if we are to meet the moment, we need to be thinking outside-in, inside-out simultaneously. And I believe that the spirit of program management, which is looking at the design from a user-centric perspective, is embedded as we are ideating, as we are trying to explore new methodologies, new algorithms, new assets. And so what has changed is that in the old days, we had the luxury of creating something, going and piloting for three months until we know whether it works or not, and then taking one year to productize! That … that, that doesn’t work anymore. [LAUGHTER] 

HUIZINGA: Right. 

HOSN: Because guess what? In three months, this innovation is, like, topped by four other innovations, be it at Microsoft or elsewhere. So that speed is really shifting the mindset and the, and the spirit of people. I have colleagues and friends, researchers, that are asking me, oh, scenarios, users … I mean it’s amazing to see. So, yes, everybody has gotten a little PM in them now. [LAUGHTER] 

HUIZINGA: Yeah, I did a podcast with Shamsi Iqbal and Jina Suh. And Shamsi was talking about this concept, this old concept, of the researcher being in their lab and saying, well, I’ve done this work; now go see what you want to do with it. I don’t think you have that affordance anymore as a researcher. 

HOSN: No … 

HUIZINGA: You’ve got to work much more tightly with other team members and think like a PM. 

HOSN: Totally. 

HUIZINGA: So let’s talk about how the big general idea behind AI Frontiers is giving birth to smaller, more specific ideas. What are some of the research directions and projects that you could tell us about that illustrate this vision here? 

HOSN: Yeah, and I’m sure you’ve heard some of it come from Ece Kamar as she spoke on this community that we have. In AI Frontiers, we’re exploring, I would say, three major areas of research. And I want you to imagine a stack. At the bottom of the stack, we’re asking ourselves questions around, what are some new architectures we can be thinking about for these foundational models? How do we create them? What kind of data we need to train them, to pre-train them. And then on top of that stack, which starts with a foundation model, we’re asking ourselves, OK great, you have a pretrained model. In a lot of cases, when you’re creating especially small models, you need to fine-tune them. So what is this methodology and data generation pipeline that we’re going to use to fine-tune these models and specialize them for both across domains and across skill set? And on top of that—so now we’re on the third layer—we have a final layer that encapsulates these models and orchestrates among them to allow them the ability to do, you know, complex tasks. And we don’t want to stop there because for us it’s … we don’t want to have an agent that just does things and doesn’t learn. So that learnability, that learning on the job, like we do as humans, is something we’re asking ourselves, as well. Like, how do we encapsulate these models? We orchestrate among them. And we allow these encapsulated things, we call them agents, to learn on the job so that they can accomplish more complex tasks. So those are the three things. And then cutting across these three layers, imagine there’s a thing that cuts across them, is doing everything in a way that allows us to rigorously evaluate and to ensure that we’re doing things in a safe and responsible way. So those are the main things that we’re working on. Does that make sense? 

HUIZINGA: That’s … yes, it does. And I imagine, you know, if you go to the website and you see those, kind of, three main areas, I imagine that even under there, there are specific projects on, you know, how then do we iterate? How then do we explore? 

HOSN: That’s right. That’s a good plug for people to visit the AI Frontiers website! Thank you, Gretchen! [LAUGHS] 

HUIZINGA: Well, I’ve been intrigued for a while by this idea of what you’ve called bi-directional enrichment, which represents both how research informs product but also how product informs research, but you’ve recently talked about how this idea has expanded to embrace what you call multi-directional enrichment and co-innovation. So what do you mean by that, and what does it look like for you? 

HOSN: So we talked just moments ago how the time has shrunk tremendously in artificial intelligence and the speed at which innovations are coming out. So what does that mean when you are sitting in research and you’re trying to derive value for Microsoft, for example? It means that now, rather than going on a journey to try out you know different things, what you want is for product to come on a co-innovation journey with you. And not every team has the capability or the time or the resources to do it. But sometimes product teams have applied scientists that are asking themselves very similar questions. And so now we have this huge synergistic effect by which, you know, researchers can come and explore their research but anchor them in a real-world scenario that the product team is, you know, asking themselves about. And that’s what I mean by co-innovation. And we look for co-innovation, so these are product teams or applied scientists in product teams that are not looking at something I can ship tomorrow. Because that’s not … that’s not frontiers. That’s feature-function that they can deliver right now to their customers. When we co-innovate, we have to co-innovate on a bit of a longer timespan. Now it’s no longer years, right? With generative AI, everything is months, but nonetheless, this is not next week. This is in a few months. And so … but this is really, really great because, again, I keep saying this and I have maybe a huge bias, but I do believe that research, without it being anchored in real-world scenario, just doesn’t have the same effect. So I have a bias for that. It’s my PM hat, what can I say? I love real-world scenarios! [LAUGHTER] 

HUIZINGA: What you just referred to is an interesting flow. I’ve noticed in my years doing this podcast that some people that started in research ended up over in product—and we’ll call them embedded researchers, if you will—and then some people that were in a product scenario come back over to research. And so, there’s this flow, multi-directional, bi-directional, and also where they’re placed within the company. How do you see that flow and the value of that flow between these organizations? 

HOSN: Yeah, you know, like, I think that the flow is important because that’s how cross-pollination happens. And you talked about brilliant minds. In product teams, there are brilliant minds, as well, right. And although their focus area is more around the product they live and breathe every day, this is enriching to researchers and continues to be enriching because when you deploy research capabilities in a real-world setting, there are surprising new research questions that come up, not just engineering. A lot of times people think of research, OK, yeah, you scale it, you harden it, you secure it, and it’s good to go. But that’s not always the case. In a lot of cases, because of the interactivity that happens with real-world scenarios, it opened up brand-new paths for research. And so I think that flow continues to happen even now. It’s just compressed. It’s just that researchers are no longer thinking six years. Researchers are thinking three months. Like, what am I going to do in three months? Because in three months, there will be a hundred other researchers that are coming up with innovation on the same question. So I think the flow still exists. I think that time has shrunk. And I think the mobility from researchers and research going to product and vice versa is enriching for the people that do it because you gain different perspectives. 

HUIZINGA: Well, and let’s push in even there a little bit. Researchers like everyone else can get comfortable looking at things through a particular lens. I would say that’s a human trait, not just a research trait … 

HOSN: Absolutely. 

HUIZINGA: … until a disruption challenges their status quo. So you’ve talked about LLMs, which we’ve called large language models, as being a good forcing function for researchers to think differently, even about the questions they’re asking. Can you elaborate on that a little bit? 

HOSN: Yeah, yeah, so, you know, the large language models and this disruption that we are living in at the moment is lighting fire underneath a lot of people’s intellect, I’m going to say. And so I think that people have to adapt quickly to change. And this is key. Adaptability, I believe, is just a key ingredient in doing research nowadays. Why? Because a lot of people are thinking directionally the same. And so, you know, if you’re not the first, you’re going to have to adapt to what came out. And then you have to think of, how do I differentiate? So the second point I would say is differentiation. And this mindset of, you know, how do I adapt to what just came out? How do I differentiate? And then—Rafah’s bias—how do I anchor in real-world scenario? This is the home run. And I would say you package all of this and focus, focus, focus … and you get a gold mine. 

HUIZINGA: I’m hearing “yes, and …” in this response in the sense of not everyone’s going to be first, but then, what else? This is back to the frontiers. It’s like, how do I differentiate? Yes, that’s awesome. And we’ve got this … 

HOSN: Exactly. And how do I build on what has just been discovered and give it a little bit of an edge or push it a little further or take it in a brand-new direction? I mean, so many different possibilities, but it does take adaptability, like a flexibility in the mindset, I would say. 

HUIZINGA: Yeah. Well, let’s go back to what you alluded to earlier, this idea of responsible AI. This is a big deal at Microsoft. And researchers are very thoughtful about the question of what could possibly go wrong if we got everything right. But how does that translate practically, and what concrete steps are you taking at what I’ll call the “frontier of responsibility?” 

HOSN: Yeah, and as I mentioned, you know, being at the frontiers is amazing. It also holds a big responsibility. We have so many different, I would say, checks and balances that we use, in model training and fine-tuning, to ensure that we are on top of all the regulatory, the policymaker suggestions, and we are abiding by Microsoft values first and foremost and responsibility in creating these innovations. So practically and tactically, what happens is that there are processes for how you actually even release any type of model. And this is just research. And when it goes to product, they have their own compliance, you know, a stricter even compliance, I would say, process that they go through. So we try, and I try particularly, to partner with our privacy champions, with our legal champions, with our people that are looking at this from a responsible AI perspective, so that we bring them in early on, and we say, hey, we’re thinking of doing this. And they tell us, well, you know, if you’re thinking about it this way, you might want to consider this. So we’re trying to bring them in as early as possible so that also we don’t go all the way and then we discover we did something wrong, so we have to backtrack. So I would say, you know, having these partners and colleagues come in early in the game just saves everybody a lot of time. And all this responsible AI for us, it’s ingrained with how we work, meaning we bring our champions early on and then we have them advise us as we move along the journey to create these innovations. So by the time we’re done, we know we’re good, right. And even by the time we’re done, we recheck everything, we run a lot of evaluation benchmarks, and, you know, we do the right thing per policies at Microsoft. So we take it very, very seriously. 

HUIZINGA: Well, let’s go back to this idea of research horizons for a second and anchor it in the way that we approach research. So many ideas are basically iterative steps on existing work, and they make a lot of sense … this is the next step … but then there are those out-of-the-box ideas that feel like maybe bigger swings—some might even call them outrageous—and in organizations like Microsoft Research, they might get the green light, too. Where do you find this idea of the outrageous or maybe longer-term idea finding a home or a place in an organization like Microsoft Research, and have you ever worked on something that felt outrageous to you? 

HOSN: Umm, you know, we like outrageous! That’s why we’re in research, right? So outrageous is good. I haven’t, to be honest, worked on an outrageous, but I am confident I will be. So … [LAUGHTER] I just have this belief that in AI Frontiers, we are going to have outrageous ideas, and we’re going to work on them, and we’re going to make bets that basically are hard to make in other parts of the company because we have the privilege of taking them and pursuing them. And, yes, they may fail, but if we have a breakthrough, it will be a significant breakthrough. So, so I think that outrageous is good. We need to think big. We need to take big leaps, big ideas. We also need to know how to fail gracefully and pivot fast! 

HUIZINGA: Hmmm. Mmm. You know, it strikes me, and I’m laughing to myself, it strikes me, even as we’re talking, that the idea that you work in AI Frontiers, that’s outrageous to most people and, and it’s normal to you. So maybe this idea of, “I haven’t worked on anything outrageous” is like, no, you live in outrageous, it just doesn’t seem like it! [LAUGHTER] 

HOSN: Maybe. It’s my day-to-day job, so, yes, I guess you’re right. 

HUIZINGA: Right. I mean, yeah, you say, we love outrageous, and that’s where it is right now. Every day that I follow, sort of, AI Twitter also and find myself going, seriously? That happened yesterday? What next? 

HOSN: Yeah, in two hours, there’ll be yet another thing. So, yeah, I guess I am living in outrageous, and I love it! It’s amazing! [LAUGHS] 

HUIZINGA: Yeah, maybe the idea of outrageous is just changed. 

HOSN: You know, you’re so right. I think that it’s become the norm. And it is, once we anchor in generative AI, and we push further on this idea, maybe we will go back in a cycle where outrageous is outrageous, but today it’s our life. It’s where we live. It’s what we breathe every day. So it’s become a norm. 

HUIZINGA: Yeah. Well, as we close, Rafah, I want to ask a question anchored on the big idea behind AI Frontiers. What do you believe might be true in say 10 to 15 years, and what should we be doing about it now? In other words, how does what we believe about the future influence how we conceptualize and execute on ideas today? 

HOSN: Yeah, you know, it’s … I can’t even predict what I’m going to be doing tomorrow! But … [LAUGHTER] here’s, here’s what I think. I think that we are truly approaching a moment in human history where a lot of unsurmountable problems, like very hard-to-tackle diseases that have been so hard, I think we are approaching a moment, you know, soon, I hope it’s even sooner than 10 years, where generative AI and innovations on top of it could lead to a lot of resolution for things that today … that cause unsurmountable pain and suffering. I’m very hopeful that with what we are creating that we can, you know, take inefficiencies out of so many different things that we see today that take time so that we liberate ourselves to think about the “what next” societally, right? I think what we need to be doing right now, to be honest, to influence the future is think about our curricula. What are we going to teach our kids? What are they going to work in? This is where I’m hoping that we pour some of our creativity, education system. How are we preparing the next generation? What are the paths that we are going to forge for them, knowing what we know today, knowing what this technology can bring forth? So my hope is that we put some brain power into that. 

HUIZINGA: Rafah Hosn, it’s always a pleasure to talk to you. A sincere pleasure, a delight. Thanks for joining us today on Ideas. 

[MUSIC PLAYS] 

HOSN: Thank you so much for having me, Gretchen. 

[MUSIC FADES]

The post Ideas: Exploring AI frontiers with Rafah Hosn appeared first on Microsoft Research.


SAMMO: A general-purpose framework for prompt optimization


SAMMO optimizer diagram showing progression from starting prompt to optimized prompt.

Large language models (LLMs) have revolutionized a wide range of tasks and applications that previously relied on manually crafted machine learning (ML) solutions, streamlining them through automation. However, despite these advances, a notable challenge persists: the need for extensive prompt engineering to adapt these models to new tasks. New generations of language models, such as GPT-4 and Mixtral 8x7B, can process much longer input texts, making it possible to provide richer context and more detailed instructions in a prompt. A common technique that takes advantage of this enhanced capacity is Retrieval-Augmented Generation (RAG), which dynamically incorporates information into the prompt based on the specific input example. This process is illustrated in Figure 1, which shows a RAG prompt designed to translate user queries into a domain-specific language (DSL), a task also known as semantic parsing. 

A table showing an example metaprompt for a semantic parsing task. The underlying metaprompt consists of three larger parts, each of which comes with a variety of aspects that can be optimized. For example, the input example can be rendered using different formats, the few shot example included can be retrieved using various similarity functions, or the task description can be paraphrased.
Figure 1: A RAG prompt is used for a semantic parsing task. The underlying prompt consists of three larger parts, each with a variety of aspects that can be optimized.

The example in Figure 1 combines three distinct structures to construct the final prompt. The first structure, the task description, remains static and independent of the input; it is the part that conventional prompt optimization techniques typically target. RAG, however, contains two input-specific structures: the example retriever and the input text itself. These introduce numerous optimization opportunities that go beyond the scope of most traditional approaches. Although prompt optimization has been studied before, the shift toward more complex prompt structures has rendered many older strategies ineffective in this new context. 
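To ground this, the sketch below assembles a three-part RAG prompt of the kind shown in Figure 1: a static task description, input-dependent few-shot examples chosen by a retriever, and the rendered input query. The retrieval rule (word overlap), the GeoQuery-style DSL strings, and all function names are hypothetical simplifications for illustration; they are not SAMMO's API or the setup used in the paper.

```python
# Illustrative assembly of a three-part RAG prompt (hypothetical names and data).

def _words(s: str) -> set:
    return {w.strip("?.,!") for w in s.lower().split()}

def retrieve_examples(query: str, example_bank: list, k: int = 2) -> list:
    """Pick the k stored examples with the largest word overlap with the query."""
    return sorted(example_bank,
                  key=lambda ex: len(_words(query) & _words(ex["question"])),
                  reverse=True)[:k]

def build_rag_prompt(task_description: str, query: str, example_bank: list) -> str:
    shots = retrieve_examples(query, example_bank)
    rendered_shots = "\n".join(f"Q: {ex['question']}\nDSL: {ex['dsl']}" for ex in shots)
    return f"{task_description}\n\n{rendered_shots}\n\nQ: {query}\nDSL:"

example_bank = [
    {"question": "What is the capital of France?", "dsl": "capital(france)"},
    {"question": "List the rivers in Texas.", "dsl": "rivers(state(texas))"},
    {"question": "What states border Ohio?", "dsl": "borders(state(ohio))"},
]
print(build_rag_prompt("Translate the question into the geography DSL.",
                       "Which states border Texas?", example_bank))
```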

SAMMO: A prompt optimization approach 

To address these challenges, we developed the Structure-Aware Multi-objective Metaprompt Optimization (SAMMO) framework. SAMMO is a new open-source tool that streamlines the optimization of prompts, particularly those that combine different types of structural information like in the RAG example above. It can make structural changes, such as removing entire components or replacing them with different ones. These features enable AI practitioners and researchers to efficiently refine their prompts with little manual effort.

Central to SAMMO’s innovation is its approach to treating prompts not just as static text inputs but as dynamic, programmable entities—metaprompts. SAMMO represents these metaprompts as function graphs, where individual components and substructures can be modified to optimize performance, similar to the optimization process that occurs during traditional program compilation.
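One way to picture a metaprompt as a function graph is sketched below: each node is a prompt component that renders itself (and its children) to text, and structural mutation operators act on nodes rather than on raw strings. The class names, operators, and rendering scheme here are invented for illustration and do not reflect SAMMO's actual classes or API.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List

# Conceptual sketch: a metaprompt as a tree of components, plus one structural
# mutation operator. Illustrative only; not SAMMO's actual data structures.

@dataclass
class Section:
    name: str
    render_fn: Callable[[dict, str], str]
    children: List["Section"] = field(default_factory=list)

    def render(self, inputs: dict) -> str:
        inner = "\n".join(child.render(inputs) for child in self.children)
        return self.render_fn(inputs, inner)

def drop_random_child(node: Section) -> Section:
    """Structural mutation: delete one child component, if any exist."""
    if node.children:
        node.children.pop(random.randrange(len(node.children)))
    return node

task = Section("task", lambda i, inner: "Translate the question into the DSL.\n" + inner)
shots = Section("fewshot", lambda i, inner: "Q: example question\nDSL: example(answer)")
query = Section("input", lambda i, inner: f"Q: {i['question']}\nDSL:")
task.children = [shots, query]

print(task.render({"question": "Which states border Texas?"}))
# A search procedure would apply operators like drop_random_child to propose
# candidate structures and keep the ones that score best.
```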

The following key features contribute to SAMMO’s effectiveness:

Structured optimization: Unlike current methods that focus on text-level changes, SAMMO focuses on optimizing the structure of metaprompts. This granular approach facilitates precise modifications and enables the straightforward integration of domain knowledge, for instance, through rewrite operations targeting specific stylistic objectives. 
 
Multi-objective search: SAMMO’s flexibility enables it to simultaneously address multiple objectives, such as improving accuracy and computational efficiency; a minimal sketch of this idea appears after this list. Our paper illustrates how SAMMO can be used to compress prompts without compromising their accuracy.

General purpose application: SAMMO has proven to deliver significant performance improvements across a variety of tasks, including instruction tuning, RAG, and prompt compression.
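A tiny illustration of the multi-objective idea referenced above: score each candidate prompt on both accuracy and length (a proxy for cost), then keep only the candidates that no other candidate beats on both dimensions. The candidate names and numbers are made up for illustration; SAMMO's actual search is more sophisticated.

```python
# Pareto-filtering sketch for two objectives: higher accuracy, fewer tokens.
# All candidate data is hypothetical.

candidates = [
    {"name": "full prompt",      "accuracy": 0.82, "tokens": 1200},
    {"name": "no task examples", "accuracy": 0.78, "tokens": 400},
    {"name": "compressed",       "accuracy": 0.81, "tokens": 600},
    {"name": "minimal",          "accuracy": 0.58, "tokens": 500},
]

def pareto_front(cands):
    """Keep candidates that no other candidate dominates on both objectives."""
    def dominated(c):
        return any(o["accuracy"] >= c["accuracy"] and o["tokens"] <= c["tokens"]
                   and (o["accuracy"] > c["accuracy"] or o["tokens"] < c["tokens"])
                   for o in cands)
    return [c for c in cands if not dominated(c)]

print([c["name"] for c in pareto_front(candidates)])
# -> ['full prompt', 'no task examples', 'compressed']
```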



Exploring SAMMO’s impact through use cases 

Use case 1: RAG optimization 

A common application of LLMs involves translating natural user queries into domain-specific language (DSL) constructions, often to communicate with external APIs. For example, Figure 1 shows how an LLM can be used to map user queries about geography facts to a custom DSL.

In a realistic RAG scenario, SAMMO demonstrates significant performance improvements. To demonstrate this, we conducted experiments across three semantic parsing datasets of varying complexity: GeoQuery, SMCalFlow, and Overnight. Given the often limited availability of data in practical settings, we trained and tested the model on a subsampled dataset (training and retrieval set n=600, test set n=100). We compared SAMMO against a manually designed competitive baseline, using enumerative search within a search space of 24 configurations. This included variations in data formats, the number of few-shot examples, and DSL specifications.  
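For reference, an enumerative baseline of the kind described above can be sketched as a simple grid search. The particular factor values below are hypothetical (chosen so the grid has 24 points, matching the size mentioned in the text), and the scoring function is a toy stand-in for building each prompt variant and measuring its accuracy on a held-out set.

```python
from itertools import product

# Hypothetical 3 x 4 x 2 = 24-point configuration grid for an enumerative baseline.
data_formats = ["plain", "json", "xml"]
num_fewshot = [0, 3, 5, 10]
dsl_specs = ["terse", "verbose"]

def score(config) -> float:
    """Toy stand-in for: build the prompt with `config`, run it, measure accuracy."""
    fmt, k, spec = config
    return (0.50 + 0.02 * k
            + (0.03 if fmt == "json" else 0.0)
            - (0.05 if spec == "verbose" else 0.0))

def enumerative_search():
    grid = list(product(data_formats, num_fewshot, dsl_specs))
    best = max(grid, key=score)
    return best, score(best), len(grid)

print(enumerative_search())   # best config, its toy score, and the grid size (24)
```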

Evaluation 

As illustrated in Figure 2, SAMMO improved accuracy across different datasets and backend LLMs in almost all cases, with the most notable gains observed in older-generation models. However, even with newer models like GPT-4, SAMMO facilitated accuracy improvements exceeding 100 percent.

A series of four bar charts showing the performance of SAMMO on semantic parsing tasks. SAMMO achieves substantial improvements for most backend models and datasets.
Figure 2: For semantic parsing with RAG, SAMMO achieves substantial improvements across most backend models and datasets. 

Use case 2: Instruction tuning 

Instruction tuning addresses the optimization of the static instructions given to LLMs that specify the goal and constraints of a task. To show that SAMMO extends beyond many previous prompt tuning methods, we applied it to this conventional setting.

To align with previous research, we used eight zero-shot BigBench classification tasks where the baseline prompt for GPT-3.5 achieved an accuracy of less than 0.9. We compared SAMMO against Automatic Prompt Optimization (APO) and GrIPS, using the open-source models Mixtral 8x7B and Llama-2 70B, alongside GPT-3.5, as backend LLMs. We did not include GPT-4 due to the minimal improvement potential identified in pilot experiments. The results, shown in Figure 3, demonstrate that SAMMO outperformed all baselines regardless of the backend model, proving its effectiveness with even more complex metaprompts.

A series of three bar charts comparing the accuracy of different methods on instruction tuning. SAMMO matches or exceeds the performance of competing methods for instruction tuning on classification tasks.
Figure 3: SAMMO does at least as well as older methods for instruction tuning on simpler tasks.

Implications and looking forward

SAMMO introduces a new and flexible approach to optimize prompts for specific requirements. Its design works with any LLM, and it features versatile components and operators suitable for a broad range of applications.

We are excited to integrate and apply SAMMO to the components and pipelines behind AI-powered assistant technologies. We also hope to establish a user-driven community centered around SAMMO, where people can exchange best practices and patterns, and encourage the expansion of the existing set of search operators.

The post SAMMO: A general-purpose framework for prompt optimization appeared first on Microsoft Research.
