Abstracts: May 6, 2024

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Senior Principal Researcher Michel Galley joins host Gretchen Huizinga to discuss “MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts,” which was accepted at the 2024 International Conference on Learning Representations (ICLR). MathVista, an open-source benchmark, combines new and existing data to measure how good models are at solving a variety of math problems that involve processing images as well as text, helping to gain insight into their reasoning capabilities.

Transcript

[MUSIC]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

My guest today is Dr. Michel Galley, a senior principal researcher at Microsoft Research. Dr. Galley is the coauthor of a paper called “MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts.” Michel, thanks for joining us on Abstracts today!


MICHEL GALLEY: Thank you for having me.

HUIZINGA: So I like to start with a distillation or sort of an elevator pitch of your research. Tell us in just a couple sentences what problem or issue your paper addresses and why we should care about it.

GALLEY: So this paper is about evaluating large foundation models. So it’s a very important part of researching large language models because it’s a good way to evaluate, kind of, the capabilities—what these models are good at and not good at. And a part of the focus of MathVista is to evaluate these large foundation models in a multimodal setup, so when the input to the model is actually not just text but also text and images. And then, an example of a task that such a model would perform is, like, the input is maybe a mathematical question, and then there’s some visual support to that question, let’s say an image of a graph, and then the model has to respond to something related to that. And why this is important … there has been a lot of work, of course, on large foundation models. Especially when it comes to reasoning tasks, like mathematical reasoning, a lot of it has focused on the written form.

HUIZINGA: Yeah …

GALLEY: So MathVista is one of the very first datasets that has input that is both images and text.

HUIZINGA: Yeah, yeah. Well, reading your paper, it seems like this is an area that hasn’t been studied systematically. In fact, you actually say that! And say that the field is largely unexplored. But quickly tell us what has been done in this field, and then tell us how your research addresses the proverbial gap in the literature.

GALLEY: Well, there has been a lot of work on vision and language in other problems, like not just about reasoning. Maybe let me just mention why reasoning is important. So one reason I think it’s very interesting to evaluate these large language models in terms of reasoning skill is that we evaluate their capabilities beyond just memorization. So as many of your listeners probably know, these large foundation models are trained on large amounts of text that is public data from various sources. So when you ask a question to a large foundation model, it could be the case, in many cases, that it just memorizes things it has seen in the data.

HUIZINGA: Sure.

GALLEY: So what makes reasoning interesting is that the answer oftentimes is not there in the data. So the model needs to develop this ability to connect the dots between various pieces of information to come up with a new answer. So the focus of our paper is really on mathematical reasoning, but it also goes a bit beyond that because science questions and so on are also represented in the data.

HUIZINGA: Yeah …

GALLEY: So this reasoning part has largely focused, until MathVista, on text-only modalities.

HUIZINGA: Yeah …

GALLEY: So it’s one of the very first ones that combines text and images in terms of evaluating these large foundation models. So you ask about what was done before. So, yes, there has been a lot of work, text only, on reasoning, for example, mathematical questions that are just based on text. And there has been a different stream of work that was much more focused on vision. A lot of work has been on tasks such as visual question answering …

HUIZINGA: Yeah …

GALLEY: … where basically, you have an image and the task is to answer a question about that image. So, yes, we’re trying to fuse the two lines of research here.

HUIZINGA: Right …

GALLEY: And that’s one of the first works that does that.

HUIZINGA: Yeah. Well, let’s talk about your methodology for a minute. Tell us how you went about conducting this research, and what methods did you use?

GALLEY: Yes, sure. So that’s a bit different from a typical, kind of, machine learning paper because the focus of this work is really on benchmarking, on the dataset. So the methodology is more about how we collect the data and process it. So there were two components to doing that. One was to look at existing data that already combines vision and text. And there are existing datasets that are actually already fairly big but that were not focused on reasoning. So we use those existing datasets and look for instances in the data that actually include some mathematical or science reasoning. And so that part is leveraging existing datasets, but the important part is, like, we really wanted to carve out the interesting pieces in terms of reasoning. And we had different stages of processing the data to identify the subset that was reasoning-based. So one first step was basically to apply some automatic filter to determine whether or not a given example, let’s say something that is visual and text, actually involves some mathematical reasoning. So we have different strategies. For example, if the answer is numerical, it’s likely that it might be something mathematically related. But that’s just the first stage. And in the second stage, we actually had human annotators certify that the selected data is actually of high quality. So we do check, “Oh, this is mathematical,” or that it’s either mathematical or scientific, and so on. And that’s one part of the effort. The other part is that we realized, while we collected the data, that there are certain types of mathematical reasoning, or types related to mathematical reasoning, that were not represented in the data. So we created three new datasets as part of MathVista. So when I say dataset, it’s more like, think of MathVista as an aggregate of different types of data, and we added three of them, three new types of data. One is what we call PaperQA, which is basically data collected from scientific papers on arXiv, with questions about the paper that include some visual component from the paper, typically a plot or a figure.
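To make the two-stage filtering Galley describes a bit more concrete, here is a minimal sketch of the kind of automatic first-pass filter he mentions: a numeric-answer heuristic plus a few keyword checks, with human annotation assumed as the second stage. The field names, keywords, and data format below are illustrative assumptions, not MathVista’s actual pipeline.

```python
import re

def looks_mathematical(example: dict) -> bool:
    """Heuristic first-pass filter: flag examples whose answer is numeric
    or whose question mentions math-related vocabulary. Candidates still
    go to human annotators for verification (the second stage described above)."""
    answer = str(example.get("answer", "")).strip()
    question = example.get("question", "").lower()

    # Rule 1: purely numerical answers are likely math-related.
    if re.fullmatch(r"-?\d+(\.\d+)?%?", answer):
        return True

    # Rule 2: math/science keywords in the question.
    keywords = ("how many", "average", "percent", "area", "slope", "function", "angle")
    return any(k in question for k in keywords)

# Illustrative usage on a hypothetical vision-language dataset:
dataset = [
    {"question": "What is the slope of the plotted line?", "answer": "2.5", "image": "plot_01.png"},
    {"question": "What color is the cat?", "answer": "black", "image": "cat_12.png"},
]
candidates = [ex for ex in dataset if looks_mathematical(ex)]
print(len(candidates))  # -> 1; only the plot question survives the filter
```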

HUIZINGA: Yeah …

GALLEY: And then we had IQTest, which is, I mean, only vaguely related mathematically, but it also, kind of, tries to probe more abstract thinking about input that is both text and visual. And the final one is FunctionQA, which is basically algebraic reasoning over function plots and so on.

HUIZINGA: OK …

GALLEY: The important part was actually to identify among vast amounts of data what is actually very interesting in terms of mathematical reasoning.

HUIZINGA: Yeah …

GALLEY: So that part, I think, was quite a big part of doing that work—finding existing data but also creating new data.

HUIZINGA: Yeah, yeah. Well, my favorite part of a research paper is where it says, “and what we found was … ,” so talk a little bit about your results. What did you find?

GALLEY: So we evaluated a wide variety of models, including GPT-4, Claude 2, GPT-4V, Multimodal Bard, and LLaVA, and we grouped them into three categories. So one is text only. So, basically, you take a model that is by default just text, and we give it the text part of the question and ask it to answer the question. Of course, that’s, kind of, a difficult task because oftentimes [LAUGHTER] we crucially built these questions so that you have to rely on the vision part. But that’s for, you know, scientific investigation, to know how well they can do, and so that’s one category of model. A different category is still text only but is given what is detected in the image. So on the image, we do OCR. So we convert those words from images to text. It’s kind of an extension of the text-based model, except that what was an image is translated into text, and then the input to the model is words only, and that’s a different category of model. And the third one is basically a truly multimodal model. And what we found, I mean, not surprisingly, is that the one that was doing most poorly is the one that is text only. The second is text plus OCR. And then finally, the one that does best is the multimodal one, like GPT-4V. But while the ordering between these three categories makes sense, it was a bit surprising that the gap between multimodal and text plus OCR was not bigger. Well, it’s big, but maybe not as big as we were expecting. So, for example, the best text-plus-OCR model achieved like 35 percent accuracy while GPT-4V was at 50 percent. So it’s a substantial gap but not huge.
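For readers who want a concrete picture of the three evaluation conditions Galley describes, here is a minimal sketch of how the same question could be presented under each one. It assumes a placeholder `query_model` callable and a pre-extracted `ocr_text` field, and the exact-match scoring is a simplification of how benchmark answers are actually graded.

```python
def build_prompt(example: dict, condition: str) -> dict:
    """Assemble model input under the three conditions described above.
    `example` is assumed to carry the question text, an image path, and
    OCR text pre-extracted from the image."""
    if condition == "text_only":
        return {"text": example["question"], "image": None}
    if condition == "text_plus_ocr":
        prompt = f'{example["question"]}\nText detected in the image: {example["ocr_text"]}'
        return {"text": prompt, "image": None}
    if condition == "multimodal":
        return {"text": example["question"], "image": example["image_path"]}
    raise ValueError(condition)

def accuracy(examples, condition, query_model) -> float:
    """Score a model under one condition; `query_model` stands in for
    whatever model API is being evaluated and is assumed to return a string."""
    correct = sum(
        query_model(**build_prompt(ex, condition)).strip() == str(ex["answer"])
        for ex in examples
    )
    return correct / len(examples)
```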

HUIZINGA: Right. Just to clarify, you’re saying OCR. What does that stand for?

GALLEY: [Optical] character recognition.

HUIZINGA: Gotcha.

GALLEY: So, basically, it’s the task of taking text, sometimes typed but sometimes handwritten, and converting it into the actual text like you would have in a text file.

HUIZINGA: Right. Michel, does any of this have to do with the difficulty of the math problems that you present these models with? I mean, it seems to me, similar to humans, that the easier the problem, the easier it would be for the machine. So at what level of math are we talking for these tests?

GALLEY: What’s nice about MathVista is that there’s a continuum of different difficulties. So the spectrum is quite broad, going from elementary school to more advanced concepts such as calculus. So it’s quite broad. So in the paper, we do have this, kind of, broken down by level. So the number I gave you, like 50 percent, is an aggregate over all the difficulties. But …

HUIZINGA: Gotcha.

GALLEY: But the goal there was really, kind of, to compare different models, but we do have a fair amount of analysis in the appendix. Actually, we have 100 pages of appendices with plenty of analysis and so on. So if people, I mean …

HUIZINGA: I saw that. I saw the length of the paper, and I’m going, what? [LAUGHS] That’s a LONG paper! Well, research in the lab is one thing, I always like to say, but understanding real-world impact is important, too. So where’s this work going to make the most difference, and who does it help most at this point?

GALLEY: Well, I think perhaps the main point of this line of work on reasoning is that when looking at these difficult problems that are mathematical, it’s actually a way to, kind of, abstract away maybe more complex capabilities, and I think while thinking just about mathematics might seem a bit narrow, I don’t think it really is. It’s more about seeing whether this model has the ability to do, kind of, multistep processing of your input and think maybe somewhat intelligently about a given problem. So we focus mostly on math. There is some science, but we would be very interested, especially in future work, to, kind of, go beyond that.

HUIZINGA: OK, well, let me press in a little bit there because … just say I’m a regular person using a GPT model. Is your work more addressed upstream from that to the research community to say, how do we get these models to be better so that downstream people like me can be more confident of the models?

GALLEY: Yes, I would say at the moment, I mean, this line of work is perhaps more geared towards the research community, but I think it could be a seed for researchers to think about some applications that also require some kind of step-by-step reasoning but perhaps don’t go beyond math.

HUIZINGA: Yeah. Michel, if there was one thing you wanted our listeners to take away from this research, kind of golden nugget, what would it be?

GALLEY: Well, I would say it’s the challenging part of these datasets. I think that’s what makes MathVista stand out compared to other datasets. By now, there are a few other vision and language datasets, and of course, many that are more text-based. And we’ve seen, for example, some recent papers showing that MathVista actually remains one of the most challenging ones. So I think it’s probably going to stay around for a while because of the difficulty it represents. So it’s an open-source dataset that everybody can use, and I very much encourage people to use it.

HUIZINGA: Is it on GitHub?

GALLEY: Yes, it’s on GitHub.

HUIZINGA: So what’s next on the research agenda for helping LLMs get better at math, Michel? What are the big challenges in the field yet? I mean, you’ve alluded to many of them already, sort of, but what’s next on your research agenda?

GALLEY: Well, I would say what we found so far is that these models are very good at processing the textual part of the problems they’re given, but the equivalent in images is actually harder somehow. So I think a lot more work needs to be done in terms of vision capabilities, in terms of reasoning over images, because the capabilities you see in text are actually quite advanced, whereas the equivalent in images doesn’t seem that good. I mean, a fair disclaimer: my background is more on the text side, [LAUGHTER] and some of my colleagues on the paper are more on the vision side, so if a listener runs into some of our coauthors at the conference, they might want to talk to these vision people because that’s less of my background. [LAUGHS]

HUIZINGA: Well, and if you think about Venn diagrams, you know, you’ve got people that are doing text, people that are doing vision, and then the people that are trying to do both to see how the worlds collide.

[MUSIC]

Well, Michel Galley, thanks for joining us today. And to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/abstracts, or you can find it on arXiv. You can also read it on the website for the International Conference on Learning Representations, or ICLR. And if you happen to be at the ICLR conference this week, you can hear more about it there. See you next time on Abstracts!

[MUSIC FADES]

Research Focus: Week of April 29, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.


Can Large Language Models Transform Natural Language Intent into Formal Method Postconditions?

Informal natural language that describes code functionality, such as code comments or function documentation, may contain substantial information about a program’s intent. However, there is no guarantee that a program’s implementation aligns with its natural language documentation. In the case of a conflict, leveraging information in code-adjacent natural language has the potential to enhance fault localization, debugging, and code trustworthiness. However, this information is often underutilized, due to the inherent ambiguity of natural language which makes natural language intent challenging to check programmatically. The “emergent abilities” of large language models (LLMs) have the potential to facilitate the translation of natural language intent to programmatically checkable assertions. However, due to a lack of benchmarks and evaluation metrics, it is unclear if LLMs can correctly translate informal natural language specifications into formal specifications that match programmer intent—and whether such translation could be useful in practice.

In a new paper: Can Large Language Models Transform Natural Language Intent into Formal Method Postconditions?, researchers from Microsoft describe nl2postcond, the problem of leveraging LLMs to transform informal natural language into formal method postconditions, expressed as program assertions. The paper, to be presented at the upcoming ACM International Conference on the Foundations of Software Engineering, introduces and validates metrics to measure and compare different nl2postcond approaches, using the correctness and discriminative power of generated postconditions. The researchers show that nl2postcond via LLMs has the potential to be helpful in practice by demonstrating that LLM-generated specifications can be used to discover historical bugs in real-world projects.
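As a loose illustration of the nl2postcond idea (not an example taken from the paper), consider a small function whose docstring states its intent. A generated postcondition would be a checkable assertion over inputs and outputs, and its correctness and discriminative power can be assessed by running it against many input/output pairs. The function, docstring, and postcondition below are invented for illustration.

```python
def remove_duplicates(items: list) -> list:
    """Return the input list with duplicate elements removed, preserving
    the order of first occurrence."""
    seen, result = set(), []
    for x in items:
        if x not in seen:
            seen.add(x)
            result.append(x)
    return result

def postcondition(inputs: list, output: list) -> bool:
    """A formal, programmatically checkable postcondition that an LLM might
    produce from the docstring above: the output equals the input with later
    duplicates dropped."""
    return output == list(dict.fromkeys(inputs))

# Checking the postcondition against an input/output pair:
assert postcondition([3, 1, 3, 2, 1], remove_duplicates([3, 1, 3, 2, 1]))
```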


Semantically Aligned Question and Code Generation for Automated Insight Generation

People who work with data, like engineers, analysts, and data scientists, often must manually look through data to find valuable insights or write complex scripts to automate exploration of the data. Automated insight generation provides these workers the opportunity to immediately glean insights about their data and identify valuable starting places for writing their exploration scripts. Unfortunately, automated insights produced by LLMs can sometimes generate code that does not correctly correspond (or align) to the insight. In a recent paper: Semantically Aligned Question and Code Generation for Automated Insight Generation, researchers from Microsoft leverage the semantic knowledge of LLMs to generate targeted and insightful questions about data and the corresponding code to answer those questions. Through an empirical study on data from Open-WikiTable, they then show that embeddings can be effectively used for filtering out semantically unaligned pairs of question and code. The research also shows that generating questions and code together yields more interesting and diverse insights about data.
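A minimal sketch of the kind of embedding-based filtering described above: embed each generated question and its code, and keep only pairs whose embeddings are sufficiently close. The `embed` callable is a placeholder for any text-embedding model, and the threshold is an illustrative assumption rather than a value from the paper.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def filter_aligned(pairs, embed, threshold=0.75):
    """Keep only (question, code) pairs whose embeddings are close.
    `embed` maps a string to a vector; the threshold would need tuning
    against labeled aligned/unaligned examples."""
    kept = []
    for question, code in pairs:
        if cosine(embed(question), embed(code)) >= threshold:
            kept.append((question, code))
    return kept
```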


Explaining CLIP’s performance disparities on data from blind/low vision users

AI-based applications hold the potential to assist people who are blind or low vision (BLV) with everyday visual tasks. However, human assistance is often required, due to the wide variety of assistance needed and varying quality of images available. Recent advances in large multi-modal models (LMMs) could potentially address these challenges, enabling a new era of automated visual assistance. Yet, little work has been done to evaluate how well LMMs perform on data from BLV users.

In a recent paper: Explaining CLIP’s performance disparities on data from blind/low vision users, researchers from Microsoft and the World Bank address this issue by assessing CLIP, a widely used LMM with the potential to underpin many assistive technologies. Testing 25 CLIP variants in a zero-shot classification task, their results show that disability objects, like guide canes and Braille displays, are recognized significantly less accurately than common objects, like TV remote controls and coffee mugs—in some cases by up to 28 percentage points.

The researchers perform an analysis of the captions in three large-scale datasets that are commonly used to train models like CLIP and show that BLV-related content (such as guide canes) is rarely mentioned. This is a potential reason for the large performance gaps. The researchers show that a few-shot learning approach with as few as five example images of a disability object can improve the model’s ability to recognize that object, holding the potential to mitigate CLIP’s performance disparities for BLV users. They then discuss other possible mitigations.
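For readers unfamiliar with the setup, zero-shot classification with CLIP compares an image embedding against text embeddings of candidate labels. The sketch below shows the general pattern using OpenAI’s open-source clip package and PyTorch; the label set, prompt template, model variant, and image path are illustrative and do not reproduce the paper’s protocol.

```python
import torch
import clip
from PIL import Image

# Zero-shot classification sketch in the spirit of the study described above.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["a guide cane", "a Braille display", "a TV remote control", "a coffee mug"]
text = clip.tokenize([f"a photo of {label}" for label in labels]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    # CLIP scores the image against every candidate label description.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

for label, p in zip(labels, probs[0]):
    print(f"{label}: {p:.3f}")
```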


Closed-Form Bounds for DP-SGD against Record-level Inference 

Privacy of training data is a central consideration when deploying machine learning (ML) models. Models trained with guarantees of differential privacy (DP) provably resist a wide range of attacks. Although it is possible to derive bounds, or safe limits, for specific privacy threats solely from DP guarantees, meaningful bounds require impractically small privacy budgets, which results in a large loss in utility.
 
In a recent paper: Closed-Form Bounds for DP-SGD against Record-level Inference, researchers from Microsoft present a new approach to quantify the privacy of ML models against membership inference (inferring whether a data record is in the training data) and attribute inference (reconstructing partial information about a record) without the indirection through DP. They focus on the popular DP-SGD algorithm, which they model as an information theoretic channel whose inputs are the secrets that an attacker wants to infer (e.g., membership of a data record) and whose outputs are the intermediate model parameters produced by iterative optimization. They obtain closed-form bounds for membership inference that match state-of-the-art techniques but are orders of magnitude faster to compute. They also present the first algorithm to produce data-dependent bounds against attribute inference. Compared to bounds computed indirectly through numerical DP budget accountants, these bounds provide a tighter characterization of the privacy risk of deploying an ML model trained on a specific dataset. This research provides a direct, interpretable, and practical way to evaluate the privacy of trained models against inference threats without sacrificing utility.
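For context, DP-SGD (differentially private stochastic gradient descent) is the algorithm whose intermediate, noisy updates the authors treat as the outputs of an information-theoretic channel. Below is a minimal, generic sketch of a single DP-SGD step, per-example gradient clipping followed by Gaussian noise. It illustrates the mechanism being analyzed, not the paper’s bound computation, and the hyperparameters are placeholders.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0, noise_multiplier=1.0):
    """One DP-SGD update: clip each example's gradient to `clip_norm`,
    average the clipped gradients, and add Gaussian noise scaled by
    `noise_multiplier`. The sequence of such noisy updates is what an
    attacker observes in the channel model described above."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    mean_grad = np.mean(clipped, axis=0)
    # Noise on the summed gradient has std noise_multiplier * clip_norm,
    # which is equivalent to this std on the mean.
    noise_std = noise_multiplier * clip_norm / len(per_example_grads)
    noise = np.random.normal(0.0, noise_std, size=mean_grad.shape)
    return params - lr * (mean_grad + noise)
```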

Microsoft Research in the news


TIME100 Most Influential People in Health 

TIME | May 2, 2024

Microsoft Research president Peter Lee is included as an innovator on the 2024 TIME100 Health list, TIME’s inaugural list of 100 individuals who most influenced global health this year.


Sanctuary AI Announces Microsoft Collaboration to Accelerate AI Development for General Purpose Robots 

Sanctuary AI | May 1, 2024

Sanctuary AI and Microsoft are collaborating on the development of AI models for general purpose humanoid robots. Sanctuary AI will leverage Microsoft’s Azure cloud resources for their AI workloads.


Tiny but mighty: The Phi-3 small language models with big potential 

Microsoft Source | April 23, 2024

LLMs create exciting opportunities for AI to boost productivity and creativity. But they require significant computing resources. Phi-3 models, which perform better than models twice their size, are now publicly available from Microsoft.


AI Is Unearthing New Drug Candidates, But It Still Needs Human Oversight 

Drug Discovery Online | April 11, 2024

Drug Discovery Online published a contributed article from Junaid Bajwa discussing how recent advancements in AI offer the potential to streamline and optimize drug development in unprecedented ways.


How AI is helping create sustainable farms of the future 

The Grocer | April 16, 2024

Ranveer Chandra authored an essay on how AI is helping create sustainable farms of the future for UK-based trade outlet, The Grocer.


The Future of AI and Mental Health 

Psychiatry Online | April 16, 2024

Psychiatric News published an article featuring Q&A with Jina Suh, highlighting the important considerations for the use of AI technologies among psychiatrists and mental health professionals.


MatterGen’s Breakthroughs: How AI Shapes the Future of Materials Science 

Turing Post | April 19, 2024

Turing Post covered MatterGen in an interview with Tian Xie. Learn more about this impactful generative model for inorganic materials design.


Machine Learning Street Talk interview with Chris Bishop 

Machine Learning Street Talk | April 10, 2024

Chris Bishop joined Dr. Tim Scarfe for a wide-ranging interview on advances in deep learning and AI for science.

Microsoft at ASPLOS 2024: Advancing hardware and software for high-scale, secure, and efficient modern applications


Modern computer systems and applications, with unprecedented scale, complexity, and security needs, require careful co-design and co-evolution of hardware and software. The ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) is the main forum where researchers bridge the gap between architecture, programming languages, and operating systems to advance the state of the art.

ASPLOS 2024 is taking place in San Diego between April 27 and May 1, and Microsoft researchers and collaborators have a strong presence, with members of our team taking on key roles in organizing the event. This includes participation in the program and external review committees and leadership as the program co-chair.

We are pleased to share that eight papers from Microsoft researchers and their collaborators have been accepted to the conference, spanning a broad spectrum of topics. In the field of AI and deep learning, subjects include power and frequency management for GPUs and LLMs, the use of Process-in-Memory for deep learning, and instrumentation frameworks. Regarding infrastructure, topics include memory safety with CHERI, I/O prefetching in modern storage, and smart oversubscription of burstable virtual machines. This post highlights some of this work.


Paper highlights

Characterizing Power Management Opportunities for LLMs in the Cloud

The rising popularity of LLMs and generative AI has led to an unprecedented demand for GPUs. However, the availability of power is a key limiting factor in expanding a GPU fleet. This paper characterizes the power usage in LLM clusters, examines the power consumption patterns across multiple LLMs, and identifies the differences between inference and training power consumption patterns. This investigation reveals that the average and peak power consumption in inference clusters is not very high, and that there is substantial headroom for power oversubscription. Consequently, the authors propose POLCA: a framework for power oversubscription that is robust, reliable, and readily deployable for GPU clusters. It can deploy 30% more servers in the same GPU clusters for inference tasks, with minimal performance degradation.
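To make the oversubscription intuition concrete: individual servers rarely peak at the same time, so a cluster provisioned for the sum of per-server provisioned power has headroom relative to the aggregate peak it actually draws. The sketch below estimates that headroom from hypothetical power traces; it illustrates the idea only and is not POLCA’s framework.

```python
import numpy as np

def oversubscription_headroom(power_traces: np.ndarray, provisioned_per_server: float) -> float:
    """power_traces: array of shape (num_servers, num_timesteps) with measured
    server power draw (kW). Returns the fraction of additional servers that
    could fit if the cluster were sized for the observed aggregate peak rather
    than the sum of per-server provisioned power."""
    num_servers = power_traces.shape[0]
    budget = num_servers * provisioned_per_server      # conservative provisioning
    aggregate_peak = power_traces.sum(axis=0).max()    # what the cluster actually draws
    return budget / aggregate_peak - 1.0

# Example with made-up traces: 4 servers provisioned at 10 kW each,
# whose 3-8 kW draws rarely peak simultaneously.
rng = np.random.default_rng(0)
traces = rng.uniform(3.0, 8.0, size=(4, 1000))
print(f"~{oversubscription_headroom(traces, 10.0):.0%} more servers could fit")
```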

PIM-DL: Expanding the Applicability of Commodity DRAM-PIMs for Deep Learning via Algorithm-System Co-Optimization

PIM-DL is the first deep learning framework specifically designed for off-the-shelf processing-in-memory (PIM) systems, capable of offloading most computations in neural networks. Its goal is to surmount the computational limitations of PIM hardware by replacing traditional compute-heavy matrix multiplication operations with Lookup Tables (LUTs). PIM-DL first enables neural networks to operate efficiently on PIM architectures, significantly reducing the need for complex arithmetic operations. PIM-DL demonstrates significant speed improvements, achieving up to ~37x faster performance than traditional GEMM-based systems and showing competitive speedups against CPUs and GPUs.
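The core substitution described above, replacing a matrix multiply with table lookups, can be illustrated with a product-quantization-style sketch: snap each input vector to a learned centroid and read precomputed centroid-weight products from a lookup table. PIM-DL’s actual LUT construction and PIM offloading are considerably more sophisticated; the code below is a simplified, single-codebook illustration with made-up shapes.

```python
import numpy as np

def build_lut(weights: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Precompute dot products between every centroid and every weight column.
    weights: (d, out), centroids: (k, d)  ->  LUT of shape (k, out)."""
    return centroids @ weights

def lut_matmul(x: np.ndarray, centroids: np.ndarray, lut: np.ndarray) -> np.ndarray:
    """Approximate x @ weights by snapping each input row to its nearest
    centroid and reading the precomputed products from the table."""
    # Nearest centroid per input row (squared Euclidean distance).
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return lut[idx]

# Tiny demo with hypothetical shapes.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))          # weight matrix
C = rng.normal(size=(32, 16))         # 32 learned centroids over the input space
X = C[[3, 17, 5]] + 0.01 * rng.normal(size=(3, 16))  # inputs near known centroids
approx = lut_matmul(X, C, build_lut(W, C))
exact = X @ W
print(np.abs(approx - exact).max())   # small, since the inputs sit near centroids
```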

Cornucopia Reloaded: Load Barriers for CHERI Heap Temporal Safety

Memory safety bugs have persistently plagued software for over 50 years and underpin some 70% of common vulnerabilities and exposures (CVEs) every year. The CHERI capability architecture is an emerging technology (especially through Arm’s Morello and Microsoft’s CHERIoT platforms) for spatial memory safety and software compartmentalization. In this paper, the authors demonstrate the viability of object-granularity heap temporal safety built atop CHERI with considerably lower overheads than prior work.

AUDIBLE: A Convolution-Based Resource Allocator for Oversubscribing Burstable Virtual Machines

Burstable virtual machines (BVMs) are a type of virtual machine in the cloud that allows temporary increases in resource allocation. This paper shows how to oversubscribe BVMs. It first studies the characteristics of BVMs on Microsoft Azure and explains why traditional approaches based on using a fixed oversubscription ratio or based on the Central Limit Theorem do not work well for BVMs: they lead to either low utilization or high server capacity violation rates. Based on the lessons learned from the workload study, the authors developed a new approach, called AUDIBLE, using a nonparametric statistical model. This makes the approach lightweight and workload independent. This study shows that AUDIBLE achieves high system utilization while enforcing stringent requirements on server capacity violations.
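The “convolution-based” part of the approach can be illustrated simply: if each VM’s resource usage is summarized by an empirical histogram, convolving histograms gives the distribution of their aggregate demand, from which the probability of exceeding server capacity can be checked. The sketch below packs identical hypothetical VMs onto a server until a violation budget is hit; AUDIBLE’s actual nonparametric model and workload handling are more involved.

```python
import numpy as np

def max_vms_per_server(usage_histograms, capacity_units, violation_budget=0.01):
    """usage_histograms: list of per-VM empirical distributions over discrete
    resource units (each a 1-D array summing to 1). Place VMs one at a time,
    convolving their histograms to get the aggregate-demand distribution, and
    stop before the probability of exceeding `capacity_units` passes the budget."""
    aggregate = np.array([1.0])                      # zero VMs use zero units
    for placed, hist in enumerate(usage_histograms):
        candidate = np.convolve(aggregate, hist)     # distribution of the new total
        if candidate[capacity_units + 1:].sum() > violation_budget:
            return placed                            # adding this VM would violate the budget
        aggregate = candidate
    return len(usage_histograms)

# Example: bursty VMs that mostly use 1 unit but occasionally spike to 4.
vm = np.array([0.0, 0.7, 0.2, 0.05, 0.05])
print(max_vms_per_server([vm] * 50, capacity_units=40))
```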

Complete list of accepted publications by Microsoft researchers

Amanda: Unified Instrumentation Framework for Deep Neural Networks
Yue Guan, Yuxian Qiu, and Jingwen Leng; Fan Yang, Microsoft Research; Shuo Yu, Shanghai Jiao Tong University; Yunxin Liu, Tsinghua University; Yu Feng and Yuhao Zhu, University of Rochester; Lidong Zhou, Microsoft Research; Yun Liang, Peking University; Chen Zhang, Chao Li, and Minyi Guo, Shanghai Jiao Tong University

AUDIBLE: A Convolution-Based Resource Allocator for Oversubscribing Burstable Virtual Machines
Seyedali Jokar Jandaghi and Kaveh Mahdaviani, University of Toronto; Amirhossein Mirhosseini, University of Michigan; Sameh Elnikety, Microsoft Research; Cristiana Amza and Bianca Schroeder, University of Toronto

Characterizing Power Management Opportunities for LLMs in the Cloud
Pratyush Patel, Microsoft Azure and University of Washington; Esha Choukse, Chaojie Zhang, and Íñigo Goiri, Azure Research; Brijesh Warrier, Nithish Mahalingam, and Ricardo Bianchini, Microsoft Azure Research

Cornucopia Reloaded: Load Barriers for CHERI Heap Temporal Safety
Nathaniel Wesley Filardo, University of Cambridge and Microsoft Research; Brett F. Gutstein, Jonathan Woodruff, Jessica Clarke, and Peter Rugg, University of Cambridge; Brooks Davis, SRI International; Mark Johnston, University of Cambridge; Robert Norton, Microsoft Research; David Chisnall, SCI Semiconductor; Simon W. Moore, University of Cambridge; Peter G. Neumann, SRI International; Robert N. M. Watson, University of Cambridge

CrossPrefetch: Accelerating I/O Prefetching for Modern Storage
Shaleen Garg and Jian Zhang, Rutgers University; Rekha Pitchumani, Samsung; Manish Parashar, University of Utah; Bing Xie, Microsoft; Sudarsun Kannan, Rutgers University

Kimbap: A Node-Property Map System for Distributed Graph Analytics
Hochan Lee, University of Texas at Austin; Roshan Dathathri, Microsoft Research; Keshav Pingali, University of Texas at Austin

PIM-DL: Expanding the Applicability of Commodity DRAM-PIMs for Deep Learning via Algorithm-System Co-Optimization
Cong Li and Zhe Zhou, Peking University; Yang Wang, Microsoft Research; Fan Yang, Nankai University; Ting Cao and Mao Yang, Microsoft Research; Yun Liang and Guangyu Sun, Peking University

Predict; Don’t React for Enabling Efficient Fine-Grain DVFS in GPUs
Srikant Bharadwaj, Microsoft Research; Shomit Das, Qualcomm; Kaushik Mazumdar and Bradford M. Beckmann, AMD; Stephen Kosonocky, Uhnder

Conference organizers from Microsoft

Program Co-Chair

Madan Musuvathi

Submission Chairs

Jubi Taneja
Olli Saarikivi

Program Committee

Abhinav Jangda
Aditya Kanade
Ashish Panwar
Jacob Nelson
Jay Lorch
Jilong Xue
Paolo Costa
Rodrigo Fonseca
Shan Lu
Suman Nath
Tim Harris

External Review Committee

Rujia Wang

Career opportunities

Microsoft welcomes talented individuals across various roles at Microsoft Research, Azure Research, and other departments. We are always pushing the boundaries of computer systems to improve the scale, efficiency, and security of all our offerings. You can review our open research-related positions here.

SIGMA: An open-source mixed-reality system for research on physical task assistance


Imagine if every time you needed to complete a complex physical task, like building a bicycle, fixing a broken water heater, or cooking risotto for the first time, you had a world-class expert standing over your shoulder and guiding you through the process. In addition to telling you the steps to follow, this expert would also tune the instructions to your skill set, deliver them with the right timing, and adapt to any mistakes, confusions, or distractions that might arise along the way. 

What would it take to build an interactive AI system that could assist you with any task in the physical world, just as a real-time expert would? To begin exploring the core competencies that such a system would require, we developed and released the Situated Interactive Guidance, Monitoring, and Assistance (SIGMA) system, an open-source research platform and testbed prototype for studying mixed-reality task assistance. SIGMA provides a basis for researchers to explore, understand, and develop the capabilities required to enable in-stream task assistance in the physical world.


Recent advances in generative AI and large language, vision, and multimodal models can provide a foundation of open-domain knowledge, inference, and generation capabilities to help enable such open-ended task assistance scenarios. However, building AI systems that collaborate with people in the physical world—including not just mixed-reality task assistants but also interactive robots, smart factory floors, autonomous vehicles, and so on—requires going beyond the ability to generate relevant instructions and content. To be effective, these systems also require physical and social intelligence. 

Physical and social intelligence

For AI systems to fluidly collaborate with people in the physical world, they must continuously perceive and reason multimodally, in stream, about their surrounding environment. This requirement goes beyond just detecting and tracking objects. Effective collaboration in the physical world necessitates an understanding of which objects are relevant for the task at hand, what their possible uses may be, how they relate to each other, what spatial constraints are in play, and how all these aspects evolve over time. 

Just as important as reasoning about the physical environment, these systems also need to reason about people. This reasoning should include not only lower-level inferences about body pose, speech and actions, but also higher-level inferences about cognitive states and the social norms of real-time collaborative behavior. For example, the AI assistant envisioned above would need to consider questions such as: Is the user confused or frustrated? Are they about to make a mistake? What’s their level of expertise? Are they still pursuing the current task, or have they started doing something else in parallel? Is it a good time to interrupt them or provide the next instruction? And so forth.

Situated Interactive Guidance, Monitoring, and Assistance

We developed SIGMA as a platform to investigate these challenges and evaluate progress in developing new solutions.

Left: A person using SIGMA running on a HoloLens 2 to perform a procedural task. Middle: First-person view showing SIGMA’s task-guidance panel and task-specific holograms. Right: 3D visualization of the system’s scene understanding showing the egocentric camera view, depth map, detected objects, gaze, hand and head pose.

SIGMA is an interactive application that currently runs on a HoloLens 2 device and combines a variety of mixed-reality and AI technologies, including large language and vision models, to guide a user through procedural tasks. Tasks are structured as a sequence of steps, which can either be predefined manually in a task library or generated on the fly using a large language model like GPT-4. Throughout the interaction, SIGMA can leverage large language models to answer open-ended questions that a user might have along the way. Additionally, SIGMA can use vision models like Detic and SEEM to detect and track task-relevant objects in the environment and point them out to the user as appropriate. This video provides a first-person view of someone using SIGMA to perform a couple of example procedural tasks.
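To give a flavor of what generating task steps on the fly can look like, here is a minimal Python sketch that asks an LLM for a step list. SIGMA itself is a .NET application built on Platform for Situated Intelligence, so this snippet only illustrates the pattern; the prompt, parsing, and use of the openai client library (which requires an API key) are assumptions, not SIGMA’s implementation.

```python
from openai import OpenAI  # assumes the openai Python package and an OPENAI_API_KEY

def generate_task_steps(task: str, model: str = "gpt-4") -> list[str]:
    """Illustrative sketch of generating a step list on the fly, in the spirit
    of SIGMA's LLM-backed task guidance."""
    client = OpenAI()
    prompt = (
        f"List the steps to complete this task: {task}\n"
        "Return one short imperative step per line, with no numbering."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    # One step per non-empty line, ready to display in a guidance panel.
    return [line.strip() for line in text.splitlines() if line.strip()]

# Example usage:
# steps = generate_task_steps("make a pour-over coffee")
```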

Enabling research at the intersection of AI and mixed reality

SIGMA was designed to serve as a research platform. Our goal in open-sourcing the system is to help other researchers leapfrog the basic engineering challenges of putting together a full-stack interactive application and allow them to directly focus on the interesting research challenges ahead.

Several design choices support these research goals. For example, the system is implemented as a client-server architecture: a lightweight client application runs on the HoloLens 2 device (configured in Research Mode), which captures and sends a variety of multimodal data streams—including RGB (red-green-blue), depth, audio, head, hand, and gaze tracking information—live to a more powerful desktop server. The desktop server implements the core functionality of the application and streams information and commands to the client app for what to render on the device. This architecture enables researchers to bypass current compute limitations on the headset and creates opportunities for porting the application to other mixed-reality devices.

SIGMA is built on top of Platform for Situated Intelligence (also known as psi), an open-source framework that provides the fabric, tools, and components for developing and researching multimodal integrative-AI systems. The underlying psi framework enables fast prototyping and provides a performant streaming and logging infrastructure. The framework provides infrastructure for data replay, enabling data-driven development and tuning at the application level. Finally, Platform for Situated Intelligence Studio provides extensive support for visualization, debugging, tuning, and maintenance.

Platform for Situated Intelligence Studio is a tool that enables researchers to visualize various data streams collected and debug the application.

SIGMA’s current functionality is relatively simple, but the system provides an important starting point for discovering and exploring research challenges at the intersection of mixed reality and AI. From computer vision to speech recognition, many research problems, especially when it comes to perception, can and have been investigated based on collected datasets. The recently increased interest in egocentric data and associated challenges provides important fuel for advancing the state of the art. Yet, numerous problems that have to do with interaction and with real-time collaboration are only surfaced by real-time end-to-end systems and are best studied and understood in an interactive context with actual users.

As a testament to Microsoft’s continued commitment to the space, SIGMA provides a research platform and reflects just one part of the company’s work to explore new AI and mixed-reality technologies. Microsoft also offers an enterprise-ready, mixed-reality solution for frontline workers: Dynamics 365 Guides. With Copilot in Dynamics 365 Guides, which is currently being used by customers in private preview, AI and mixed reality together empower frontline workers with step-by-step procedural guidance and relevant information in the flow of work. Dynamics 365 Guides is a richly featured product for enterprise customers, geared toward frontline workers who perform complex tasks. In comparison, SIGMA is an open-source testbed for exploratory research purposes only. 

We hope that SIGMA can provide a solid foundation for researchers to build on. Although the system targets the specific scenario of mixed-reality task assistance, it can help illuminate the challenges of developing social and physical intelligence that arise for any computing systems that are designed to operate in the physical world and interact with people, from virtual agents to physical robots and devices.

If you are interested in learning more and using SIGMA in your own research, check it out at https://aka.ms/psi-sigma. We are excited to collaborate with and work alongside the open-source research community to make faster progress in this exciting and challenging space.

Acknowledgements / Contributors

Ishani Chakraborty, Neel Joshi, Ann Paradiso, Mahdi Rad, Nick Saw, Vibhav Vineet, Xin Wang.

Responsible AI considerations

SIGMA was designed as an experimental prototype for research purposes only and is not intended for use in developing commercial applications. The primary use case is as a research tool to enable academic and industry researchers to push the state of the art in the space of procedural task assistance at the intersection of mixed reality and AI. As such, the system has been open-sourced under a research-only license. Researchers who wish to make use of SIGMA in their own work should first familiarize themselves with the system and the limitations and risks involved with using it in a user-study context and should undergo a full IRB or ethical board review as appropriate for their institution. Limitations, risks, and additional considerations for using the system are described in a Transparency Note available in SIGMA’s open-source repository.

Ideas: Exploring AI frontiers with Rafah Hosn


Behind every emerging technology is a great idea propelling it forward. In the new Microsoft Research Podcast series, Ideas, members of the research community at Microsoft discuss the beliefs that animate their research, the experiences and thinkers that inform it, and the positive human impact it targets. 

In this episode, host Gretchen Huizinga talks with Rafah Hosn, partner, group product manager for AI Frontiers at Microsoft Research. Hosn’s professional experience spans the gamut—from research to product to engineering to research again, the discipline’s uniquely high levels of creativity, curiosity, and intellect drawing her back in. Energized by past technical disruptions she’s experienced, Hosn is on what she describes as her “most exciting adventure” yet, helping to drive scientific advancement in AI and to answer a big question: how far can we push machine intelligence while still delivering technologies people can derive value from? 

Transcript 

[TEASER] 

[MUSIC PLAYS UNDER DIALOGUE] 

RAFAH HOSN: What has changed is that in the old days, we had the luxury of creating something, going and piloting for three months until we know whether it works or not, and then taking one year to productize! That … that, that doesn’t work anymore! Because guess what? In three months, this innovation is, like, topped by four other innovations, be it at Microsoft or elsewhere. So that speed is really shifting the mindset and the spirit of people. 

[TEASER ENDS] 

GRETCHEN HUIZINGA: You’re listening to Ideas, a Microsoft Research Podcast that dives deep into the world of technology research and the profound questions behind the code. I’m Dr. Gretchen Huizinga. In this series, we’ll explore the technologies that are shaping our future and the big ideas that propel them forward. 


[MUSIC FADES] 

My guest today is Rafah Hosn. She’s a partner, group product manager for AI Frontiers at Microsoft Research. I’d call Rafah a sort of organizational conductor, working both with leaders to drive clarity around the mission as well as program managers to make sure they have solid operational strategies to execute on it. Rafah has mad skills in bringing research ideas from lab to life, and I’m thrilled to talk to her today. Rafah Hosn, welcome to Ideas

RAFAH HOSN: Thank you, Gretchen. Oh, my goodness, I have to live up to this introduction now! [LAUGHTER] 

HUIZINGA: Well, before we talk about research ideas, let’s talk about you and your own sort of “reason for being” in the research world. How would you describe your motivation for working in research and—assuming there was one—what was the “big idea” or animating “what if?” behind what you’re doing today? 

HOSN: Yeah, you know, I don’t know. There are so many big ideas, to be honest! Every day, I wake up and I often tell my husband how lucky, like so totally lucky and humbled, I am to be where I am right now in this moment, like right now when society as we know it is being totally disrupted by this huge leap in AI. And why research? Well, I’ve tried it all, Gretchen! I’ve been in research, I went to product, I did engineering, and I did full circle and came back to research. Because, you know, for me personally, there’s no other environment that I know of, for me, that has this amount of creativity and just infinite curiosity and intellect. So working with people that are asking “what next?” and trying to imagine the next world beyond where AI is today is just … this is the big idea. This is why I’m here. This is why I’m excited to come to work every day. 

HUIZINGA: Yeah. Well … and I want to drill in a little bit just, sort of, personally because sometimes there’s a story, an origin story, if you will, of some pivotal aha moment that you say, oh, that’s fascinating, that’s cool, that’s what I want to do. Anything that piqued your interest way back when you were a kid or, sort of, a pivotal moment in your educational years? 

HOSN: Yeah, you know, so many different things that inspire you along the journey, right. It’s not just one thing, Gretchen. My dad was a doctor. He was my biggest inspiration growing up. And the reason is because he had a lot of depth of knowledge in his domain. And I wanted that. I wanted to have depth of knowledge in a domain. So I went engineering against his advice. He really wanted me to be a doctor. [LAUGHTER] So he was not too happy. But, you know, throughout my education, you know, I was there when smartphones came about, when the internet was a thing. And now, like with generative AI, I feel like I’ve lived through so many disruptions, and every one of those was, “Oh my gosh! Like, I am exactly where I want to be!” So multiple inspirations, and every day, I wake up and there’s new news and I’m saying to myself, “OK, that’s great.” I love it! 

HUIZINGA: What a time to be alive! 

HOSN: It is amazing!

HUIZINGA: Yeah. Well, you recently took on this new role in AI Frontiers at Microsoft Research. And that very word “frontiers” evokes images of unexplored, uncharted territories like the Wild West or for Trekkies, maybe “space: the final frontier.” So what does it mean to you to be working at the frontier of artificial intelligence, and what’s the big idea behind AI Frontiers? 

HOSN: You know, it’s my biggest and most exciting adventure so far! Working under Ece Kamar’s leadership in AI Frontiers, we are really trying to push ourselves to think: what’s beyond what there is right now in artificial intelligence? Where can we push more, from a scientific perspective? How do we translate these scientific discoveries into capabilities that people can actually use and derive value from? It’s a big responsibility, as well, because we just don’t want to push the boundaries of AI for the sake of pushing. We want to push it in a safe and responsible way. So it is a big responsibility. 

HUIZINGA: Yeah … 

HOSN: And fundamentally, you know, the unifying big idea in this team is to explore, you know, how far can we push intelligence further into models and encapsulations of those models so that we can, you know, have not just sort of an assistant but really a personal assistant, an agent that can, kind of, do tasks for us, with us, seamlessly across multiple domains? So this is what we’re trying to push for. 

HUIZINGA: Mmm. Rafah, do you feel like you’re at the frontier of artificial intelligence? I mean, what are the emotions that crop up when you are dealing with these things—that you and your teams basically know about but the rest of us don’t?

HOSN: For most days, it’s excitement. Sometimes it’s [LAUGHTER] … it ranges, to be honest. I would say there’s a spectrum of emotions. The dominating one is really just excitement. There’s so much that has happened with GenAI, but I feel like it has opened up so many different paths, as well, for us to explore, and that’s the excitement. And then every time the world accomplishes something, you’re like in astonishment. You’re like, wow, wow. 

HUIZINGA: Yeah … 

HOSN: And then, and then, oh my gosh, what’s next? And so, it’s a range of emotions … 

HUIZINGA: Right … 

HOSN: … but I would say the dominating one is enthusiasm.

HUIZINGA: Yeah. Well, I’ve heard other people on your teams use words like surprise, sometimes even shock … 

HOSN: Yeah, yeah, there are a lot of “wow” factors. Every day, every day, I wake up, I read like my three favorite AI tweets or things like that, and I’m like, “Oh my gosh. I wouldn’t have imagined that this model could do this thing,” so [LAUGHS] … um, but it’s exciting. 

HUIZINGA: We may have to get those accounts in the show notes so that we can follow along with your surprise and amazement in the mornings! 

HOSN: [LAUGHS] Yes! 

HUIZINGA: Well, listen, when we talk about measuring the success of an AI system, we often use the common convention of what we call benchmarks. But I want to zoom out from AI systems for a minute and ask how you might measure the success of an AI lab, which is what you’re working in. What are your benchmarks or key performance indicators—we call them KPIs—for the work going on at AI Frontiers? 

HOSN: Yeah, so I’m going to start with something that may sound surprising maybe to some, but I think it’s the culture first. It’s the culture of endless curiosity, of enthusiasm coupled with a bit of skepticism, to be honest, to ask the questions, the right questions, and this drive to push further. So I would say one KPI of success for me, personally, is, you know, can we maintain this culture of enthusiasm coupled with skepticism so we can ask hard questions in an envelope of enthusiasm and drive for everyone? So that’s one. I would say the other three are … one is around how much can we push scientifically as a community, right? This is a team of people that are getting together with a mission to push the boundaries of our understanding of artificial intelligence. So are we pushing those scientific boundaries? Are we creating insights, not just for the scientific community, but also for Microsoft and the world, so that we know how to derive value from these discoveries, right? At the end of the day, it is awesome to push scientifically. It’s even more awesome if you take this and translate it into something a human being can use … 

HUIZINGA: Yeah … 

HOSN: … or an enterprise can use. And I think … those are kind of my KPIs of success. Culture first, pushing on the scientific boundaries, creating insights for the scientific community as well as for Microsoft so we can derive value for us as a society, right. 

HUIZINGA: Yeah. Well, continuing on this idea of success, and you’ve alluded to this already in terms of characteristics of curiosity and so on, part of your job, as you put it, was “enabling brilliant minds to find success.” So talk a little bit about the personal qualities of these brilliant minds and how you help them find success.

HOSN: Yeah, you know, everybody I work with brings different aspects of brilliance to the table—every day. So in our community of engineers, PMs, researchers, everybody is present with their ideas and their strengths. And they’re pulling together to push harder and faster on our key priorities. And I find folks working in AI these days, you know, to have a renewed fire. It’s really amazing to see. And I talk a lot about curiosity, but, you know, I cannot emphasize how much this is driving a lot of our community to explore new paths that they hadn’t thought about prior to this GenAI coming along. And so everybody is showing up, present, asking these questions and trying to solve new scenarios, new problems that are emerging. And from my perspective, you know, as you mentioned, I just try to unblock, basically. My team and I are here to [LAUGHTER] … well, two things I would say. First is bring the outside-in perspective. That’s so important because science is amazing, but unless you can derive value from it, it remains an awesome paper and an awesome equation, right. So asking, who can use this? What are the scenarios it could, you know, light up? How can we derive value? So those are the questions that my team and I can contribute to, and we are trying to participate from ideation all the way to basically delivering on key milestones. And that last mile is so important. Like, once you know what you want to do, how do you structure? How do you have an operational strategy that is amenable to these times, which is fast, fast, fast, and faster? So that’s, kind of, what we’re trying to do here. 

HUIZINGA: Yeah, yeah. Well, two things came to my mind in terms of what kinds of people would end up working in this area. And one would be agility, or agile. And that would, to me, represent in a researcher that the person would be able to spin or pivot if something didn’t work out. And the other one is sort of a risk-reward mentality. It’s like, where are you willing to push to get that reward versus what might keep you from even trying? 

HOSN: Yeah, so definitely in this AI Frontiers community, I’m finding a lot of adaptability. So people willing to try, failing fast when they fail, and pivoting. And you have to, nowadays, in this atmosphere that we are living in. And because we have the privilege of working in research—and it’s really an honor and a privilege, and I’m not saying it just lightly—but it is the place where you can take risks, Gretchen. It is the place where failing is totally fine because you’re learning and you’re pivoting in a way that allows you to progress on the next thing you tackle. So I feel like most of the people I work with in this community, AI Frontiers, we are risk takers. We want to push, and it’s OK to fail, and it’s OK to adapt. So, I think, as an aggregate, that’s kind of the spirit I’m seeing. 

HUIZINGA: In the past, Rafah, you’ve stressed the importance of both teams and timing. And so we’ve been talking about the teams and the minds and the kinds of qualities in those people. But what about the “when” of research? How does timing impact what gets done in your world?

HOSN: Well, in this new era, Gretchen, everything is yesterday! [LAUGHS] I mean, it is true. AI research is moving at such speeds that I feel like we need to get accustomed to a timing of now. And if it’s not now, it’s yesterday. So the timing is important, but the leeway has shrunk so significantly that I feel like we have to really just be present in the moment and just move as fast as we can because everybody else is moving at the highest speed. So timing is “now,” is what I would say. 

HUIZINGA: On that note, with so many innovations in AI coming out every day, every minute, what you’ve just expressed is that research horizons are shorter than ever. But as one of your team members noted in a recent panel, it still takes a lot of time to translate a research artifact, maybe a noteworthy finding or a published paper or an equation, an algorithm, into a useful product for humans. So how are you then dealing with these newly compressed timelines of “it needs to be done yesterday to keep up,” and how has the traditional research-to-product pipeline changed? 

HOSN: Yeah, it’s an awesome question. It is so true that going from research to a production-quality algorithm or capability takes time. But what I’m seeing is that the research-to-capabilities is accelerating, meaning if you look at the world today in generative AI and its surrounding, folks even in research are creating assets as they are creating their research. And so they are thinking as well, how do I showcase this? And of course, these assets are not production ready. But here’s the kicker. I think that the product teams are also adapting to this generative AI era, and they are changing to meet this disruptive moment. They are changing the way they think, and they are accelerating the way they productize and look at hardening and securing the assets so that they can put them in the hands of even a limited set of users just to get a feel of what it means to have them in the hands of end users and quickly iterating so that they can further harden and further improve the design until it’s production ready. And I feel like our product partners are meeting the moments, meaning they also are really adapting their processes such that they can get these assets and put them in the hands of users and test them out before they actually release them. 

HUIZINGA: Right. Let’s drill in a little bit more on that and talk about the traditional research-to-product pipeline, where you would have a researcher working on something and then an RSDE. What does RSDE stand for? 

HOSN: A research software development engineer. It’s a mouthful. 

HUIZINGA: Right. And then to the PM, or program manager, and then to the engineer. And you’ve said this provocative statement: now everyone is a PM! 

HOSN: Everyone is a PM! [LAUGHTER] 

HUIZINGA: What do you mean by that?

HOSN: I just, I just feel like if we are to meet the moment, we need to be thinking outside-in, inside-out simultaneously. And I believe that the spirit of program management, which is looking at the design from a user-centric perspective, is embedded as we are ideating, as we are trying to explore new methodologies, new algorithms, new assets. And so what has changed is that in the old days, we had the luxury of creating something, going and piloting for three months until we know whether it works or not, and then taking one year to productize! That … that, that doesn’t work anymore. [LAUGHTER] 

HUIZINGA: Right. 

HOSN: Because guess what? In three months, this innovation is, like, topped by four other innovations, be it at Microsoft or elsewhere. So that speed is really shifting the mindset and the, and the spirit of people. I have colleagues and friends, researchers, that are asking me, oh, scenarios, users … I mean it’s amazing to see. So, yes, everybody has gotten a little PM in them now. [LAUGHTER] 

HUIZINGA: Yeah, I did a podcast with Shamsi Iqbal and Jina Suh. And Shamsi was talking about this concept, this old concept, of the researcher being in their lab and saying, well, I’ve done this work; now go see what you want to do with it. I don’t think you have that affordance anymore as a researcher. 

HOSN: No … 

HUIZINGA: You’ve got to work much more tightly with other team members and think like a PM. 

HOSN: Totally. 

HUIZINGA: So let’s talk about how the big general idea behind AI Frontiers is giving birth to smaller, more specific ideas. What are some of the research directions and projects that you could tell us about that illustrate this vision here? 

HOSN: Yeah, and I’m sure you’ve heard some of it come from Ece Kamar as she spoke on this community that we have. In AI Frontiers, we’re exploring, I would say, three major areas of research. And I want you to imagine a stack. At the bottom of the stack, we’re asking ourselves questions around, what are some new architectures we can be thinking about for these foundational models? How do we create them? What kind of data we need to train them, to pre-train them. And then on top of that stack, which starts with a foundation model, we’re asking ourselves, OK great, you have a pretrained model. In a lot of cases, when you’re creating especially small models, you need to fine-tune them. So what is this methodology and data generation pipeline that we’re going to use to fine-tune these models and specialize them for both across domains and across skill set? And on top of that—so now we’re on the third layer—we have a final layer that encapsulates these models and orchestrates among them to allow them the ability to do, you know, complex tasks. And we don’t want to stop there because for us it’s … we don’t want to have an agent that just does things and doesn’t learn. So that learnability, that learning on the job, like we do as humans, is something we’re asking ourselves, as well. Like, how do we encapsulate these models? We orchestrate among them. And we allow these encapsulated things, we call them agents, to learn on the job so that they can accomplish more complex tasks. So those are the three things. And then cutting across these three layers, imagine there’s a thing that cuts across them, is doing everything in a way that allows us to rigorously evaluate and to ensure that we’re doing things in a safe and responsible way. So those are the main things that we’re working on. Does that make sense? 

HUIZINGA: That’s … yes, it does. And I imagine, you know, if you go to the website and you see those, kind of, three main areas, I imagine that even under there, there are specific projects on, you know, how then do we iterate? How then do we explore? 

HOSN: That’s right. That’s a good plug for people to visit the AI Frontiers website! Thank you, Gretchen! [LAUGHS] 

HUIZINGA: Well, I’ve been intrigued for a while by this idea of what you’ve called bi-directional enrichment, which represents both how research informs product but also how product informs research, but you’ve recently talked about how this idea has expanded to embrace what you call multi-directional enrichment and co-innovation. So what do you mean by that, and what does it look like for you? 

HOSN: So we talked just moments ago how the time has shrunk tremendously in artificial intelligence and the speed at which innovations are coming out. So what does that mean when you are sitting in research and you’re trying to derive value for Microsoft, for example? It means that now, rather than going on a journey to try out you know different things, what you want is for product to come on a co-innovation journey with you. And not every team has the capability or the time or the resources to do it. But sometimes product teams have applied scientists that are asking themselves very similar questions. And so now we have this huge synergistic effect by which, you know, researchers can come and explore their research but anchor them in a real-world scenario that the product team is, you know, asking themselves about. And that’s what I mean by co-innovation. And we look for co-innovation, so these are product teams or applied scientists in product teams that are not looking at something I can ship tomorrow. Because that’s not … that’s not frontiers. That’s feature-function that they can deliver right now to their customers. When we co-innovate, we have to co-innovate on a bit of a longer timespan. Now it’s no longer years, right? With generative AI, everything is months, but nonetheless, this is not next week. This is in a few months. And so … but this is really, really great because, again, I keep saying this and I have maybe a huge bias, but I do believe that research, without it being anchored in real-world scenario, just doesn’t have the same effect. So I have a bias for that. It’s my PM hat, what can I say? I love real-world scenarios! [LAUGHTER] 

HUIZINGA: What you just referred to is an interesting flow. I’ve noticed in my years doing this podcast that some people that started in research ended up over in product—and we’ll call them embedded researchers, if you will—and then some people that were in a product scenario come back over to research. And so, there’s this flow, multi-directional, bi-directional, and also where they’re placed within the company. How do you see that flow and the value of that flow between these organizations? 

HOSN: Yeah, you know, like, I think that the flow is important because that’s how cross-pollination happens. And you talked about brilliant minds. In product teams, there are brilliant minds, as well, right. And although their focus area is more around the product they live and breathe every day, this is enriching to researchers and continues to be enriching because when you deploy research capabilities in a real-world setting, there are surprising new research questions that come up, not just engineering. A lot of times people think of research, OK, yeah, you scale it, you harden it, you secure it, and it’s good to go. But that’s not always the case. In a lot of cases, because of the interactivity that happens with real-world scenarios, it opened up brand-new paths for research. And so I think that flow continues to happen even now. It’s just compressed. It’s just that researchers are no longer thinking six years. Researchers are thinking three months. Like, what am I going to do in three months? Because in three months, there will be a hundred other researchers that are coming up with innovation on the same question. So I think the flow still exists. I think that time has shrunk. And I think the mobility from researchers and research going to product and vice versa is enriching for the people that do it because you gain different perspectives. 

HUIZINGA: Well, and let’s push in even there a little bit. Researchers like everyone else can get comfortable looking at things through a particular lens. I would say that’s a human trait, not just a research trait … 

HOSN: Absolutely. 

HUIZINGA: … until a disruption challenges their status quo. So you’ve talked about LLMs, which we’ve called large language models, as being a good forcing function for researchers to think differently, even about the questions they’re asking. Can you elaborate on that a little bit? 

HOSN: Yeah, yeah, so, you know, the large language models and this disruption that we are living in at the moment is lighting fire underneath a lot of people’s intellect, I’m going to say. And so I think that people have to adapt quickly to change. And this is key. Adaptability, I believe, is just a key ingredient in doing research nowadays. Why? Because a lot of people are thinking directionally the same. And so, you know, if you’re not the first, you’re going to have to adapt to what came out. And then you have to think of, how do I differentiate? So the second point I would say is differentiation. And this mindset of, you know, how do I adapt to what just came out? How do I differentiate? And then—Rafah’s bias—how do I anchor in real-world scenario? This is the home run. And I would say you package all of this and focus, focus, focus … and you get a gold mine. 

HUIZINGA: I’m hearing “yes, and …” in this response in the sense of not everyone’s going to be first, but then, what else? This is back to the frontiers. It’s like, how do I differentiate? Yes, that’s awesome. And we’ve got this … 

HOSN: Exactly. And how do I build on what has just been discovered and give it a little bit of an edge or push it a little further or take it in a brand-new direction? I mean, so many different possibilities, but it does take adaptability, like a flexibility in the mindset, I would say. 

HUIZINGA: Yeah. Well, let’s go back to what you alluded to earlier, this idea of responsible AI. This is a big deal at Microsoft. And researchers are very thoughtful about the question of what could possibly go wrong if we got everything right. But how does that translate practically, and what concrete steps are you taking at what I’ll call the “frontier of responsibility?” 

HOSN: Yeah, and as I mentioned, you know, being at the frontiers is amazing. It also holds a big responsibility. We have so many different, I would say, checks and balances that we use, in model training and fine-tuning, to ensure that we are on top of all the regulatory, the policymaker suggestions, and we are abiding by Microsoft values first and foremost and responsibility in creating these innovations. So practically and tactically, what happens is that there are processes for how you actually even release any type of model. And this is just research. And when it goes to product, they have their own compliance, you know, a stricter even compliance, I would say, process that they go through. So we try, and I try particularly, to partner with our privacy champions, with our legal champions, with our people that are looking at this from a responsible AI perspective, so that we bring them in early on, and we say, hey, we’re thinking of doing this. And they tell us, well, you know, if you’re thinking about it this way, you might want to consider this. So we’re trying to bring them in as early as possible so that also we don’t go all the way and then we discover we did something wrong, so we have to backtrack. So I would say, you know, having these partners and colleagues come in early in the game just saves everybody a lot of time. And all this responsible AI for us, it’s ingrained with how we work, meaning we bring our champions early on and then we have them advise us as we move along the journey to create these innovations. So by the time we’re done, we know we’re good, right. And even by the time we’re done, we recheck everything, we run a lot of evaluation benchmarks, and, you know, we do the right thing per policies at Microsoft. So we take it very, very seriously. 

HUIZINGA: Well, let’s go back to this idea of research horizons for a second and anchor it in the way that we approach research. So many ideas are basically iterative steps on existing work, and they make a lot of sense … this is the next step … but then there are those out-of-the-box ideas that feel like maybe bigger swings—some might even call them outrageous—and in organizations like Microsoft Research, they might get the green light, too. Where do you find this idea of the outrageous or maybe longer-term idea finding a home or a place in an organization like Microsoft Research, and have you ever worked on something that felt outrageous to you? 

HOSN: Umm, you know, we like outrageous! That’s why we’re in research, right? So outrageous is good. I haven’t, to be honest, worked on an outrageous, but I am confident I will be. So … [LAUGHTER] I just have this belief that in AI Frontiers, we are going to have outrageous ideas, and we’re going to work on them, and we’re going to make bets that basically are hard to make in other parts of the company because we have the privilege of taking them and pursuing them. And, yes, they may fail, but if we have a breakthrough, it will be a significant breakthrough. So, so I think that outrageous is good. We need to think big. We need to take big leaps, big ideas. We also need to know how to fail gracefully and pivot fast! 

HUIZINGA: Hmmm. Mmm. You know, it strikes me, and I’m laughing to myself, it strikes me, even as we’re talking, that the idea that you work in AI Frontiers, that’s outrageous to most people and, and it’s normal to you. So maybe this idea of, “I haven’t worked on anything outrageous” is like, no, you live in outrageous, it just doesn’t seem like it! [LAUGHTER] 

HOSN: Maybe. It’s my day-to-day job, so, yes, I guess you’re right. 

HUIZINGA: Right. I mean, yeah, you say, we love outrageous, and that’s where it is right now. Every day that I follow, sort of, AI Twitter also and find myself going, seriously? That happened yesterday? What next? 

HOSN: Yeah, in two hours, there’ll be yet another thing. So, yeah, I guess I am living in outrageous, and I love it! It’s amazing! [LAUGHS] 

HUIZINGA: Yeah, maybe the idea of outrageous is just changed. 

HOSN: You know, you’re so right. I think that it’s become the norm. And it is, once we anchor in generative AI, and we push further on this idea, maybe we will go back in a cycle where outrageous is outrageous, but today it’s our life. It’s where we live. It’s what we breathe every day. So it’s become a norm. 

HUIZINGA: Yeah. Well, as we close, Rafah, I want to ask a question anchored on the big idea behind AI Frontiers. What do you believe might be true in say 10 to 15 years, and what should we be doing about it now? In other words, how does what we believe about the future influence how we conceptualize and execute on ideas today? 

HOSN: Yeah, you know, it’s … I can’t even predict what I’m going to be doing tomorrow! But … [LAUGHTER] here’s, here’s what I think. I think that we are truly approaching a moment in human history where a lot of unsurmountable problems, like very hard-to-tackle diseases that have been so hard, I think we are approaching a moment, you know, soon, I hope it’s even sooner than 10 years, where generative AI and innovations on top of it could lead to a lot of resolution for things that today … that cause unsurmountable pain and suffering. I’m very hopeful that with what we are creating that we can, you know, take inefficiencies out of so many different things that we see today that take time so that we liberate ourselves to think about the “what next” societally, right? I think what we need to be doing right now, to be honest, to influence the future is think about our curricula. What are we going to teach our kids? What are they going to work in? This is where I’m hoping that we pour some of our creativity, education system. How are we preparing the next generation? What are the paths that we are going to forge for them, knowing what we know today, knowing what this technology can bring forth? So my hope is that we put some brain power into that. 

HUIZINGA: Rafah Hosn, it’s always a pleasure to talk to you. A sincere pleasure, a delight. Thanks for joining us today on Ideas.

[MUSIC PLAYS] 

HOSN: Thank you so much for having me, Gretchen. 

[MUSIC FADES]

The post Ideas: Exploring AI frontiers with Rafah Hosn appeared first on Microsoft Research.

SAMMO: A general-purpose framework for prompt optimization

SAMMO optimizer diagram showing progression from starting prompt to optimized prompt.

Large language models (LLMs) have revolutionized a wide range of tasks and applications that were previously reliant on manually crafted machine learning (ML) solutions, streamlining them through automation. However, despite these advances, a notable challenge persists: the need for extensive prompt engineering to adapt these models to new tasks. Newer generations of language models, such as GPT-4 and Mixtral 8x7B, can process long input texts, which makes it possible to provide richer context and more detailed instructions in a prompt. A common technique that uses this enhanced capacity is Retrieval Augmented Generation (RAG), which dynamically incorporates information into the prompt based on the specific input example. This process is illustrated in Figure 1, which shows a RAG prompt designed to translate user queries into a domain-specific language (DSL), a task also known as semantic parsing. 

Figure 1: A RAG prompt used for a semantic parsing task. The underlying metaprompt consists of three larger parts, each with a variety of aspects that can be optimized: for example, the input example can be rendered in different formats, the included few-shot examples can be retrieved using various similarity functions, and the task description can be paraphrased.

The example in Figure 1 combines three distinct structures to construct the final prompt. The first structure, the task description, is static and independent of the input; it is the part that conventional prompt optimization techniques typically target. However, RAG adds two input-specific structures: the retrieved few-shot examples and the input text itself. These introduce numerous optimization opportunities that fall outside the scope of most traditional approaches. Despite previous efforts in prompt optimization, the evolution towards more complex prompt structures has rendered many older strategies ineffective in this new context. 
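To make that three-part structure concrete, here is a minimal sketch of how such a RAG prompt for semantic parsing might be assembled. It is illustrative only: the retriever, the toy example store, and the formatting choices are hypothetical placeholders rather than the setup used in the paper.

```python
from typing import Callable

def build_rag_prompt(
    task_description: str,                   # static part: what the model should do
    retrieve_examples: Callable[[str, int], list[tuple[str, str]]],  # hypothetical retriever
    user_query: str,                          # input-specific part
    n_examples: int = 3,
    example_format: str = "Q: {q}\nA: {a}",   # the rendering format is itself an optimization choice
) -> str:
    """Assemble a RAG prompt from a static task description,
    retrieved few-shot examples, and the input query."""
    examples = retrieve_examples(user_query, n_examples)
    shots = "\n\n".join(example_format.format(q=q, a=a) for q, a in examples)
    return f"{task_description}\n\n{shots}\n\nQ: {user_query}\nA:"

# Usage with a toy retriever; in practice this would be a similarity search over a corpus.
toy_store = [("largest state in the US?", "answer(largest(state(all)))")]
prompt = build_rag_prompt(
    "Translate each question into the geography DSL.",
    lambda query, k: toy_store[:k],
    "which rivers run through Texas?",
)
print(prompt)
```

Each of the three parts of the assembled prompt (the task description, the retrieved examples, and the rendering of the input) is a candidate for the kind of structural optimization discussed next.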

SAMMO: A prompt optimization approach 

To address these challenges, we developed the Structure-Aware Multi-objective Metaprompt Optimization (SAMMO) framework. SAMMO is a new open-source tool that streamlines the optimization of prompts, particularly those that combine different types of structural information, as in the RAG example above. It can make structural changes, such as removing entire components or replacing them with different ones. These features enable AI practitioners and researchers to efficiently refine their prompts with little manual effort.

Central to SAMMO’s innovation is its approach to treating prompts not just as static text inputs but as dynamic, programmable entities—metaprompts. SAMMO represents these metaprompts as function graphs, where individual components and substructures can be modified to optimize performance, similar to the optimization process that occurs during traditional program compilation.
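As a rough illustration of that idea (this is a toy sketch, not SAMMO's actual classes or operators), a metaprompt can be held as a tree of named components, and structural mutation operators, such as dropping an optional component, then produce candidate variants for an optimizer to score.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Component:
    """One node in a metaprompt graph, e.g. a task description or an example block."""
    name: str
    render: str                                   # the text this node contributes
    children: list["Component"] = field(default_factory=list)

    def to_text(self) -> str:
        return "\n".join([self.render] + [c.to_text() for c in self.children])

def drop_random_child(root: Component, rng: random.Random) -> Component:
    """Structural mutation: remove one optional sub-component, returning a new variant."""
    if not root.children:
        return root
    keep = list(root.children)
    keep.pop(rng.randrange(len(keep)))
    return Component(root.name, root.render, keep)

# A toy metaprompt: a task description plus two optional blocks.
metaprompt = Component("root", "Translate the question into the DSL.", [
    Component("guidelines", "Only output valid DSL."),
    Component("examples", "Q: largest state?\nA: answer(largest(state(all)))"),
])
candidate = drop_random_child(metaprompt, random.Random(0))
print(candidate.to_text())
```

An optimizer would generate many such variants, evaluate each one against a small labeled set, and keep the best performers, which is the spirit of treating the prompt as a program rather than as a fixed string.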

The following key features contribute to SAMMO’s effectiveness:

Structured optimization: Unlike current methods that focus on text-level changes, SAMMO focuses on optimizing the structure of metaprompts. This granular approach facilitates precise modifications and enables the straightforward integration of domain knowledge, for instance, through rewrite operations targeting specific stylistic objectives. 
 
Multi-objective search: SAMMO’s flexibility enables it to simultaneously address multiple objectives, such as improving accuracy and computational efficiency. Our paper illustrates how SAMMO can be used to compress prompts without compromising their accuracy; a minimal sketch of this selection idea appears after this list.

General purpose application: SAMMO has proven to deliver significant performance improvements across a variety of tasks, including instruction tuning, RAG, and prompt compression.
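To make the multi-objective idea concrete, here is a rough, self-contained sketch (not SAMMO's actual search procedure, and the candidate scores are invented): prompt variants are scored on accuracy and token count, and only the Pareto-optimal ones, those that no other candidate beats on both objectives at once, are kept.

```python
def pareto_front(candidates: list[dict]) -> list[dict]:
    """Keep candidates that are not dominated on (higher accuracy, lower token count)."""
    front = []
    for c in candidates:
        dominated = any(
            o["accuracy"] >= c["accuracy"] and o["tokens"] <= c["tokens"]
            and (o["accuracy"] > c["accuracy"] or o["tokens"] < c["tokens"])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front

# Hypothetical candidates produced by structural mutations of a prompt.
candidates = [
    {"prompt": "full",          "accuracy": 0.82, "tokens": 900},
    {"prompt": "no-guidelines", "accuracy": 0.82, "tokens": 620},
    {"prompt": "two-shot",      "accuracy": 0.78, "tokens": 400},
    {"prompt": "zero-shot",     "accuracy": 0.61, "tokens": 150},
]
# "full" is dominated by "no-guidelines" (same accuracy, fewer tokens); the rest survive.
print(pareto_front(candidates))
```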


Exploring SAMMO’s impact through use cases 

Use case 1: RAG optimization 

A common application of LLMs involves translating natural user queries into domain-specific language (DSL) constructions, often to communicate with external APIs. For example, Figure 1 shows how an LLM can be used to map user queries about geography facts to a custom DSL.

In a realistic RAG scenario, SAMMO demonstrates significant performance improvements. To demonstrate this, we conducted experiments across three semantic parsing datasets of varying complexity: GeoQuery, SMCalFlow, and Overnight. Given the often limited availability of data in practical settings, we trained and tested the model on a subsampled dataset (training and retrieval set n=600, test set n=100). We compared SAMMO against a manually designed competitive baseline, using enumerative search within a search space of 24 configurations. This included variations in data formats, the number of few-shot examples, and DSL specifications.  

Evaluation 

As illustrated in Figure 2, SAMMO improved accuracy across different datasets and backend LLMs in almost all cases, with the most notable gains observed in older-generation models. However, even with newer models like GPT-4, SAMMO facilitated accuracy improvements exceeding 100 percent.

Figure 2: For semantic parsing with RAG, SAMMO achieves substantial improvements across most backend models and datasets. 

Use case 2: Instruction tuning 

Instruction tuning addresses the optimization of the static instructions given to LLMs that specify the goal and constraints of a task. To show that SAMMO extends beyond many previous prompt-tuning methods, we also applied it in this conventional setting.

To align with previous research, we used eight zero-shot BigBench classification tasks where the baseline prompt for GPT-3.5 achieved an accuracy of less than 0.9. We compared SAMMO against Automatic Prompt Optimization (APO) and GrIPS, using the open-source models Mixtral 8x7B and Llama-2 70B, alongside GPT-3.5, as backend LLMs. We did not include GPT-4 due to the minimal improvement potential identified in pilot experiments. The results, shown in Figure 3, demonstrate that SAMMO outperformed all baselines regardless of the backend model, proving its effectiveness with even more complex metaprompts.

Figure 3: SAMMO matches or exceeds the performance of competing methods for instruction tuning on classification tasks.

Implications and looking forward

SAMMO introduces a new and flexible approach to optimize prompts for specific requirements. Its design works with any LLM, and it features versatile components and operators suitable for a broad range of applications.

We are excited to integrate and apply SAMMO to the components and pipelines behind AI-powered assistant technologies. We also hope to establish a user-driven community centered around SAMMO, where people can exchange best practices and patterns, and encourage the expansion of the existing set of search operators.

The post SAMMO: A general-purpose framework for prompt optimization appeared first on Microsoft Research.

Research Focus: Week of April 15, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Research Focus April 15, 2024

Appropriate reliance on Generative AI: Research synthesis

Appropriate reliance on AI happens when people accept correct AI outputs and reject incorrect ones. It requires users of AI systems to know when to trust the AI and when to trust themselves. But fostering appropriate reliance comes with new complexities when generative AI (genAI) systems are involved. Though their capabilities are advancing, genAI systems, which use generative models to produce content such as text, music, images, and videos, have limitations as well. Inappropriate reliance – either under-reliance or overreliance – on genAI can have negative consequences, such as poor task performance and even product abandonment.  

In a recent paper: Appropriate reliance on Generative AI: Research synthesis, researchers from Microsoft, who reviewed 50 papers from various disciplines, provide an overview of the factors that affect overreliance on genAI, the effectiveness of different mitigation strategies for overreliance on genAI, and potential design strategies to facilitate appropriate reliance on genAI. 


Characterizing Power Management Opportunities for LLMs in the Cloud

Cloud providers and datacenter operators are grappling with increased demand for graphics processing units (GPUs) due to expanding use of large language models (LLMs). To try to keep up, enterprises are exploring various means to address the challenge, such as power oversubscription and adding more servers. Proper power usage analysis and management could help providers meet demand safely and more efficiently. 

In a recent paper: Characterizing Power Management Opportunities for LLMs in the Cloud, researchers from Microsoft analyze power patterns for several popular, open-source LLMs across commonly used configurations and identify opportunities to improve power management for LLMs in the cloud. They present a new framework called POLCA, which enables power oversubscription in LLM inference clouds. POLCA is robust, reliable, and readily deployable. Using open-source models to replicate the power patterns observed in production, simulations show that POLCA could allow 30% more servers to be deployed in existing clusters while incurring minimal power-throttling events. POLCA improves power efficiency, reduces the need for additional energy sources and datacenters, and helps to promptly meet demand for running additional LLM workloads. 
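The paper describes POLCA's actual mechanism in detail; the snippet below is only a minimal illustration of the general power-oversubscription idea, with made-up wattage numbers and a uniform throttling policy that is an assumption of this sketch rather than POLCA's algorithm.

```python
def plan_power_caps(server_draw_watts: list[float], budget_watts: float,
                    min_cap_watts: float = 250.0) -> list[float]:
    """Return a per-server power cap; throttle proportionally only when the
    aggregate draw would exceed the provisioned power budget."""
    total = sum(server_draw_watts)
    if total <= budget_watts:
        return list(server_draw_watts)           # no throttling event needed
    scale = budget_watts / total                 # uniform proportional throttling
    return [max(min_cap_watts, draw * scale) for draw in server_draw_watts]

# Oversubscription example: 12 servers whose worst-case draw exceeds a 4 kW budget,
# but whose typical LLM-inference draw does not.
peak_draw = [400.0] * 12                         # 4.8 kW worst case
typical_draw = [290.0] * 12                      # 3.48 kW typical
print(plan_power_caps(typical_draw, budget_watts=4000))  # left untouched
print(plan_power_caps(peak_draw, budget_watts=4000))     # capped to roughly 333 W each
```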


LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

Various prompting techniques, such as chain-of-thought (CoT), in-context learning (ICL), and retrieval augmented generation (RAG), can empower large language models (LLMs) to handle complex and varied tasks through rich and informative prompts. However, these prompts are lengthy, sometimes exceeding tens of thousands of tokens, which increases computational and financial overhead and degrades the LLMs’ ability to perceive information. Recent efforts to compress prompts in a task-aware manner, without losing essential information, have resulted in shorter prompts tailored to a specific task or query. This typically enhances performance on downstream tasks, particularly in question answering. However, the task-specific features present challenges in efficiency and generalizability. 

In a recent paper: LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression, researchers from Microsoft and Tsinghua University propose a data distillation procedure to derive knowledge from an LLM (GPT-4) and compress prompts without losing crucial information. They introduce an extractive text compression dataset containing pairs of original texts from MeetingBank and their compressed versions. Despite its small size, their model shows significant performance gains over strong baselines and demonstrates robust generalization across different LLMs. The new model is 3x-6x faster than existing prompt compression methods and reduces end-to-end latency by 1.6x-2.9x at compression ratios of 2x-5x. 
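As a purely conceptual sketch of extractive, token-level prompt compression (this is not the LLMLingua-2 model or its API, and the token scores below are invented stand-ins for a trained token classifier), the code keeps the highest-scoring tokens in their original order until a target compression ratio is met.

```python
def extractive_compress(tokens: list[str],
                        keep_score: dict[str, float],
                        ratio: float = 3.0) -> str:
    """Keep the top-scoring tokens (per a hypothetical classifier), preserving order."""
    n_keep = max(1, int(len(tokens) / ratio))
    top = sorted(range(len(tokens)),
                 key=lambda i: keep_score.get(tokens[i], 0.0),
                 reverse=True)[:n_keep]
    return " ".join(tokens[i] for i in sorted(top))

tokens = ("the committee agreed that the budget proposal will be revised "
          "and resubmitted by friday").split()
scores = {"committee": 0.9, "budget": 0.95, "proposal": 0.9, "revised": 0.85,
          "resubmitted": 0.8, "friday": 0.9, "agreed": 0.7}
# -> "committee budget proposal friday" (4 of 14 tokens kept at a 3x ratio)
print(extractive_compress(tokens, scores, ratio=3.0))
```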


AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages

Despite recent progress in scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging. Evaluation is often performed using n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics like COMET have a higher correlation; however, challenges such as the lack of evaluation data with human ratings for under-resourced languages, the complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and the limited language coverage of multilingual encoders, have hampered their applicability to African languages. 

In a recent paper: AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages, researchers from University College London, University of Maryland, Unbabel, Microsoft, and the Masakhane Community address these challenges, creating high-quality human evaluation data with simplified MQM guidelines for error detection and direct assessment (DA) scoring for 13 typologically diverse African languages. They also develop AfriCOMET, a set of COMET-style MT evaluation metrics for African languages, by leveraging DA data from well-resourced languages and an African-centric multilingual encoder (AfroXLMR), achieving state-of-the-art Spearman-rank correlation with human judgments (0.441). 
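For readers unfamiliar with the evaluation criterion, Spearman-rank correlation measures how closely a metric's ranking of translations matches human rankings. The toy example below uses invented scores purely to show the computation.

```python
from scipy.stats import spearmanr

# Hypothetical metric scores and human direct-assessment scores
# for five machine-translated segments (illustrative numbers only).
metric_scores = [0.71, 0.55, 0.80, 0.42, 0.63]
human_scores = [78, 60, 85, 50, 58]

rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman rho = {rho:.3f}")   # 1.0 would mean the rankings agree exactly
```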


Comparing the Agency of Hybrid Meeting Remote Users in 2D and 3D Interfaces of the Hybridge System

Video communication often lacks the inclusiveness and simultaneity enabled by physical presence in a shared space. This is especially apparent during hybrid meetings, where some attendees meet physically in a room while others join remotely. Remote participants are at a disadvantage, unable to navigate the physical space like in-room participants. 

In a Late Breaking Work paper to be presented at CHI 2024: Comparing the Agency of Hybrid Meeting Remote Users in 2D and 3D Interfaces of the Hybridge System, Microsoft researchers present an experimental system for exploring designs for improving the inclusion of remote attendees in hybrid meetings. In-room users see remote participants on individual displays positioned around a table. Remote participants see video feeds from the room integrated into a digital twin of the meeting room, choosing where they appear in the meeting room and from where they view it. The researchers designed both a 2D and a 3D version of the interface. They found that 3D outperformed 2D in participants’ perceived sense of awareness, sense of agency, and physical presence. A majority of participants also subjectively preferred 3D over 2D. The next step in this research will be to test the inclusiveness of Hybridge 3D meetings against fully in-room meetings and traditional hybrid meetings. 


FeatUp: A Model-Agnostic Framework for Features at Any Resolution

Deep features are a cornerstone of computer vision research, capturing image semantics and enabling the community to solve downstream tasks even in the zero- or few-shot regime. However, these features often lack the spatial resolution to directly perform dense prediction tasks like segmentation and depth prediction. This is because models like transformers and convolutional networks aggressively pool information over large areas. 

In a paper that was published at ICLR 2024: FeatUp: A Model-Agnostic Framework for Features at Any Resolution, researchers from Microsoft and external colleagues introduce a task- and model-agnostic framework to restore lost spatial information in deep features. The paper introduces two variants of FeatUp: one that guides features with high-resolution signal in a single forward pass, and one that fits an implicit model to a single image to reconstruct features at any resolution. Both approaches use a multiview consistency loss with deep analogies to neural radiance fields (NeRFs), a deep learning method of building 3D representations of a scene using sparse 2D images. In the new research, features retain their original semantics and can be swapped into existing applications to yield resolution and performance gains, even without re-training. FeatUp significantly outperforms other feature upsampling and image super-resolution approaches in class activation map generation, transfer learning for segmentation and depth prediction, and end-to-end training for semantic segmentation. 
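To see why upsampling matters, the short sketch below (a generic baseline, not FeatUp itself; the backbone dimensions are illustrative) shows how a ViT-style model with 16x16 patches reduces a 224x224 image to a 14x14 feature map, and how naive bilinear upsampling restores the spatial size without recovering lost detail, which is the gap FeatUp-style methods aim to close.

```python
import torch
import torch.nn.functional as F

image_size, patch_size, channels = 224, 16, 384
grid = image_size // patch_size                      # 14x14 tokens for a 224x224 image

# Stand-in for backbone features: [batch, channels, 14, 14]
low_res_feats = torch.randn(1, channels, grid, grid)

# Naive baseline: bilinear upsampling back to pixel resolution.
# It matches the spatial size but cannot recover detail lost to patch pooling.
upsampled = F.interpolate(low_res_feats, size=(image_size, image_size),
                          mode="bilinear", align_corners=False)
print(low_res_feats.shape, "->", upsampled.shape)
```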

The post Research Focus: Week of April 15, 2024 appeared first on Microsoft Research.

Abstracts: April 16, 2024

Stylized microphone and sound waves illustration.

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Senior Research Software Engineer Tusher Chakraborty joins host Gretchen Huizinga to discuss “Spectrumize: Spectrum-efficient Satellite Networks for the Internet of Things,” which was accepted at the 2024 USENIX Symposium on Networked Systems Design and Implementation (NSDI). In the paper, Chakraborty and his coauthors share their efforts to address the challenges of delivering reliable and affordable IoT connectivity via satellite-based networks. They propose a method for leveraging the motion of small satellites to facilitate efficient communication between a large IoT-satellite constellation and devices on Earth within a limited spectrum.

Transcript

[MUSIC]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

I’m talking today to Tusher Chakraborty, a senior research software engineer at Microsoft Research. Tusher is coauthor of a paper called “Spectrumize: Spectrum-efficient Satellite Networks for the Internet of Things.” Tusher, thanks for joining us on Abstracts!


TUSHER CHAKRABORTY: Hi. Thank you for having me here, Gretchen, today. Thank you.

HUIZINGA: So because this show is all about abstracts, in just a few sentences, tell us about the problem your paper addresses and why we should care about it.

CHAKRABORTY: Yeah, so think of, I’m a farmer living in a remote area and bought a sensor to monitor the soil quality of my farm. The big headache for me would be how to connect the sensor so that I can get access to the sensor data from anywhere. We all know that connectivity is a major bottleneck in remote areas. Now, what if, as a farmer, I could just click the power button of the sensor, and it gets connected from anywhere in the world. It’s pretty amazing, right? And that’s what our research is all about. Get your sensor devices connected from anywhere in the world with just the click of power button. We call it one-click connectivity. Now, you might be wondering, what’s the secret sauce? It’s not magic; it’s direct-to-satellite connectivity. So these sensors directly get connected to the satellites overhead from anywhere on Earth. The satellites, which are orbiting around the earth, collect the data from the sensing devices and forward to the ground stations in some other convenient parts of the world where these ground stations are connected to the internet.

HUIZINGA: So, Tusher, tell us what’s been tried before to address these issues and how your approach contributes to the literature and moves the science forward.

CHAKRABORTY: So satellite connectivity is nothing new and has been there for long. However, what sets us apart is our focus on democratizing space connectivity, making it affordable for everyone on the planet. So we are talking about the satellites that are at least 10 to 20 times cheaper and smaller than state-of-the-art satellites. So naturally, this ambitious vision comes with its own set of challenges. So when you try to make something cheaper and smaller, you’ll face lots of challenges that all these big satellites are not facing. So if I just go a bit technical, think of the antenna. So these big satellite antennas, they can actually focus on a particular part of the world. So this is something called beamforming. On the other hand, when we try to make the satellites cheaper and smaller, we can’t have that luxury. We can’t have beamforming capability. So what happens, they have an omnidirectional antenna. So it seems like … you can’t focus on a particular part of the earth; rather, you create a huge footprint all over the earth. So this is one of the challenges that you don’t face in the state-of-the-art satellites. And we try to solve these challenges because we want to make connectivity affordable with cheaper and smaller satellites.

HUIZINGA: Right. So as you’re describing this, it sounds like this is a universal problem, and people have obviously tried to make things smaller and more affordable in the past. How is yours different? What methodology did you use to resolve the problems, and how did you conduct the research?

CHAKRABORTY: OK, I’m thrilled that you asked this one because the research methodology was the most exciting part for me here. As a part of this research, we launched a satellite in a joint effort with a satellite company. Like, this is very awesome! So it was a hands-on experience with a real-deal satellite system. It was not simulation-based system. The main goal here was to learn the challenge from a real-world experience and come up with innovative solutions; at the same time, evaluate the solutions in real world. So it was all about learning by doing, and let me tell you, it was quite the ride! [LAUGHTER] We didn’t do anything new when we launched the satellites. We just tried to see how industry today does this. We want to learn from them, hey, what’s the industry practice? We launched a satellite. And then we faced a lot of problems that today’s industry is facing. And from there, we learned, hey, like, you know, this problem is industry facing; let’s go after this, and let’s solve this. And then we tried to come up with the solutions based on those problems. And this was our approach. We didn’t want to assume something beforehand. We want to learn from how industry is going today and help them. Like, hey, these are the problems you are facing, and we are here to help you out.

HUIZINGA: All right, so assuming you learned something and wanted to pass it along, what were your major findings?

CHAKRABORTY: OK, that’s a very good question. So I was talking about the challenges towards this democratization earlier, right? So one of the most pressing challenges: shortage of spectrum. So let me try to explain this from the high level. So we need hundreds of these satellites, hundreds of these small satellites, to provide 24-7 connectivity for millions of devices around the earth. Now, I was talking, the footprint of a satellite on Earth can easily cover a massive area, somewhat similar to the size of California. So now with this large footprint, a satellite can talk with thousands of devices on Earth. You can just imagine, right? And at the same time, a device on Earth can talk with multiple satellites because we are talking about hundreds of these satellites. Now, things get tricky here. [LAUGHTER] We need to make sure that when a device and a satellite are talking, another nearby device or a satellite doesn’t interfere. Otherwise, there will be chaos—no one hearing others properly. So when we were talking about this device and satellite chat, right, so what is that all about? This, all about in terms of communication, is packet exchange. So the device sends some packet to the satellite; satellite sends some packet to the device—it’s all about packet exchange. Now, you can think of, if multiple of these devices are talking with a satellite or multiple satellites are talking with a device, there will be a collision in this packet exchange if you try to send the packets at the same time. And if you do that, then your packet will be collided, and you won’t be able to get any packet on the receiver end. So what we do, we try to send this packet on different frequencies. It’s like a different sound or different tone so that they don’t collide with each other. And, like, now, I said that you need different frequencies, but frequency is naturally limited. And the choice of frequency is even limited. This is very expensive. But if you have limited frequency and you want to resolve this collision, then you have a problem here. How do you do that? So we solve this problem by smartly looking at an artifact of these satellites. So these satellites are moving really fast around the earth. So when they are moving very fast around the earth, they create a unique signature on the frequency that they are using to talk with the devices on Earth. And we use this unique signature, and in physics, this unique signature is known as Doppler signature. And now you don’t need a separate frequency to sound them different, to have packets on different frequencies. You just need to recognize that unique signature to distinguish between satellites and distinguish between their communications and packets. So in that sense, there won’t be any packet collision. And this is all about our findings. So with this, now multiple devices and satellites can talk with each other at the same time without interference but using the same frequency.
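To put a rough number on that Doppler signature, here is an illustrative back-of-the-envelope calculation; the 915 MHz carrier and 7.5 km/s orbital speed are typical values assumed for this sketch, not figures from the paper.

```python
SPEED_OF_LIGHT = 3.0e8        # m/s
CARRIER_HZ = 915e6            # a common IoT band; illustrative choice only
ORBITAL_SPEED = 7.5e3         # m/s, typical for a low-Earth-orbit satellite

def doppler_shift_hz(radial_velocity_m_s: float, carrier_hz: float = CARRIER_HZ) -> float:
    """Approximate Doppler shift for a satellite approaching (+) or receding (-)."""
    return carrier_hz * radial_velocity_m_s / SPEED_OF_LIGHT

print(doppler_shift_hz(+ORBITAL_SPEED))   # about +22.9 kHz as the satellite approaches
print(doppler_shift_hz(-ORBITAL_SPEED))   # about -22.9 kHz as it recedes
```

The shift sweeps from positive through zero to negative as a satellite passes overhead, and that per-pass pattern is the kind of signature a ground station can use to tell transmissions apart without assigning them separate frequencies.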

HUIZINGA: It sounds, like, very similar to a big room filled with a lot of people. Each person has their own voice, but in the mix, you, kind of, lose track of who’s talking and then you want to, kind of, tune in to that specific voice and say, that’s the one I’m listening to.

CHAKRABORTY: Yeah, I think you picked up the correct metaphor here! This is the scenario you can try to explain here. So, yeah, like what we are essentially doing, like, if you just, in a room full of people and they are trying to talk with each other, and then if they’re using the same tone, no one will be distinguished one person from another.

HUIZINGA: Right …

CHAKRABORTY: Everyone will sound same and that will be colliding. So you need to make sure that, how you can differentiate the tones …

HUIZINGA: Yeah …

CHAKRABORTY: … and the satellites differentiate their tones due to their fast movement. And we use our methodology to recognize that tone, which satellite is sending that tone.

HUIZINGA: So you sent up the experimental satellite to figure out what’s happening. Have you since tested it to see if it works?

CHAKRABORTY: Yeah, yeah, so we have tried it out, because this is a software solution, to be honest.

HUIZINGA: Ah.

CHAKRABORTY: As I was talking about, there is no hardware modification required at this point. So what we did, we just implemented this software in the ground stations, and then we tried to recognize which satellite is creating which sort of signature. That’s it!

HUIZINGA: Well, it seems like this research would have some solid real-world impact. So who would you say it helps most and how?

CHAKRABORTY: OK, that’s a very good one. So the majority of the earth still doesn’t have affordable connectivity. The lack of connectivity throws a big challenge to critical industries such as agriculture—the example that I gave—energy, and supply chain, so hindering their ability to thrive and innovate. So our vision is clear: to bring 24-7 connectivity for devices anywhere on Earth with just a click of power button. Moreover, affordability at the heart of our mission, ensuring that this connectivity is accessible to all. So in core, our efforts are geared towards empowering individuals and industries to unlock their full potential in an increasingly connected world.

HUIZINGA: If there was one thing you want our listeners to take away from this research, what would it be?

CHAKRABORTY: OK, if there is one thing I want you to take away from our work, it’s this: connectivity shouldn’t be a luxury; it’s a necessity. Whether you are a farmer in a remote village or a business owner in a city, access to reliable, affordable connectivity can transform your life and empower your endeavors. So our mission is to bring 24-7 connectivity to every corner of the globe with just a click of a button.

HUIZINGA: I like also how you say every corner of the globe, and I’m picturing a square! [LAUGHTER] OK, last question. Tusher, what’s next for research on satellite networks and Internet of Things? What big unanswered questions or unsolved problems remain in the field, and what are you planning to do about it?

CHAKRABORTY: Uh … where do I even begin? [LAUGHTER] Like, there are countless unanswered questions and unsolved problems in this field. But let me highlight one that we talked here: limited spectrum. So as our space network expands, so does our need for spectrum. But what’s the tricky part here? Just throw more and more spectrum. The problem is the chunk of spectrum that’s perfect for satellite communication is often already in use by the terrestrial networks. Now, a hard research question would be how we can make sure that the terrestrial and the satellite networks coexist in the same spectrum without interfering [with] each other. It’s a tough nut to crack, but it’s a challenge we are excited to tackle head-on as we continue to push the boundaries of research in this exciting field.

[MUSIC]

HUIZINGA: Tusher Chakraborty, thanks for joining us today, and to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/abstracts. You can also read it on the Networked Systems Design and Implementation, or NSDI, website, and you can hear more about it at the NSDI conference this week. See you next time on Abstracts!

[MUSIC FADES]

The post Abstracts: April 16, 2024 appeared first on Microsoft Research.

Ideas: Language technologies for everyone with Kalika Bali

Microsoft Research Podcast | Ideas | Kalika Bali

Behind every emerging technology is a great idea propelling it forward. In the new Microsoft Research Podcast series, Ideas, members of the research community at Microsoft discuss the beliefs that animate their research, the experiences and thinkers that inform it, and the positive human impact it targets. 

In this episode, host Gretchen Huizinga talks with Principal Researcher Kalika Bali. Inspired by an early vision of “talking computers” and a subsequent career in linguistics, Bali has spent the last two decades bringing the two together. Aided by recent advances in large language models and motivated by her belief that everyone should have access to AI in their own language, Bali and her teams are building language technology applications that they hope will bring the benefits of generative AI to under-resourced and underserved language communities around the world.

Transcript 

[TEASER] 

[MUSIC PLAYS UNDER DIALOGUE] 

KALIKA BALI: I do think, in some sense, the pushback that I got for my idea makes me think it was outrageous. I didn’t think it was outrageous at all at that time! I thought it was a very reasonable idea! But there was a very solid pushback and not just from your colleagues. You know, for researchers, publishing papers is important! No one would publish a paper which focused only on, say, Indian languages or low-resource languages. We’ve come a very long way even in the research community on that, right. We kept pushing, pushing, pushing! And now there are tracks, there are workshops, there are conferences which are devoted to multilingual and low-resource languages. 

[TEASER ENDS]

GRETCHEN HUIZINGA: You’re listening to Ideas, a Microsoft Research Podcast that dives deep into the world of technology research and the profound questions behind the code. I’m Dr. Gretchen Huizinga. In this series, we’ll explore the technologies that are shaping our future and the big ideas that propel them forward. 


[MUSIC FADES] 

I’m excited to be live in the booth today with Kalika Bali, a principal researcher at Microsoft Research India. Kalika is working on language technologies that she hopes will bring the benefits of generative AI to under-resourced and underserved language communities around the world. Kalika, it’s a pleasure to speak with you today. Welcome to Ideas!

KALIKA BALI: Thank you. Thank you, Gretchen. Thank you for having me. 

HUIZINGA: So before we dive in on the big ideas behind Kalika Bali’s research, let’s talk about you for a second. Tell us about your “origin story,” as it were, and if there is one, what “big idea” or animating “what if?” captured your imagination and inspired you to do what you’re doing today? 

BALI: So, you know, I’m a great reader. I started reading well before I was taught in school how to read, and I loved science fiction. I come from a family where reading was very much a part of our everyday lives. My dad was a journalist, and I had read a lot of science fiction growing up, and I also saw a lot of science fiction, you know, movies … Star Trek … everything that I could get hold of in India. And I remember watching 2001: Space Odyssey. And there was this HAL that spoke. He actually communicated that he was a computer. And I was just so struck by it. I was like, this is so cool! You know, here are computers that can talk! Now, how cool would that be if it would happen in real life? I was not at all aware of what was happening in speech technology, whether it was possible or not possible, but that’s something that really got me into it. I’ve always, like, kind of, been very curious about languages and how they work and, you know, how people use different things in languages to express not just meaning, not just communicating, but you know expressing themselves, really. And so I think it’s a combination of HAL and this curiosity I had about the various ways in which people use languages that got me into what I’m doing now. 

HUIZINGA: OK. So that’s an interesting path, and I want to go into that just a little bit, but let me anchor this: how old were you when you saw this talking computer? 

BALI: Oh, I was in my early teens. 

HUIZINGA: OK. And so at that time, did you have any conception that … ? 

BALI: No. You know, there weren’t computers around me when I was growing up. We saw, you know, some at school, you know, people coded in BASIC … 

HUIZINGA: Right? 

BALI: And we heard about them a lot, but I hadn’t seen one since I was in high school. 

HUIZINGA: OK. So there’s this inception moment, an aha moment, of that little spark and then you kind of drifted away from the computer side of it, and what … tell us about how you went from there to that! 

BALI: So that, that’s actually a very funny story because I actually wanted to study chemistry. I was really fascinated by how these, you know, molecular parts rotate around each other and, you know, we can’t even tell where an electron is, etc. It sounded, like, really fun and cool. So I actually studied chemistry, but then I was actually going to pick up the admission form for my sister, who wanted to study in this university, and … or, no, she wanted to take an exam for her master’s. And I went there. I picked up the form, and I said, this is a cool place. I would love to study here! And then I started looking at everything like, you know, what can I apply for here? And something called linguistics came up, and I had no idea what linguistics was. So I went to the British Library, got like a thin book on introduction to linguistics, and it sounded fun! And I took the exam. And then, as they say, that was history. Then I just got into it. 

HUIZINGA: OK. I mean, so much has happened in between then and now, and I think we’ll kind of get there in … but I do want you to connect the larger dot from how you got from linguistics to Microsoft Research [LAUGHTER] as a computer scientist.

BALI: So I actually started teaching at the University of South Pacific as a linguistics faculty in Fiji. And I was very interested in acoustics of speech sounds, etc., etc. That’s what I was teaching. And then there was a speech company in Belgium that was looking to start some work in Indian languages, and they contacted me, and at that time, you needed people who knew about languages to build language technology, especially people who knew about phonetics, acoustics, for speech technology. And that’s how I got into it. And then, you know, I just went from startups to companies and then Microsoft Research, 18 years ago, almost 18 years ago. 

HUIZINGA: Wow. OK. I would love to actually talk to you about all that time. But we don’t have time because I have a lot more things to talk to you about, technology-wise. But I do want to know, you know, how would you describe the ideas behind your overarching research philosophy, and who are your influences, as they say in the rock-and-roll world? [LAUGHTER] Who inspired you? Real-life person, scientist or not, besides HAL 9000, who’s fictional, and any seminal papers that, sort of, got you interested in that along the way? 

BALI: So since I was really into speech, Ken Stevens—who was a professor at MIT and who sadly is no longer with us—was a big influence. He, kind of, had this whole idea of how speech is produced. And, you know, that was the first time I was exposed to the whole idea of the mathematics behind speech, and I think he influenced me a lot on the speech side of things. For the language side of things, you know, my professor in India, Professor Anvita Abbi—you know, she’s a Padma Shri, like, she’s been awarded by the Indian government for her work in, you know, very obscure, endangered languages—you know, she kind of gave me a feel for what languages are, and why they are important, and why it’s important to save them and not let them die away. 

HUIZINGA: Right.

BALI: So I think I would say both of them. But what really got me into wanting to work with Indian language technology in a big way was I was working in Belgium, I was working in London, and I saw the beginning of how technology is, kind of, you know, making things easier, exciting; there’s cool technology available for English, for French, for German … But in a country like India, it was more about giving access to people who have no access, right? It actually mattered, because here are people who may not be very literate and therefore may not be able to use technology in the way we know it, but they can talk. 

HUIZINGA: Right. 

BALI: And they can speak, and they should be able to access technology by doing that. 

HUIZINGA: Right. OK. So just real quickly, that was then. What have you seen change in that time, and how profoundly have the ideas evolved? 

BALI: So just from pure methodology and what’s possible, you know, I have seen it all. When I started working in language technology, mainly for Indian languages, but even for other languages, it was all a rule-based system. So everybody had to create all these rules that then were, you know, responsible for building or like making that technology work. But then, just at that time, you know, all the statistical systems and methodologies came into being. So we had hidden Markov models, you know, doing their thing in speech, and it was all about a lot of data. But that data still had to be procured in a certain way, labeled, annotated. It was still a very long and resource-intensive process. Now, with generative AI, the thing that I am excited about is, we have a very powerful tool, right? 

HUIZINGA: Mm-hmm. 

BALI: And, yes, it requires a lot of data, but it can learn also; you know, we can fine-tune stuff on smaller datasets … 

HUIZINGA: Yeah … 

BALI: … to work for, you know, relevant things. So it’s not going to take me years and years and years to first procure the data, then have it tagged for part of speech … then, you know, have it tagged for sentiment, have it tagged for this, have it tagged for that, and then, only can I think of building anything. 

HUIZINGA: Right.

BALI: So it just shortens that timeline so much, and it’s very exciting. 

HUIZINGA: Right. As an ex-English teacher—which I don’t think there is such a thing as an ex-English teacher; you’re always silently correcting someone’s grammar! [LAUGHTER]—just what you said about tagging parts of speech as what they are, right? And that, I used to teach that. And then you start to think, how would you translate that for a machine? So fascinating. So, Kalika, you have said that your choice of career was accidental—and you’ve alluded to the, sort of, the fortuitous things that happened along the way—but that linguistics is one subject that goes from absolute science to absolute philosophy. Can you unpack that a little bit more and how this idea impacted your work in language technology? 

BALI: Yeah. So, so if you think about it, you know, language has a physical aspect, right. We move our various speech organs in a certain way. Our ears are constructed in a certain way. There is a physics of it where, when I speak, there are sound waves, right, which are going into your ear, and that’s being interpreted. So, you know, if you think about that, that’s like an absolute science behind it, right? But then, when you come to the structure of language, you know, the syntax, like you’re an English teacher, so you know this really well, that you know, there’s semantics; there’s, you know, morphology, how our words form, how our sentences form. And that’s like a very abstract kind of method that allows us to put, you know, meaningful sentences out there, right? 

HUIZINGA: Right … 

BALI: But then there’s this other part of how language works in society, right. The way I talk to my mother would be probably very different to the way I’m talking to you, would be very different from the way I talk to my friends, at a very basic level, right? The way, in India, I would greet someone older than me would be very different from the way I would greet somebody here, because here it’s like much less formal and that, you know, age hierarchy is probably less? If I did the same thing in India, I would be considered the rudest creature ever. [LAUGHS] So … and then, you know, you go into the whole philosophy—psycholinguistics part. What happens in our brains, you know, when we are speaking? Because language is controlled by various parts of our brain, right. And then, you go to the pure philosophy part, like why? How does language even occur? Why do we name things the way we name things? You know, why do we have a language of thought? You know, what language are we thinking in? [LAUGHTER] 

HUIZINGA: Right. 

BALI: So, so it really does cover the entire gamut of language … 

HUIZINGA: Yeah, yeah, yeah … 

BALI: … like from science to philosophy. 

HUIZINGA: Yeah, as I said before, when we were talking out there, my mother-in-law was from Holland, and every time she did math or adding, she would do it in Dutch, which—she’d be speaking in English and then she’d go over here and count in Dutch out loud. And it’s like, yeah, your brain switches back and forth. This is so exciting to me. I had no idea how much I would love this podcast! So, much of your research is centered on this big idea called “design thinking,” and it’s got a whole discipline in universities around the world. And you’ve talked about using something you call the 4D process for your work. Could you explain that process, and how it plays out in the research you do with the communities you serve?

BALI: Yeah, so we’ve kind of adapted this. My ex-colleague Monojit Choudhury and I, kind of, came up with this whole thing about 4D thinking, which is essentially discover, design, develop and deploy, right. And when we are working with, especially with, marginalized or low-resource-language communities, the very basic thing we have to do is discover, because we cannot go with, you know, our own ideas and perceptions about what is required. And I can give you a very good example of this, right. You know, most of us, as researchers and technologists, when we think of language technology, we are thinking about machine translation; we’re thinking about speech recognition; we are thinking about state-of-the-art technology. And here we were talking to a community that spoke the language Idu Mishmi, which is a very small community in the northeast of India. And we were talking about, you know, we can do this, we can do that. And they just turned to us and said, what we really want is a mobile digital dictionary! [LAUGHS] 

HUIZINGA: Wow. Yeah … 

BALI: Right? And, you know, if you don’t talk, if you don’t observe, if you are not open to what the community’s needs might be, then you’ll miss that, right. You’ll miss the real thing that will make a difference to that community. So that’s the discover part. The design part, again, you have to design with the community. You cannot go and design a system that they are unable to use properly, right. And again, another very good example: one of the people I know gave me this very good example of why you have to think even at the architecture level when you’re designing such things. A lot of applications in India and around the world require your telephone number for verification. Now, for women, it might be a safety issue. They might not want to give their telephone number. Or in India, many women might not even have a telephone, like a mobile number, right. So how do you think of other ways in which they can verify, right? And so that’s the design part. The develop and the deploy part, kind of, go hand in hand, because I think it’s a very iterative process. You develop quickly, you put it out there, allow it to fail and, you know … 

HUIZINGA: Mm-hmm. Iterate … 

BALI: Iterate. So that’s like the, kind of, design thinking that we have. 

HUIZINGA: Yeah, I see that happening in accessibility technology areas, too, as well as language … 

BALI: Yeah, and, you know, working with the communities, very quickly, you become really humble.

HUIZINGA: Sure.

BALI: There’s a lot of humility in me now. Though I have progressed in my career and, you know, supposedly become wiser, I am much more humble about what I know and what I can do than I was when I started off, you know. 

HUIZINGA: I love that. Well, one thing I want to talk to you about that has intrigued me, there’s a thing that happens in India where you mix languages … 

BALI: Yes!

HUIZINGA: You speak both Hindi and English at the same time, and you think, oh, you speak English, but it’s like, no, there’s words I don’t understand in that. What do you call that, and how did that drive your interest? I mean, that was kind of an early-on kind of thing in your work, right? Talk about that. 

BALI: So that’s called code-mixing or code-switching. The only linguistic difference is that code-mixing happens within a sentence, and code-switching happens across sentences: one sentence in one language and the next in another. 

HUIZINGA: Oh, really? 

BALI: Yeah. So … but this is, like, not just India. This is a very, very common feature of multilingual societies all over the world. So it’s not multilingual individuals, but at the societal level, when you have multilingualism, then, you know, this is a marker of multilingualism. But code-mixing particularly means that you have to be fluent in both languages to actually code-mix, right. You have to have a certain amount of fluency in both languages. And there are various reasons why people do this. You know, it’s been studied by psychologists and linguists for a long time. And for most people like me, multilingual people, that’s the language we dream in, we think about. [LAUGHTER] That’s the language we talk to our siblings and friends in, right. And for us, it’s, like, just natural. We just keep … 

HUIZINGA: Mixing … 

BALI: … flipping between the two languages for a variety of reasons. We might do it for emphasis; we might do it for humor. We might just decide, OK, I’m going to pick this from this … the brain decides I’m going to pick this from this language … 

HUIZINGA: Sure. 

BALI: … and this … So the reason we got interested in, like, looking into code-mixing was that when we are saying that we want humans to be able to interact with machines in their most natural language, then by some estimates, half the world speaks like this! 

HUIZINGA: Right. 

BALI: So we have to be able to understand exactly how they speak and, you know, be able to process and understand their language, which is code-mixed … 

HUIZINGA: Sure. Well, it seems like the human brain can pick this up and process it fairly quickly and easily, especially if it knows many languages. For a machine, it would be much more difficult? 

BALI: It is. So initially, it was really difficult because, you know, the way we created systems was one language at a time … 

HUIZINGA: Right! 

BALI: … right. And it’s not about having an English engine and a Hindi engine available. It doesn’t work that way. 

HUIZINGA: No!

BALI: So you’d really need something that, you know, is able to tackle the languages together. And in some theories, this is almost considered a language of its own because it’s not like you’re randomly mixing. There is a structure to … 

HUIZINGA: Oh, is there? 

BALI: Yeah. Where you can, where you can’t … 

HUIZINGA: Gotcha. 

BALI: You know, so there is a structure or grammar, you can say, of code-mixing. So we went after that. We, kind of, created tools which could generate grammatically viable code-mixed sentences given parallel data, etc. 

HUIZINGA: That’s awesome. Amazing.

BALI: So, yeah, it takes effort to do it. But again, right now, because the generative AI models have at their disposal, you know, so many languages and at least, like, theoretically can work in many, many, many languages, you know, code-mixing might be an easier problem to solve right now. 

HUIZINGA: Right. OK. So we’re talking mostly about widely used languages, and you’re very concerned right now on this idea of low-resource languages. So unpack what you mean by low-resource, and what’s missing from the communities that speak those languages? 

BALI: Yeah. So when we say low-resource languages, we typically mean that languages do not have, say, digital resources, linguistic resources, language resources, that would enable technology building. It doesn’t mean that the communities themselves are impoverished in culture or linguistic richness, etc., right. But the reason why these communities do not have a lot of language resources, linguistic resources, digital resources, most of the time, it is because they are also marginalized in other ways … social and economic marginalization. 

HUIZINGA: Right. 

BALI: And these are … if you look at them, they’re not ti—I mean, of course, some of them are tiny, but when we say low-resource communities, we are talking about really big numbers. 

HUIZINGA: Oh, really? 

BALI: Yeah. So one of the languages that I have worked with—language communities that I’ve worked with—speaks a language called Gondi, which is like a Dravidian language that is spoken in … like a South Indian language that is spoken in the north, central-north area. It’s a tribal language, and it’s got around three million speakers.

HUIZINGA: Oh, wow! 

BALI: Yeah. That’s like more than Welsh, … 

HUIZINGA: Yeah! [LAUGHS] 

BALI: … right? But because socio-politically, they have been—or economically, they have been marginalized, they do not have the resources to build technologies. And, you know, when we say empower everyone and we only empower the top tier, I don’t think we fulfill our ambition to empower everyone. And like I said earlier, for these communities, all the technology that we have, digital tools that we have access to, they really matter for them. So, for example, you know, a lot of government schemes or the forest reserve laws are provided, say, in Hindi. If they were provided in Gondi, these people would have a real idea of what they can do. 

HUIZINGA: Yeah. … Sure. 

BALI: Similarly, for education, you know, there are books and books and books in Hindi. There’s no book available for Gondi. So how is the next generation even going to learn the language? 

HUIZINGA: Right. 

BALI: And there are many, many languages which are low resource. In fact, you know, we did a study sometime in 2020, I think, where we published this paper on linguistic diversity, and there we saw that, you know, we divided languages into five categories, and the topmost category, which has all the resources to build every possible technology, has only five languages, right. And more than half of the world’s languages are at the bottom. So it is a big problem. 

HUIZINGA: Yeah. Let’s talk about some of the specific technologies you’re working on. And I want to go from platform to project because you’ve got a big idea in a platform you call VeLLM. Talk about that. 

BALI: So VeLLM, which actually means jaggery—the sweet, sugary jaggery—in Tamil, one of the languages in India … 

HUIZINGA: Let me, let me interject that it’s not vellum like the paper, or what you’re talking about. It’s capital V, little e, and then LLM, which stands for large language model? 

BALI: So universal, the “V” comes from there. Empowerment, “e” comes from there. Through large language models … 

HUIZINGA: Got it. OK. But you shortened it to VeLLM. 

BALI: Yeah. 

HUIZINGA: OK.

BALI: So, so the thing with VeLLM is that a bunch of us got together just when this whole GPT was released, etc. We have a very strong group that works on technologies for empowerment in the India lab, Microsoft Research India. And we got together to see what it is that we can do now that we have access to such a strong and powerful tool. And we started thinking of the work that we’ve been doing, which is to, you know, build these technologies for specific areas and specific languages, specific demographies. So we, kind of, put all that knowledge and all that experience we had and thought of like, how can we scale that, really, across everything that we do? So VeLLM, at its base, you know, takes a GPT-like LLM, you know, as a horizontal across everything. On top of it, we have, again, horizontals of machine learning, of multilingual tools and processes, which allow us to take the outputs from, say, GPT-like things and adapt them to different languages or, you know, some different kind of domain, etc. And then we have verticals on top of it, which allow people to build specific applications. 

HUIZINGA: Let me just go back and say GPT … I think most of our audience will know that that stands for generative pretrained transformer models. But just so we have that for anyone who doesn’t know, let’s anchor that. So VeLLM basically was an enabling platform … 

BALI: Yes. 

HUIZINGA: … on which to build specific technologies that would solve problems in a vertical application. 

BALI: Yes. Yes. And because it’s a platform, we’re also working on tools that are needed across domains … 

HUIZINGA: Oh, interesting. 

BALI: … as well as tools that are needed for specific domains. 

HUIZINGA: OK, so let’s talk about some of the specifics because we could get into the weeds on the tools that everybody needs, but I like the ideas that you’re working on and the specific needs that you’re meeting, the felt-need thing that gets an idea going. So talk about this project that you’ve worked on called Kahani. Could you explain what that is, and how it works? It’s really interesting to me. 

BALI: So Kahani, actually, is about storytelling, culturally appropriate storytelling, with spectacular images, as well as like textual story. 

HUIZINGA: So visual storytelling? 

BALI: Visual storytelling with the text. So this actually started when my colleague Sameer Segal was trying to use generative AI to create stories for his daughter, and he discovered that, you know, things are not very culturally appropriate! So I’ll give an example that, you know, if you want to take Frozen and take it to, like, the south Indian state of Kerala, you’ll have the beaches of Kerala, you’ll even have the coconut trees, but then you will have this blond princess in a princess gown … 

HUIZINGA: Sure …

BALI: … who’s there, right? So that’s where we started discussing this, and we, kind of, started talking about, how can we create visuals that are anchored on text of a story that’s culturally appropriate? So when we’re talking about, say, Little Red Riding Hood, if we ask the generative AI model, OK, that I want the story of Little Red Riding Hood but in an Indian context, it does a fantastic job. It actually gives you a very nice story, which, you know, just reflects the Red Riding Hood story into an Indian context. But the images don’t really … 

HUIZINGA: Match … [LAUGHTER] 

BALI: … Match at all. So that’s where the whole Kahani thing started. And we did a hackathon project on it. And then a lot of people got interested. It’s an ongoing project, so I won’t say that it’s out there yet, but we are very excited about it, because, think of it, we can actually create stories for children, you know, which is what we started with, but we can create so much more media, so much more culturally appropriate storytelling, which is not necessarily targeted at children. 

HUIZINGA: Yeah, yeah. 

BALI: So that’s what Kahani is about. 

HUIZINGA: OK. And I saw a demo of it that your colleague did for Research Forum here, and there was an image of a girl—it was beautiful—and then there was a mask of some kind or a … what was that? 

BALI: So the mask is called Nazar Battu, which is actually, you have these masks which are supposed to drive away the evil eye. So that’s what the mask was about. It’s a very Indian thing. You know, when you build a nice house, you put one on top of it so that the envious glances are, like, kept at bay. So, yeah, so that’s what it was. 

HUIZINGA: And was there some issue of the generative AI not really understanding what that was? 

BALI: No, it didn’t understand what it was. 

HUIZINGA: So then can you fix that and make it more culturally aware? 

BALI: So that’s what we are trying to do for the image thing. So we have another project on culture awareness where we are looking at understanding how much generative AI knows about other cultures. 

HUIZINGA: Interesting. 

BALI: So that’s a simultaneous project that’s happening. But in Kahani, a lot of it is, like, trying to get reference images, you know … 

HUIZINGA: Yeah. … Into the system? 

BALI: Into the system … 

HUIZINGA: Gotcha … 

BALI: … and trying to anchor on that. 

HUIZINGA: Mmmm. So—and we’re not going to talk about that project, I don’t think—but … how do you assess whether an AI knows? By just asking it? By prompting and seeing what happens? 

BALI: Yeah, yeah, yeah. So in another project, what we did was, we asked humans to play a game to get cultural artifacts from them. The problem with asking humans what cultural artifacts are important to them is we don’t think of like things as culture, right. [LAUGHS] This is food! 

HUIZINGA: It’s just who we are! 

BALI: This is my food. Like, you know, it’s not a culturally important artifact. This is how I greet my parents. It’s not like culturally … 

HUIZINGA: So it’s just like fish swimming in water. You don’t see the water. 

BALI: Exactly. So we gamified this thing, and we were able to get certain cultural artifacts, and we tried to get generative AI models to tell us about the same artifacts. And it didn’t do too well … [LAUGHS] 

HUIZINGA: But that’s why it’s research! 

BALI: Yes! 

HUIZINGA: You try, you iterate, you try again … cool. As I mentioned earlier, I was a high school English teacher and an English major. I’m not correcting your grammar because it’s fantastic.

BALI: Thank you. 

HUIZINGA: But as a former educator, one of the projects I felt was really compelling that you’re working on is called Shiksha. It’s a copilot in education. Tell our audience about this.

BALI: So this is actually our proof of concept for the VeLLM platform. Since almost all of us were interested in education, we decided to go for education as the first use case that we’re going to work on. And actually, it was a considered decision to go target teachers instead of students. I mean, you must have seen a lot of work being done on taking generative AI to students, right. But we feel that, you know, teachers are necessary to teach because they’re not just giving you information about the subject. They’re giving you skills to learn, which hopefully will stay with you for a lifetime, right. And if we enable teachers, they will enable so many hundreds of students. One teacher can enable thousands of students, right, over her career. So instead of, like, going and targeting students, if we make it possible for teachers to do their jobs more effectively or, like, you know, help them get over the problems they have, then we are actually creating an ecosystem where things will scale really fast, really quickly. And in India, you know, this is especially true because the government has actually come up with some digital resources for teachers to use, but there’s a lot more that can be done. So we interviewed about a hundred-plus teachers across different parts of the country. And this is the, you know, discover part. 

HUIZINGA: Yeah! 

BALI: And we found out that lesson plans are a big headache! [LAUGHS] 

HUIZINGA: Yes, they are! Can confirm! 

BALI: Yeah. And they spend a lot of time doing lesson plans because they’re required to create a lesson plan for every class they teach … 

HUIZINGA: Sure. With learning outcomes … 

BALI: Exactly. 

HUIZINGA: All of it. 

BALI: All of it. So that’s where we, you know, zeroed in on—how to make it easier for teachers to create lesson plans. And that’s what the Shiksha project is about. You know, there is an enrollment process where the teachers say what subject they’re teaching, what classes they’re teaching, what boards, because there are different boards of education … 

HUIZINGA: Right … 

BALI: … which have different syllabi. So all that. But after that, it takes less than seven minutes for a teacher to create an entire lesson plan for a particular topic. You know, class assignments, class activities, home assignments, homework—everything! Like the whole thing in seven minutes! And these teachers have the ability to go and correct it. Like, it’s an interactive thing. So, you know, they might say, I think this activity is too difficult for my students. 

HUIZINGA: Yeah … 

BALI: Can I have, like, an easier one? Or, can I change this to this? So it allows them to interactively personalize, modify the plan that’s put out. And I find that really exciting. And we’ve tested this with the Sikshana Foundation, which works with teachers in India. The teachers are very excited, and now Sikshana wants to scale it to other schools. 

HUIZINGA: Right … well, my first question is, where were you when I was teaching, Kalika? 

BALI: There was no generative AI! 

HUIZINGA: No. In fact, we just discovered the fax machine when I started teaching. Oh, that dates me! You know, back to what you said about teachers being instrumental in the lives of their students. You know, we can remember our favorite teachers, our best teachers. We don’t remember a machine. 

BALI: No.

HUIZINGA: And what you’ve done with this is to embody the absolute sort of pinnacle of what AI can do, which is to be the collaborator, the assistant, the augmenter, and the helper so that the teacher can do that inspirational, connective-tissue job with the students without having to, like, sacrifice the rest of their life making lesson plans and grading papers. Oh, my gosh. OK. On the positive side, we’ve just talked about what this work proposes and how it’s good, but I always like to dig a little bit into the potential unintended consequences and what could possibly go wrong if, in fact, you got everything right. So I’ll anchor this in another example. When GPT models first came out, the first reaction came from educators. It feels like we’re in a bit of a paradigm shift like we were when the calculator and the internet even came out. [It’s] like, how do we process this? So I want to go philosophically here and talk about how you foresee us adopting and moving forward with generative AI in education, writ large. 

BALI: Yeah, I think this is a question that troubles a lot of us and not just in education, but in all spheres that generative AI is … 

HUIZINGA: Art … 

BALI: … art … 

HUIZINGA: … writing … 

BALI: … writing … 

HUIZINGA: … journalism … 

BALI: Absolutely. And I think the way I, kind of, think about it in my head is it’s a tool. At the end of it, it is a tool. It’s a very powerful tool, but it is a tool, and humans must always have the agency over it. And we need to come up, as a society, you know, we need to come up with the norms of using the tool. And if you think about it, you know, internet, taking internet as an example, there is a lot of harm that internet has propagated, right. The darknet and all the other stuff that happens, right. But on the whole, there are regulations, but there is also an actual consensus around what constitutes the positive use of the internet, right. 

HUIZINGA: Sure, yeah. 

BALI: Nobody says that, for example, deepfakes are … 

HUIZINGA: Mm-hmm. Good … 

BALI: … good, right. So we have to come from there and think about what kind of regulations we need to have in place, what kind of consensus we need to have in place, what’s missing. 

HUIZINGA: Right. Another project that has been around, and it isn’t necessarily on top of VeLLM, but it’s called Karya, and you call it a social impact organization that serves not just one purpose, but three. Talk about that. 

BALI: Oh, Karya is my favorite! [LAUGHS] So Karya started as a research project within Microsoft Research India, and this was the brainchild again of my colleague—I have like some of the most amazing colleagues, too, that I work with!—called Vivek Seshadri. And Vivek wanted to create, you know, digital work for people who do not have access to such work. So he wanted to go to the rural communities, to people who belong to slightly lower socioeconomic demographies, and provide work, like microtasks kind of work, gig work, to them. And he was doing this, and then we started talking, and I said, you know, we need so much data for all these languages and all these different tasks, and that could be, like, a really cool thing to try on Karya, and that’s where it all started, my involvement with Karya, which is still pretty strong. And Karya then became such a stable project that Microsoft Research India spun it out. So it’s now its own standalone startup, like a social enterprise, and they work on providing digital work. They work on providing skills, like upskilling. They work on awareness, like, you know, making people aware of certain social, financial, other such trainings. So what’s been most amazing is that Karya has been able to essentially collect data for AI in the most ethical way possible. They pay their workers a little over the minimum wage. They also have something called data ownership practice, where the data that is created by, say, me, I have some sort of ownership on it. So what that means is that every time Karya sells a dataset, a royalty comes back … 

HUIZINGA: No … ! 

BALI: Yeah! To the workers. 

HUIZINGA: OK, we need to scale this out! [LAUGHS] OK. So to give a concrete example, the three purposes would be educational, financial—on their end—and data collection, which would ultimately support a low-resource language by having digital assets.

BALI: Absolutely! 

HUIZINGA: So you could give somebody something to read in their language … 

BALI: Yeah. 

HUIZINGA: … that would educate them in the process. They would get paid to do it, and then you would have this data. 

BALI: Yes! 

HUIZINGA: OK. So cool. So simple. 

BALI: Like I said, it’s my favorite project. 

HUIZINGA: I get that. I totally get that. 

BALI: And they … they’ve been, you know, they have been winning awards and things all over for the work that they’re doing right now. And I am very involved in one project with them, which is to do with gender-intentional AI, or gender-intentional datasets for AI, for Indian languages. And that’s really crucial because, you know, we talk about gender bias in datasets, etc., but all that understanding comes from a very Western perspective and for languages like English, etc. They do not translate very well to Indian languages. 

HUIZINGA: Right. 

BALI: And in this particular project, we’re looking at first, how to define gender bias. How do we even get data around gender bias? What does it even mean to say that technology is gender intentional? 

HUIZINGA: Right. All right, well, let’s talk a little bit about what I like to call outrageous ideas. And these are the ones that, you know, on the research spectrum from sort of really practical applied research to blue sky get dismissed or viewed as unrealistic or unattainable. So years ago—here’s a little story about you—when you told your tech colleagues that you wanted to work with the world’s most marginalized languages, they told you you’d only marginalize yourself. 

BALI: Yes! 

HUIZINGA: But you didn’t say no. You didn’t say no. Um, two questions. Did you feel like your own idea was outrageous back then? And do you still have anything outrageous yet to accomplish in this plan? 

BALI: Oh, yeah! I hope so! Yeah. No, I do think, in some sense, the pushback that I got for my idea makes me think it was outrageous. I didn’t think it was outrageous at all at that time! [LAUGHS] I thought it was a very reasonable idea! But there was a very solid pushback and not just from your colleagues. You know, for researchers, publishing papers is important! No one would publish a paper which focused only on, say, Indian languages or low-resource languages. We’ve come a very long way even in the research community on that, right. We kept pushing, pushing, pushing! And now, there are tracks, there are workshops, there are conferences which are devoted to multilingual and low-resource languages. When I said I wanted to work on Hindi, and Hindi is the biggest language in India, right. And even for that, I was told, why don’t you work on German instead? And I’m like, there are lots of people working on German who will solve the problems with German! Nobody is looking at Hindi! I mean, people should work on all the languages. People should work on German, but I don’t want to work on German! So there was a lot of pushback back then, and I see a little bit of that with the very low-resource languages even now. And I think some people think it’s a “feel-good” thing, whereas I think it’s not. I think it’s a very economically viable, necessary thing to build technology for these communities, for these languages. No one thought Hindi was economically viable 15 years ago, for whatever reason … 

HUIZINGA: That … that floors me … 

BALI: Yeah, but, you know, we’re not talking about tens of thousands of people in some of these languages; we’re talking about millions. 

HUIZINGA: Yeah. 

BALI: I still think that is a job that I need to continue, you know, pushing back on. 

HUIZINGA: Do you think that any of that sort of outrageous reaction was due to the fact that the technology wasn’t as advanced as it is now and that it might have changed in terms of what we can do? 

BALI: There was definitely the aspect of technology there that it was just quite difficult and very, very resource-intensive to build it for languages which did not have resources. You know, there was a time when we were talking about how to go about doing this, and because people in various big tech companies did not really remember a time when, for English, they had to start data collection from scratch, because everyone who was working on, say, English at that time was building on what people had done years and years ago. So they could not even conceptualize that you had to start from scratch for anything, right. But now with the technology as well, I’m quite optimistic and trying to think of how cool it would be to do, you know, smaller data collections and fine-tuned models specifically and things like that, so I think that the technology is definitely one big thing, but economics is a big factor, too. 

HUIZINGA: Mmm-hmm. Well, I’m glad that you said it isn’t just the feel-good thing, but it actually would make economic sense because that’s some of the driver behind what technologies get “greenlit,” as it were. Is there anything outrageous now that you could think of that, even to you, sounds like, oh, we could never do that … 

BALI: Well … I didn’t think HAL was outrageous, so I’m not … [LAUGHS] 

HUIZINGA: Back to HAL 9000! [LAUGHS] 

BALI: Yeah, so I don’t think of things as outrageous or not. I just think of things as things that need to get done, if that makes any sense? 

HUIZINGA: Totally. Maybe it’s, how do we override “Open the pod bay door, HAL”—“No, I’m sorry, Dave. I can’t do that”? [LAUGHS] 

BALI: Yes. [LAUGHS] Yeah… 

HUIZINGA: Well, as we close—and I’m sad to close because you are so much fun—I want to do a little vision casting, but in reverse. So let’s fast-forward 20 years and look back. How have the big ideas behind your life’s work impacted the world, and how are people better off or different now because of you and the teams that you’ve worked with? 

BALI: So the way I see it is that people across the board, irrespective of the language they speak, the communities they belong to, the demographies they represent, can use technology to make their lives, their work, better. I know it sounds really big and almost too good to be true, but that’s what I’m aiming for. 

HUIZINGA: Well, Kalika Bali, I’m so grateful I got to talk to you in person. And thanks for taking time out from your busy trip from India to sit down with me and our audience and share your amazing ideas. 

[MUSIC PLAYS] 

BALI: Thank you so much, Gretchen.

[MUSIC FADES] 

The post Ideas: Language technologies for everyone with Kalika Bali appeared first on Microsoft Research.

Research Focus: Week of April 1, 2024

Research Focus: Week of April 1, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Research Focus April 1, 2024

LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error

In the same way that tools can help people complete tasks beyond their innate abilities, tools are essential for large language models (LLMs) to acquire up-to-date information and take consequential actions in external environments. Existing work on tool-augmented LLMs primarily focuses on the broad coverage of tools and the flexibility of adding new tools. However, a surprisingly understudied question is how accurately an LLM uses tools for which it has been trained.

In a recent paper: LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error, researchers from Microsoft find that existing LLMs, including GPT-4 and open-source LLMs specifically fine-tuned for tool use, only reach a correctness rate of 30% to 60%, which is too unreliable for practical use. They propose a biologically inspired method for tool-augmented LLMs – simulated trial and error (STE) – that orchestrates three key mechanisms: trial and error, imagination, and memory. STE simulates plausible scenarios for using a tool; the LLM then interacts with the tool and learns from its execution feedback. Both short-term and long-term memory are employed to improve the depth and breadth of the exploration. Experiments on ToolBench show STE substantially improves tool learning for LLMs under both in-context learning and fine-tuning settings.
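To make the loop concrete, here is a minimal, hypothetical sketch of how a simulated trial-and-error procedure could be organized. The stub functions and memory format below are assumptions made for illustration, not the paper's implementation or APIs.

```python
# A minimal, hypothetical sketch of a simulated trial-and-error loop.
# `query_llm` and `sample_tool` are stand-in stubs, not real APIs; the paper's
# actual prompts, tool interface, and memory formats differ.

import json
import random

def query_llm(prompt: str) -> str:
    """Stand-in for a call to a large language model (assumption)."""
    return f"simulated response to: {prompt[:40]}..."

def sample_tool(name: str, args: dict) -> dict:
    """Stand-in for executing an external tool and returning execution feedback."""
    return {"tool": name, "args": args, "ok": random.random() > 0.3}

def ste_explore(tool_name: str, episodes: int = 3) -> list:
    long_term_memory = []                       # distilled successful examples
    for _ in range(episodes):
        # "Imagination": ask the LLM to propose a plausible usage scenario.
        scenario = query_llm(f"Imagine a realistic user query that needs the tool '{tool_name}'.")
        short_term_memory = []                  # trials within this episode
        for _ in range(2):                      # a couple of trial-and-error attempts
            attempt = query_llm(
                f"Given the scenario '{scenario}', propose arguments for '{tool_name}', "
                f"taking earlier attempts into account: {json.dumps(short_term_memory)}"
            )
            feedback = sample_tool(tool_name, {"raw": attempt})   # execution feedback
            short_term_memory.append({"attempt": attempt, "feedback": feedback})
            if feedback["ok"]:
                break
        # Keep successful trials as long-term memory for later in-context learning or fine-tuning.
        long_term_memory.extend(t for t in short_term_memory if t["feedback"]["ok"])
    return long_term_memory

if __name__ == "__main__":
    examples = ste_explore("weather_lookup")
    print(f"collected {len(examples)} usable examples")
```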

Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks

The latest LLMs have surpassed the performance of older language models on several tasks and benchmarks, sometimes approaching or even exceeding human performance. Yet, it is not always clear whether this is due to the increased capabilities of these models, or other effects, such as artifacts in datasets, test dataset contamination, and the lack of datasets that measure the true capabilities of these models.

As a result, research to comprehend LLM capabilities and limitations has surged of late. However, much of this research has been confined to English, leaving LLM building and evaluation for non-English languages relatively unexplored. Several new LLMs have been introduced recently, necessitating their evaluation on non-English languages. In a recent paper: MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks, researchers from Microsoft aim to perform a thorough evaluation of the non-English capabilities of state-of-the-art LLMs (GPT-3.5-Turbo, GPT-4, PaLM2, Mistral, Gemini, Gemma and Llama2) by comparing them on the same set of multilingual datasets. Their benchmark comprises 22 datasets covering 81 languages including several low-resource African languages. They also include two multimodal datasets in the benchmark and compare the performance of LLaVA-v1.5 and GPT-4-Vision. Experiments show that GPT-4 and PaLM2 outperform the Llama and Mistral models on various tasks, notably on low-resource languages, with GPT-4 outperforming PaLM2 on more datasets. However, issues such as data contamination must be addressed to obtain an accurate assessment of LLM performance on non-English languages.
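As a rough illustration of what evaluating many models on the same multilingual test suite involves, the sketch below outlines a simple harness. It is not MEGAVERSE's actual code, and the model identifiers, datasets, and metric shown are placeholders chosen for this sketch.

```python
# Illustrative sketch of a shared multilingual evaluation harness: every model is
# scored on the same datasets and languages so results are directly comparable.
# Model names, dataset/language lists, and the generate/score stubs are
# placeholders, not MEGAVERSE's actual code, prompts, or metrics.

from collections import defaultdict

MODELS = ["gpt-4", "palm2", "llama2-70b", "mistral-7b"]                 # assumed identifiers
DATASETS = {"xnli": ["sw", "hi", "th"], "xquad": ["ar", "de", "hi"]}    # dataset -> languages (illustrative)

def generate(model: str, dataset: str, language: str, example: dict) -> str:
    """Stand-in for prompting a model on one example (assumption)."""
    return "predicted answer"

def score(prediction: str, reference: str) -> float:
    """Exact-match scoring stub; real benchmarks use task-specific metrics."""
    return float(prediction.strip().lower() == reference.strip().lower())

def evaluate(examples_by_dataset: dict) -> dict:
    """Return {model: {(dataset, language): accuracy}} over the shared test suite."""
    results = defaultdict(dict)
    for model in MODELS:
        for dataset, languages in DATASETS.items():
            for lang in languages:
                examples = examples_by_dataset.get((dataset, lang), [])
                scores = [score(generate(model, dataset, lang, ex), ex["answer"]) for ex in examples]
                results[model][(dataset, lang)] = sum(scores) / len(scores) if scores else None
    return dict(results)

if __name__ == "__main__":
    toy_suite = {("xnli", "sw"): [{"premise": "...", "hypothesis": "...", "answer": "entailment"}]}
    print(evaluate(toy_suite))
```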


Training Audio Captioning Models without Audio

Automated Audio Captioning (AAC) is a process that creates text descriptions for audio recordings. Unlike Closed Captioning, which transcribes speech, AAC aims to describe all sounds in the audio (e.g., a muffled rumble with people talking in the background while a siren blares in the distance). Typical AAC systems require expensive curated data of audio-text pairs, which often results in a shortage of suitable data, impeding model training.

In this paper: Training Audio Captioning Models without Audio, researchers from Microsoft and Carnegie Mellon University propose a new paradigm for training AAC systems, using text descriptions alone, thereby eliminating the requirement for paired audio and text descriptions. Their approach leverages CLAP, a contrastive learning model that uses audio and text encoders to create a shared vector representation between audio and text. For instance, the text “siren blaring” and its corresponding audio recording would share the same vector. The model is trained on text captions: a GPT language decoder generates captions conditioned on the pretrained CLAP text encoder and a mapping network. During inference, audio input is first converted to its vector using the pretrained CLAP audio encoder and then a text caption is generated.
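The prefix-mapping idea, training a caption decoder on CLAP text embeddings and then feeding it CLAP audio embeddings at inference time, can be sketched roughly as follows. This is a conceptual sketch under assumed dimensions with stub encoders, not the authors' implementation; the GPT decoder itself is omitted.

```python
# Conceptual PyTorch sketch of text-only training for audio captioning, assuming
# a shared CLAP embedding space. The encoder classes below are random stand-ins
# for the pretrained CLAP encoders, and the GPT decoder is omitted; only the
# prefix-mapping idea is shown.

import torch
import torch.nn as nn

EMB_DIM, PREFIX_DIM = 512, 768   # assumed dimensions

class StubCLAPTextEncoder(nn.Module):
    """Placeholder for the pretrained CLAP text encoder (returns random vectors)."""
    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return torch.randn(token_ids.size(0), EMB_DIM)

class StubCLAPAudioEncoder(nn.Module):
    """Placeholder for the pretrained CLAP audio encoder (returns random vectors)."""
    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        return torch.randn(waveform.size(0), EMB_DIM)

class MappingNetwork(nn.Module):
    """Projects a CLAP embedding into the prefix space a GPT decoder would consume."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(EMB_DIM, PREFIX_DIM), nn.GELU(), nn.Linear(PREFIX_DIM, PREFIX_DIM)
        )
    def forward(self, clap_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(clap_emb)

text_encoder, audio_encoder, mapper = StubCLAPTextEncoder(), StubCLAPAudioEncoder(), MappingNetwork()

# Training (text only): caption -> CLAP text embedding -> prefix for the caption decoder.
caption_tokens = torch.randint(0, 1000, (4, 16))        # toy batch of tokenized captions
text_prefix = mapper(text_encoder(caption_tokens))       # decoder is trained conditioned on this

# Inference (audio in): waveform -> CLAP audio embedding -> the same mapper -> decoder.
waveform = torch.randn(1, 16000)                         # one second of 16 kHz audio (toy input)
audio_prefix = mapper(audio_encoder(waveform))           # swap in audio embeddings at test time
print(text_prefix.shape, audio_prefix.shape)
```

Because both encoders target the same embedding space, the mapper trained on text embeddings can, in principle, be reused unchanged when audio embeddings arrive at inference, which is what removes the need for paired audio-text training data.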

The researchers find that the proposed text-only framework competes well with top-tier models trained on both text and audio, showing that effective audio captioning models can be trained without audio data. They also demonstrate the ability to incorporate various writing styles, such as humorous captions, which is beneficial for tailoring caption generation to specific fields. Finally, they highlight that enriching training with LLM-generated text leads to improved performance and has potential for increasing vocabulary diversity.

The post Research Focus: Week of April 1, 2024 appeared first on Microsoft Research.
