Project Silica: Sustainable cloud archival storage in glass

This research paper was presented at the 29th ACM Symposium on Operating Systems Principles (SOSP 2023), the premier forum for the theory and practice of computer systems software.

Data growth demands a sustainable archival solution

For millennia, data has woven itself into every facet of our lives, from business and academia to personal spheres. Our production of data is staggering, encompassing personal photos, medical records, financial data, scientific insights, and more. By 2025, it’s estimated that we will generate a massive 175 zettabytes of data annually. Amidst this deluge, a substantial portion is vital for preserving our collective heritage and personal histories.  

Presently, magnetic technologies like tape and hard disk drives provide the most economical storage, but they come with limitations. Magnetic media lacks the longevity and durability essential for enduring archival storage, so data must be periodically migrated to new media: roughly every five years for hard disk drives and every ten for magnetic tape. Moreover, ensuring data longevity on magnetic media requires regular “scrubbing,” a process of reading data to identify corruption and fix any errors. This leads to substantial energy consumption. We need a sustainable solution, one that ensures the preservation of our digital heritage without imposing an ongoing environmental and financial burden.

Project Silica: Sustainable and durable cloud archival storage

Our paper, “Project Silica: Towards Sustainable Cloud Archival Storage in Glass,” presented at SOSP 2023, describes Project Silica, a cloud-based storage system underpinned by quartz glass. This type of glass is a durable, chemically inert, and resilient low-cost medium that is impervious to electromagnetic interference. Because data written into quartz glass can last thousands of years, it is ideal for archival storage, offering a sustainable solution that eliminates the need for periodic data refreshes.

Writing, reading, and decoding data

Ultrafast femtosecond lasers enable the writing process. Data is written inside a square glass platter, similar in size to a DVD, as voxels: permanent modifications to the physical structure of the glass made using femtosecond-scale laser pulses. Voxels encode multiple bits of data and are written in 2D layers across the XY plane; hundreds of these layers are then stacked along the Z axis. To achieve high write throughput, we rapidly scan the laser pulses across the length of the media using a scanner similar to those used in barcode readers.
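The paper specifies the actual geometry and encoding; purely as an illustration of the idea of a three-dimensional voxel address space (with made-up dimensions, not Silica's real layout), a logical-to-physical mapping might look like this:

```python
# Illustrative only: map a logical voxel index to (layer, row, column) coordinates
# for a platter written as 2D layers stacked along the Z axis. The dimensions are
# arbitrary placeholders, not Project Silica's actual geometry or density.
VOXELS_PER_ROW = 10_000      # voxels per laser sweep along the X axis
ROWS_PER_LAYER = 10_000      # sweeps per 2D layer in the XY plane
LAYERS_PER_PLATTER = 100     # layers stacked along the Z axis

def voxel_coordinates(index: int) -> tuple[int, int, int]:
    """Return (layer, row, column) for the index-th voxel on a platter."""
    voxels_per_layer = VOXELS_PER_ROW * ROWS_PER_LAYER
    layer, within_layer = divmod(index, voxels_per_layer)
    if layer >= LAYERS_PER_PLATTER:
        raise ValueError("index exceeds platter capacity")
    row, column = divmod(within_layer, VOXELS_PER_ROW)
    return layer, row, column
```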

To read data, we employ polarization microscopy to image the platter. The read drive scans sectors in a single swift Z-pattern, and the resulting images are processed for decoding. Different read drive options offer varying throughput, balancing cost and performance.

Data decoding relies on ML models that analyze images captured by the read drive, accurately converting analog signals to digital data. The glass library design includes independent read, write, and storage racks. Platters are stored in power-free storage racks and moved by free-roaming shuttles, ensuring minimal resource consumption for passive storage, as shown in Video 1. A one-way system between write racks and the rest of the library ensures that a written platter cannot be overwritten under any circumstances, enforcing data integrity.

Video 1. The Silica library prototype demonstrates the flexible and scalable design of the system and its ability to sustainably service archival workloads. 

Azure workload analysis informs Silica’s design

To build an optimal storage system around the core Silica technology, we extensively studied cloud archival data workloads from Azure Storage. Surprisingly, we discovered that small read requests dominate the read workload, yet a small percentage of requests constitute the majority of read bytes, creating a skewed distribution, as illustrated in Figure 1.

Chart data: 58% of read operations are for files smaller than 4 MB yet account for less than 1.2% of all bytes read, while 85% of bytes read come from files larger than 256 MB, which make up less than 2% of requests.
Figure 1. The distribution of read request sizes. Most requests are for small files, but they make up a small percentage of the total load in bytes.
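As a sketch of how such a skew can be measured from a request trace (synthetic request sizes here stand in for the Azure Storage traces analyzed in the paper), one can bucket requests by size and compare each bucket's share of operations with its share of bytes:

```python
import numpy as np

# Hypothetical read-request sizes in bytes; the paper's analysis uses real
# archival traces from Azure Storage.
request_sizes = np.random.lognormal(mean=12, sigma=3, size=100_000)

bucket_edges = [0, 4 * 2**20, 256 * 2**20, np.inf]   # 4 MB and 256 MB cut-offs
labels = ["< 4 MB", "4 MB - 256 MB", "> 256 MB"]

for label, lo, hi in zip(labels, bucket_edges[:-1], bucket_edges[1:]):
    mask = (request_sizes >= lo) & (request_sizes < hi)
    pct_ops = 100 * mask.mean()                                        # share of operations
    pct_bytes = 100 * request_sizes[mask].sum() / request_sizes.sum()  # share of bytes
    print(f"{label:>14}: {pct_ops:5.1f}% of operations, {pct_bytes:5.1f}% of bytes")
```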

This implies that minimizing the latency of mechanical movement in the library is crucial for optimal performance. Silica glass, a random-seeking storage medium, can suitably meet these requirements as it eliminates the necessity for spooling, unlike magnetic tape. Figure 2 illustrates substantial differences in read demand across various datacenters. These results suggest that we need a flexible library design that can scale resources for each datacenter’s workload. Studying these archival workloads has been instrumental in helping us establish the core design principles for the Silica storage system.

Chart data: tail-over-median read throughput for unlabeled datacenters on a log scale, showing up to seven orders of magnitude difference within a single datacenter and up to five orders of magnitude variability across datacenters.
Figure 2. Tail over median read load for different datacenters. The data shows significant variation across and within datacenters.

Project Silica’s versatile storage system

We designed and evaluated a comprehensive storage system that manages error correction, data layout, request scheduling, and shuttle traffic management. Our design effectively handles IOPS-intensive workloads, meeting the expected service level objective (SLO) of an archival storage tier, approximately 15 hours. Even in volume-intensive scenarios where a large number of bytes are read, the system services requests efficiently using read drives with low throughput. In both cases, throughput demands are significantly below those of traditional tape drives, as shown in Figure 3. The paper provides an extensive description of this system, and the video above shows our prototype library’s capabilities.

Chart data: tail completion time versus read-drive throughput (30 MB/s to 210 MB/s) for Volume, IOPS, and Typical workloads. All workloads complete within the desired 15-hour SLO even with 30 MB/s read drives, and completion time improves with drive throughput before plateauing past 60 MB/s.
Figure 3. Volume and IOPS workloads represent different extremes in the spectrum of read workloads. Our design can service both workloads well within the expected SLO for an archival storage tier, at about 15 hours.

Diverse applications for sustainably archiving humanity’s data

Project Silica holds promise in numerous sectors, such as healthcare, scientific research, and finance, where secure and durable archival storage of sensitive data is crucial. Research institutions could benefit from Silica’s ability to store vast datasets generated from experiments and simulations, ensuring the integrity and accessibility of research findings over time. Similarly, healthcare organizations could securely archive patient records, medical imaging data, and research outcomes for long-term reference and analysis. 

As the volume of globally generated data grows, traditional storage solutions will continue to face challenges in terms of scalability, energy-efficiency, and long-term durability. Moreover, as technologies like AI and advanced analytics progress, the need for reliable and accessible archival data will continue to intensify. Project Silica is well-positioned to play a pivotal role in supporting these technologies by providing a stable, secure, and sustainable repository for the vast amounts of data we create and rely on.


Research Focus: Week of October 23, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Research Focus: October 25, 2023

NEW RESEARCH

Kosmos-2.5: A Multimodal Literate Model 

Current large language models (LLMs) primarily focus on textual information and cannot understand visual information. However, advancements in the field of multimodal large language models (MLLMs) aim to address this limitation. MLLMs combine visual and textual information within a single Transformer-based model, enabling the model to learn and generate content based on both modalities.

While existing MLLMs have mainly focused on natural images with lower resolutions, the exploration of text images requires further investigation. Incorporating text images into the training process and developing models based on textual and visual information can unlock new possibilities for multimodal applications involving high-resolution text-intensive images.

In a new paper: Kosmos-2.5: A Multimodal Literate Model, researchers from Microsoft present Kosmos-2.5, an MLLM for machine reading of text-intensive images. Pre-trained on large-scale text-intensive images, Kosmos-2.5 excels in: (1) generating spatially aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures in Markdown format. The model can be adapted for any text-intensive image understanding task with different prompts through supervised fine-tuning. This work paves the way for the future scaling of MLLMs.

NEW RESEARCH

Evaluation of Dependency Structure for Multivariate Weather Predictors using Copulas

In the Global South, climate change is driving more frequent and severe weather events such as droughts, floods, and storms. This leads to crop failures, food insecurity, and job loss. These effects are expected to increase in intensity, further disadvantaging marginalized communities and exacerbating existing inequalities. The need for prevention and adaptation is urgent. But despite advances in machine learning and numerical modeling, accurate weather forecasting remains challenging, due to complex interactions among atmospheric and oceanic variables.

In a new paper: Evaluation of Dependency Structure for Multivariate Weather Predictors using Copulas, researchers from Microsoft explore the potential of vine copulas to explain complex relationships of different weather variables in three African locations. Copulas separate marginal distributions from the dependency structure, offering a flexible way to model dependence between random variables for improved risk assessments and simulations. Vine copulas are based on a variety of bivariate copulas, including Gaussian, Student’s t, Clayton, Gumbel, and Frank copulas. They are effective in high-dimensional problems and offer a hierarchy of trees to express conditional dependence. The researchers propose applying this framework within subseasonal forecasting models to enhance the prediction of different weather events or variables.
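As a minimal sketch of the underlying copula idea (a single bivariate Gaussian copula fit to synthetic data, rather than the vine construction and real weather variables used in the paper):

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for two dependent weather variables, e.g. humidity and rainfall.
rng = np.random.default_rng(0)
humidity = rng.gamma(shape=5.0, scale=10.0, size=2_000)
rainfall = 0.3 * humidity + rng.gamma(shape=2.0, scale=5.0, size=2_000)

# Step 1: separate out the marginals by mapping each variable to uniform
# pseudo-observations via its empirical ranks.
def to_uniform(x):
    return stats.rankdata(x) / (len(x) + 1)

u, v = to_uniform(rainfall), to_uniform(humidity)

# Step 2: model the remaining dependence with a Gaussian copula, i.e. estimate the
# correlation of the normal scores. Vine copulas generalize this by combining many
# bivariate copulas (Gaussian, Student's t, Clayton, Gumbel, Frank, ...) in a
# hierarchy of trees that expresses conditional dependence.
z_u, z_v = stats.norm.ppf(u), stats.norm.ppf(v)
rho = np.corrcoef(z_u, z_v)[0, 1]
print(f"Gaussian-copula dependence parameter: rho = {rho:.2f}")
```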


NEW RESEARCH

Adaptive Training System

Adaptive training has been defined as training in which the problem, stimulus, or task is varied as a function of how well the trainee performs. Researchers have shown that this type of training outperforms comparative training that is non-adaptive or fixed across a range of populations and learning contexts. Virtual reality offers new opportunities for applying this type of training and has already demonstrated its effectiveness across a variety of simulated tasks. By using a computational model of the training process, we can derive recommendations for optimal scenario difficulty, resulting in faster and enhanced training.

In a new paper: Adaptive Training System, researchers from Microsoft propose an adaptive training algorithm that accelerates the training process based on a parametric model of trainees and training scenarios. The proposed approach makes trial-by-trial recommendations on optimal scenario difficulty selections to maximize improvements in the trainee’s absolute skill level. The Adaptive Training System is applied to the task of training pilots on a virtual reality flight simulator. The system was designed for scenarios varying in difficulty from easy, with full visibility, to flight in fog with side wind, which is difficult even for experienced pilots. 

Adaptive Training System applied to the task of training pilots on a virtual reality flight simulator. On the left, a flight scenario with fog. On the right, a flight scenario with full visibility.
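The paper's parametric model and recommendation policy are specific to the flight-simulator task, but the general shape of an adaptive training loop can be sketched in a few lines. In this hypothetical example, a running skill estimate is updated after each trial and the next scenario is chosen so that its predicted success probability sits near a target challenge level:

```python
import math

def success_probability(skill: float, difficulty: float) -> float:
    """Simple logistic model of trial success given skill and scenario difficulty."""
    return 1.0 / (1.0 + math.exp(-(skill - difficulty)))

def recommend_difficulty(skill: float, difficulties, target: float = 0.7) -> float:
    """Pick the difficulty whose predicted success rate is closest to the target."""
    return min(difficulties, key=lambda d: abs(success_probability(skill, d) - target))

def update_skill(skill: float, difficulty: float, succeeded: bool, lr: float = 0.3) -> float:
    """Nudge the skill estimate toward the observed outcome (a crude online update)."""
    predicted = success_probability(skill, difficulty)
    return skill + lr * ((1.0 if succeeded else 0.0) - predicted)

# Trial-by-trial loop with hypothetical outcomes; difficulties might range from
# full visibility (easy) to fog with side wind (hard).
skill = 0.0
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
for outcome in [True, True, False, True, True, False]:
    d = recommend_difficulty(skill, difficulties)
    skill = update_skill(skill, d, outcome)
    print(f"recommended difficulty {d:+.1f}, updated skill estimate {skill:+.2f}")
```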

NEW RESEARCH

CodePlan: Repository-level Coding using LLMs and Planning

Software engineering activities such as package migration, fixing error reports from static analysis or testing, and adding type annotations or other specifications to a codebase, involve pervasively editing the entire repository of code. These activities are formulated as repository-level coding tasks.

Large language model-powered coding assistants, like GitHub Copilot, have succeeded in offering high-quality solutions to localized coding problems. But repository-level coding tasks are more involved and cannot be solved directly using LLMs, since code within a repository is interdependent and the entire repository may be too large to fit into the prompt.

In a new paper: CodePlan: Repository-level Coding using LLMs and Planning, researchers from Microsoft frame LLM-driven repository-level coding as a planning problem, where the goal is to take the repository from its initial state to a target state whose specifications are provided in natural language. They present CodePlan, a task-agnostic framework that solves this problem by synthesizing a multi-step chain of edits, where each step calls an LLM on a code location with context derived from the entire repository, previous code changes, and task-specific instructions. The researchers evaluate the effectiveness of CodePlan on two repository-level tasks, package migration (C#) and temporal code edits (Python), and show that CodePlan aligns more closely with the ground truth than the baselines.
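The paper describes CodePlan's planning machinery in detail; at a very high level, the idea of driving repository-wide edits as an incremental, dependency-aware plan can be sketched as follows. The helper callables here (context building, impact analysis, the LLM call) are hypothetical stand-ins for what the framework actually implements:

```python
from collections import deque

def repository_level_edit(repo, seed_locations, task_instructions,
                          llm, build_context, analyze_impact):
    """Sketch of a CodePlan-style loop: each edit may invalidate other code
    locations, which are discovered and queued for further edits."""
    frontier = deque(seed_locations)           # code locations still needing edits
    while frontier:
        location = frontier.popleft()
        # Context comes from the whole repository, prior edits, and the task.
        context = build_context(repo, location, task_instructions)
        edit = llm(location, context)          # ask the LLM for an edit here
        repo.apply(edit)
        # Dependency analysis finds locations affected by this edit (e.g. callers
        # of a changed signature) and extends the plan with them.
        frontier.extend(analyze_impact(repo, edit))
    return repo
```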


NEW ARTICLE

The intimacy triple bind: Structural inequalities and relational labor in the influencer industry

Social media content creators, or influencers, depend heavily on their ability to cultivate and maintain an invested audience-community. They are encouraged to practice “relational labor,” commodifying their personalities, lives and tastes in order to build authentic self-brands and intimacy with audiences.

In a new article (opens in new tab), a researcher from Microsoft draws on an ethnographic study of the London influencer industry to examine relational labor through an intersectional feminist lens, exploring the ways in which structural inequalities shape relationships between creators and their audiences. Managing audience relationships is harder for marginalized creators – especially those making stigmatized and less brandable content genres – who are at higher risk of trolling and harassment.

This article explores four key tactics for managing such conditions: (1) leaning into making rather than being content; (2) (dis)engaging with anti-fans through silence; (3) retreating into private community spaces, away from the exposure of public platforms; and, in parallel, (4) turning off public comments.



Abstracts: October 23, 2023

Microsoft Research Podcast - Abstracts

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Andy Gordon, a Partner Research Manager, and Carina Negreanu, a Senior Researcher, both at Microsoft Research, join host Dr. Gretchen Huizinga to discuss “Co-audit: Tools to help humans double-check AI-generated content.” This paper brings together current understanding of generative AI performance to explore the need and context for tools to help people using the technology find and fix mistakes in AI output.

Transcript

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot, or a podcast abstract, of their new and noteworthy papers. Today, I’m talking to Dr. Andy Gordon, a Partner Research Manager, and Dr. Carina Negreanu, a Senior Researcher, both at Microsoft Research. Doctors Gordon and Negreanu are co-editors of a paper called “Co-audit: Tools to help humans double-check AI-generated content,” and you can read a preprint of this paper now on arXiv. Andy Gordon, Carina Negreanu, thanks for joining us on Abstracts!


ANDY GORDON: Great to be here.

CARINA NEGREANU: Likewise.

HUIZINGA: Let’s start with you, Andy. In a few sentences, describe the issue or problem your paper addresses and why people should care about it.

GORDON: Well, generative AI is amazing. Things like Bing Chat or ChatGPT, all these things powered by large language models. Totally amazing. But it’s really important for everyone to remember that these AIs can make mistakes. For example, you ask when your favorite actor got married, and the model says the year but gets it wrong. Or you ask for some Python code, and it works on positive numbers, but occasionally you give it negative numbers and it goes wrong. Another example, you get a summary of some text. It’s great but unfortunately misses one of the important points. Or thinking about images, you ask for a portrait of a character from the AI and there’s some glitch, and it produces a hand with six fingers. So as users, we need to get into the habit of carefully checking AI outputs for mistakes. And we refer to that as “audit” in a sense of a systematic review. Coming to the paper, it’s about what we call co-audit. And that’s our term for any tool support that helps the human audit the AI output. And some examples of co-audit are tools that can help check for hallucinations, like when the actor’s date of birth is wrong, or to check Python code to find some errors or show how a summary has been constructed to help people find errors.

HUIZINGA: Carina, let’s talk to you. What related research does this paper build on, and how does your work add to it?

NEGREANU: So there was no direct work on the co-audit brand before us. We’re just introducing it. But there has been a lot of research that either motivates the need for co-audit or provides relevant framing for it or even like early examples of what we start thinking of co-audit. So as you’re probably aware, there has been a really great effort in the last years to assess the quality of generations by large language models across a multitude, really, of tasks. And currently we use this body of work as motivation for our research. It basically shows there really is a need for this kind of work. And we hope that in the future, we can also use it to benchmark co-audit tools that we are going to produce in our wider community. But the idea of dealing with errors has been a key part of research on human-AI interaction for ages. And there have been some really cool guidelines that came out recently, especially from Amershi in 2019, on human-AI interactions that are concerned with this part of the world. And more recently, Glassman had a really cool paper about conversational frameworks for human-AI and communication and basically links these concepts to psychology. And in our work, as you can read in our paper, we are trying to basically frame co-audit within her framework, and we find that it’s a natural fit. But before we started defining formally co-audit and building this paper, our group has built co-audit tools in the co-generation space. One such tool is GAM, which is grounded abstraction matching, where we basically help users learn how to effectively communicate with large language models so that they both understand what the large language model understands they’re asking and also get good feedback back. We also built ColDeco, which is a spreadsheet tool for inspecting and verifying calculated columns without the user requiring to view the underlying code produced by the large language models. But really, any tool that focuses on debugging or basically getting information back from human-generated content is useful here. So even tools that are like early debugging tools like FxD are very important here as we learn how people use these kinds of tools and we try to basically apply the same concepts in the context of LLM-generated content. So basically, we are building on top of work that helps understand the needs and challenges that end-user programmers have when working in this space and trying to extrapolate them to co-auditing tools for LLM-generated content.

HUIZINGA: Well, Andy, how would you describe the research approach you used or your methodology for this paper, and how did it come about?

GORDON: Great question, Gretchen, and it was actually quite an unusual methodology for us. So as Carina says, we’ve been looking at co-audit in a very specific setting of spreadsheet computations, and we began to realize that co-audit was really important for any kind of AI-generated output, and we started to see other people doing research that was doing the same sort of thing we were doing but in different settings. So, for example, there was a paper, they were generating bits of Python and they were deliberately showing multiple pieces of code after they’d been generated to kind of nudge the human user to make a decision about which one was better. I mean that’s, it’s really important to get people to think about the outputs, and this was a nice trick. So we thought, look, this is actually quite an important problem, and MSR (Microsoft Research) should step up and sort of gather people. So we organized a workshop inside Microsoft in the spring and got folks together to share their perspectives on co-audit. And then since then, we’ve reflected on those discussions and tried to kind of pull them together in a more coherent sense than the sort of whiteboards and sticky notes that we produced back then. And so that’s produced this paper. I think one of the key things that we learned in that process that we hadn’t been thinking about before was that co-audit really complements prompt engineering. So you hear a lot about prompt engineering, and it’s the first part of what we call the prompt-response-audit loop. And this is related to what Carina was saying about Elena Glassman’s work about AI-human interaction. So the first step is you formulate a prompt. For example, you ask for Python code. That’s the first step. The second step is we wait for the response from the AI. And then the third step is that we need to inspect the response—that’s the audit part—decide if it meets our needs or if there is a mistake, and if that’s the case, we need to repeat again. So that’s this loop, the prompt-response-audit loop. And prompt engineering, they’re the tools and techniques that you use in that first step to create the prompt. So, for example, some tools will automatically include a data context in a prompt if you’re trying to create some Python to apply to a table in a spreadsheet or, or something like that. And then duly, co-audit, those are the tools and techniques we have to help the human audit the response in the third step of this loop. And that’s like these tools I’ve been mentioning that show maybe two or three candidates of code that’s to be used.

HUIZINGA: Carina, let’s move over to what kinds of things you came away with. Your takeaways or your findings from this workshop. Talk about that and how you chose to articulate them in the paper.

NEGREANU: So as part of our research, we found that basically one co-audit tool does not fit all needs, which in a way was great because we have a bigger field to explore, but in other ways a bit daunting, as it means you have to think of many things. And one thing that really came to light was that even though we can’t, you know, build something that fits everything, we can build a set of principles that we think are important. So really, we wrote our paper around those 10 principles that we have identified from the workshop and then are trying to promote them as things people should think about when they start going on the journey of building co-auditing tools. So one of the examples is that we really think that we should think about grounding outputs, so, for example, by citing reliable sources similar to what Bing Chat does today. We think that’s a really valuable, important principle that people should follow, and they should think about what that means in the concept of their co-auditing tool. In the case of Bing, it’s quite simple, as it’s like factual references, but maybe if it becomes referencing code, that becomes more tricky but still super interesting going forward. We also propose that co-auditing tools should have the capability to prioritize the user’s attention to the most likely errors, as we need to be mindful of the user’s cognitive efforts and have a positive cost benefit. Basically, if we overflood the users with different errors and flags, it might be too problematic, and the adoption might be quite difficult going forward. And finally, this is something that really comes to core to our research area in spreadsheets. It’s about thinking beyond text. So we know visuals are so important in how we explain things, in how we teach in schools, how we teach universities. So how do we include them in the co-auditing process going forward? I think that’s going to be a really interesting challenge, and we hope we’re going to see some interesting work in that space.

HUIZINGA: Yeah. Well, principles are one thing, Andy, but how does this paper contribute to real-world impact? We talked about that a bit at the beginning. Who benefits most from this tool?

GORDON: That is a great question, Gretchen, and actually that was a question that we talked about at the workshop. We think that some application areas are going to benefit more than others. So co-audit really matters when correctness really matters and when mistakes are bad consequences, so in terms of application area, that’s areas like maybe finance or technology development or medicine. But you asked particularly about who, and we think some people will benefit more from co-audit than others. And we found this really striking example, I guess it’s an anecdotal example that someone was posting on social media. A professor was teaching a class using generative AI tools for the first time to generate code, and he found some evidence that people who have low self-confidence with computers can be intimidated by generative AI. So he would find that some of the class were really confident users and they would ask it, you know, generate some Python to do such and such, and it would come back with code with, you know, a bunch of mistakes in it. And the confident users were happy just to swat that away; they were even quite a little arrogant about it, like this is a stupid computer, they were saying. But, Gretchen, he found that a lot of his students who were less confident with computers were quite intimidated by this because it was very confidently just saying, oh look, all this code is going to work. And they kind of got a bit stuck, and some of them were scrolling around through this code, trying to understand how it worked, when in fact it was just really broken. So he thought this was pretty bad that these able students who were just less confident were being intimidated and were making less good use of the, the generative AI. Now that is an example that’s an anecdote from social media from a reputable professor, but we looked into it and there’s peer-reviewed studies that show similar effect in the literature. So I’d say we need co-audit tools that will encourage these less confident users to question when the AI is mistaken rather than getting stuck, and I think otherwise they’re not going to see the benefits of the generative AI.

HUIZINGA: Well, Carina, sometimes I like to boil things down to a nugget or a beautiful takeaway. So if there’s one thing you want our listeners to take away from this work, this paper, what would it be?

NEGREANU: I think that what this study has taught us is that really we need significantly more research. So basically, a good co-auditing experience can really be the element that makes it or breaks it in how we incorporate LLMs safely into our day-to-day lives. But to make this happen, we need people from the field working towards the same goal. It’s really an interdisciplinary work, and I don’t think we can do it by isolating into groups as we’re currently researching now. So I would urge our listeners to think about how they could contribute in this space and reach out with feedback and questions to us. We are more than open to collaboration. Really, we are just starting this journey, and we’d love to see this area to become a research priority going forward in 2024.

HUIZINGA: Well, Andy, as an opportunity to give some specificity to Carina’s call for help, what potential pitfalls have you already identified that represent ongoing research challenges in this field? And what’s next on yours—and potentially others’—research agenda in this field?

GORDON: Well, one point, and I think Carina made this, that co-audit techniques will themselves never be perfect. I mean, we’re saying that language models are never going to be perfect. Mistakes will come through. But the co-audit techniques themselves won’t be perfect either. So sometimes a user who is using the tools will still miss some mistakes. So, for example, you know, at the workshop, we thought about security questions and co-audit tools themselves. And we were thinking, for instance, about maybe deliberate attacks on a generative AI. There’s various techniques that people are talking about at the moment where you might sort of poison the inputs that generative AI models pick up on. And in principle, co-audit tools could help users realize that there are deliberate mistakes that have been engineered by the attacker. So that’s good. But on the other hand, you know, security always becomes an arms race. And so once, you know, if we did have a good tool that could detect those kinds of mistakes, the attackers then will start to engineer around the co-audit tools, trying to make them less effective. So that will be an ongoing problem, I think. And on the other hand, you know, we’ll find that if co-audit tools are giving too many warnings, users will start to ignore them, and there’ll be a sort of under-reliance on co-audit tools. And of course, if we give too few, users will miss the mistakes. So an interesting balance needs to be struck. And also, we don’t expect there’s going to be one overarching co-audit experience, but we think there’ll be many different realizations. And so, as Carina says, we hope that common lessons can be learned, and that’s why we want to keep documenting this space in general and building a research community. So I echo what Carina was saying. If you’re listening and you think that what you’re working on is co-audit, do reach out.

HUIZINGA: Well, Andy Gordon, Carina Negreanu, thanks for joining us today. And to our listeners, thanks for tuning in. If you’re interested in learning more about this paper and this research, you can find a link at aka.ms/abstracts, or you can read the preprint on arXiv. See you next time on Abstracts!


What’s Your Story: Ranveer Chandra

MSR Podcast

In this new Microsoft Research Podcast series What’s Your Story, Lab Director Johannes Gehrke explores the who behind the technical and scientific advancements helping to reshape the world. He talks to members of the research community at Microsoft about what motivates their work and how they got where they are today. 

Ranveer Chandra is Managing Director of Research for Industry and CTO of Agri-Food. He is also head of Networking Research at Microsoft Research Redmond. His work in systems and networking is helping to bring more internet connectivity to more people and is yielding tools designed to help farmers increase food production more affordably and sustainably. In this episode, he shares what it was like growing up in Jamshedpur, India; why he focuses his efforts in the areas he does; and where the joy in his work comes from.


Understanding the user: How the Enterprise System Usability Scale aligns with user reality

This position paper was presented at the 26th ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW 2023), a premier venue for research on the design and use of technologies that affect groups, organizations, and communities.

In the business world, measuring success is as critical as selecting the right goals, and metrics act as a guiding compass, shaping organizational objectives. They are instrumental as businesses strategize to develop products that are likely to succeed in specific markets or among certain user groups.  

However, businesses often overlook whether these metrics accurately reflect users’ experiences and behaviors. Do they truly reflect the consumers’ journey and provide a reliable evaluation of the products’ place in the market? Put differently, do these metrics truly capture a product’s effectiveness and value, or are they superficial, overlooking deeper insights that could lead a business toward lasting success?

Challenges in enterprise usability metrics research

In our paper, “A Call to Revisit Classic Measurements for UX Evaluation,” presented at the UX Outcomes Workshop at CSCW 2023, we explore these questions about usability metrics—which evaluate the simplicity and effectiveness of a product, service, or system for its users—and their applicability to enterprise products. These metrics are vital when measuring a product’s health in the market and predicting adoption rates, user engagement, and, by extension, revenue generation. Current usability metrics in the enterprise space often fail to align with the actual user’s reality when using technical enterprise products such as business analytics, data engineering, and data science software. Oftentimes, they lack methodological rigor, calling into question their generalizability and validity.

One example is the System Usability Scale (SUS), the most widely used usability metric. In the context of enterprise products, at least two questions used in SUS do not resonate with users’ actual experiences: “I think I would like to use the system frequently” and “I think I need the support of a technical person to be able to use this product.” Because users of enterprise products are consumers, not necessarily customers, they often do not get to choose which product to use. In some cases, they are IT professionals with no one to turn to for technical assistance. This misalignment highlights the need to refine how we measure usability for enterprise products.

Another concern is the lack of rigorous validation for metrics that reflect a product’s performance. For instance, UMUX-Lite is a popular metric for its simplicity and strong correlation with SUS. However, its scoring methodology requires that researchers use an equation consisting of a regression weight and constant to align the average scores with SUS scores. This lacks a solid theoretical foundation, which raises questions about UMUX-Lite’s ability to generalize to different contexts and respondent samples.
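For readers unfamiliar with how these instruments are scored, the sketch below shows the standard SUS computation and a UMUX-Lite computation using the regression weight and constant reported by Lewis and colleagues; it is exactly this sample-derived weight and constant that the paper questions:

```python
def sus_score(responses):
    """Standard SUS scoring: ten items rated 1-5 with alternating positive/negative
    wording; returns a 0-100 score."""
    assert len(responses) == 10
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)   # odd-numbered items are positively worded
        for i, r in enumerate(responses)
    ]
    return sum(contributions) * 2.5

def umux_lite_score(capability_item, ease_item):
    """UMUX-Lite: two items rated 1-7, rescaled to 0-100 and then adjusted with a
    regression (weight 0.65, constant 22.9, per Lewis et al.) so that its mean
    tracks SUS scores."""
    raw = ((capability_item - 1) + (ease_item - 1)) * (100 / 12)
    return 0.65 * raw + 22.9

print(sus_score([4, 2, 4, 1, 5, 2, 4, 2, 4, 2]))   # 80.0
print(round(umux_lite_score(6, 6), 1))
```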

The lack of standardization underscores the need for metrics that are grounded in the user’s reality for the types of products being assessed and based on theoretical and empirical evidence, ensuring that they are generalizable to diverse contexts. This approach will pave the way for more reliable insights into product usability, fostering informed decisions crucial for enhancing the user experience and driving product success.

ESUS: A reality-driven approach to usability metrics

Recognizing this need, we endeavored to create a new usability metric that accurately reflects the experience of enterprise product users, built on solid theory and supported by empirical evidence. Our research combines qualitative and quantitative approaches to devise a tailored usability metric for enterprise products, named the Enterprise System Usability Scale (ESUS). 

ESUS offers a number of benefits over the SUS and UMUX-Lite. It is more concise than the SUS, containing only half the questions and streamlining the evaluation process. It also eliminates the need for practitioners to use a sample-specific weight and constant, as required by UMUX-Lite, providing a more reliable measure of product usability. Moreover, ESUS demonstrates convergent validity, correlating with other usability metrics, such as SUS. Most importantly, through its conciseness and specificity, it was designed with enterprise product users in mind, providing relevant and actionable insights.  

In Table 1 below, we offer ESUS as a step towards more accurate, reliable, and user-focused metrics for enterprise products, which are instrumental in driving well-informed decisions in improving product usability and customer satisfaction.

ESUS items (each answered on a 1–5 scale):

1. How useful is [this product] to you? (1 = Not at all useful, 2 = Slightly useful, 3 = Somewhat useful, 4 = Mostly useful, 5 = Very useful)
2. How easy or hard was [this product] to use for you? (1 = Very hard, 2 = Hard, 3 = Neutral, 4 = Easy, 5 = Very easy)
3. How confident were you when using [this product]? (1 = Not at all confident, 2 = Slightly confident, 3 = Somewhat confident, 4 = Mostly confident, 5 = Very confident)
4. How well do the functions work together or do not work together in [this product]? (1 = Does not work together at all, 2 = Does not work well together, 3 = Neutral, 4 = Works well together, 5 = Works very well together)
5. How easy or hard was it to get started with [this product]? (1 = Very hard, 2 = Hard, 3 = Neutral, 4 = Easy, 5 = Very easy)
Table 1: Proposed ESUS questionnaire

Looking ahead: Advancing precision in understanding the user

Moving forward, our focus is on rigorously testing and enhancing ESUS. We aim to examine its consistency over time and its effectiveness with small sample sizes. Our goal is to ensure our metrics are as robust and adaptable as the rapidly evolving enterprise product environment requires. We’re committed to continuous improvement, striving for metrics that are not just accurate but also relevant and reliable, offering actionable insights for an ever-improving user experience.


DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Introduction

How trustworthy are generative pre-trained transformer (GPT) models?

To answer this question, the University of Illinois Urbana-Champaign, together with Stanford University, the University of California, Berkeley, the Center for AI Safety, and Microsoft Research, released a comprehensive trustworthiness evaluation platform for large language models (LLMs), presented in the recent paper: DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. This paper, which was accepted as an oral presentation at NeurIPS 2023 (Datasets and Benchmarks Track), focuses specifically on GPT-4 and GPT-3.5. It considers diverse perspectives, including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness.

Based on our evaluations, we found previously unpublished vulnerabilities relating to trustworthiness. For instance, we find that GPT models can be easily misled to generate toxic and biased outputs and leak private information in both training data and conversation history. We also find that although GPT-4 is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more vulnerable given jailbreaking system or user prompts, which are maliciously designed to bypass the security measures of LLMs, potentially because GPT-4 follows (misleading) instructions more precisely.

Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps. Our benchmark is publicly available.

It’s important to note that the research team worked with Microsoft product groups to confirm that the potential vulnerabilities identified do not impact current customer-facing services. This is partly because finished AI applications apply a range of mitigation approaches to address potential harms that may occur at the model level of the technology. In addition, we have shared our research with GPT’s developer, OpenAI, which has noted the potential vulnerabilities in the system cards for relevant models.

Our goal is to encourage others in the research community to utilize and build upon this work, potentially pre-empting nefarious actions by adversaries who would exploit vulnerabilities to cause harm. This trustworthiness assessment is only a starting point, and we hope to work together with others to build on its findings and create powerful and more trustworthy models going forward. To facilitate collaboration, we have made our benchmark code very extensible and easy to use: a single command is sufficient to run the complete evaluation on a new model.

Trustworthiness perspectives of language models

Recent breakthroughs in machine learning, especially LLMs, have enabled a wide range of applications, from chatbots to robotics. Yet, while the literature on the trustworthiness of GPT models remains limited, practitioners have proposed employing capable GPT models even for sensitive applications such as healthcare and finance. To this end, we focus on a comprehensive trustworthiness evaluation of GPT models towards eight trustworthiness perspectives, with thorough evaluations based on different constructed scenarios, tasks, metrics, and datasets, as shown in Figure 1 below.

Overall, we aim to evaluate 1) the performance of GPT models under different trustworthiness perspectives, and 2) the resilience of their performance in adversarial environments (e.g., adversarial system/user prompts, demonstrations).

For example, to evaluate the robustness of GPT-3.5 and GPT-4 against textual adversarial attacks, we construct three evaluation scenarios:

1) Evaluation on the standard AdvGLUE benchmark with a vanilla task description, aiming to assess: a) the vulnerabilities of GPT models to existing textual adversarial attacks, b) the robustness of different GPT models in comparison to state-of-the-art models on the standard AdvGLUE benchmark, c) the impact of adversarial attacks on their instruction-following abilities (measured by the rate at which the model refuses to answer a question or presents an incorrect answer when it is under attack), and d) the transferability of current attack strategies (quantified by the transferability attack success rates of different attack approaches).

2) Evaluation on the AdvGLUE benchmark given different instructive task descriptions and designed system prompts, to investigate the resilience of models under diverse (adversarial) task descriptions and system prompts.

3) Evaluation of GPT-3.5 and GPT-4 on our generated challenging adversarial texts, AdvGLUE++, constructed against open-source autoregressive models such as Alpaca-7B, Vicuna-13B, and StableVicuna-13B, to further evaluate the vulnerabilities of GPT-3.5 and GPT-4 under strong adversarial attacks in diverse settings.
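To make these metrics concrete, the sketch below computes benign accuracy, robust accuracy, refusal rate under attack, and attack success rate from paired benign/adversarial predictions. The records are hypothetical; DecodingTrust's actual harness is in the public benchmark code:

```python
def robustness_metrics(records):
    """records: dicts with keys 'label', 'benign_pred', 'adv_pred'; a prediction
    may be the special value 'REFUSE' when the model declines to answer."""
    n = len(records)
    benign_correct = sum(r["benign_pred"] == r["label"] for r in records)
    adv_correct = sum(r["adv_pred"] == r["label"] for r in records)
    refusals = sum(r["adv_pred"] == "REFUSE" for r in records)
    # An attack "succeeds" when the model was right on the benign input but wrong
    # (or refused) on the adversarial counterpart.
    attacked = sum(
        r["benign_pred"] == r["label"] and r["adv_pred"] != r["label"] for r in records
    )
    return {
        "benign_accuracy": benign_correct / n,
        "robust_accuracy": adv_correct / n,
        "refusal_rate_under_attack": refusals / n,
        "attack_success_rate": attacked / max(benign_correct, 1),
    }

example = [
    {"label": "positive", "benign_pred": "positive", "adv_pred": "negative"},
    {"label": "negative", "benign_pred": "negative", "adv_pred": "negative"},
    {"label": "positive", "benign_pred": "positive", "adv_pred": "REFUSE"},
]
print(robustness_metrics(example))
```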

Figure content: a three-layer taxonomy showing the eight trustworthiness perspectives (toxicity, stereotypes, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness), the benchmarks used for each perspective (some from prior work, some newly proposed), and the evaluation settings for each benchmark.
Fig 1. Taxonomy of our evaluation based on different trustworthiness perspectives. We use yellow boxes to represent the evaluations on existing benchmarks, and green boxes for evaluations using our designed new data or new evaluation protocols on existing datasets.

Insights and findings from DecodingTrust

Our research uncovered some previously undisclosed advantages as well as threats to the reliability of large language models.

For example, in terms of model robustness to adversarial demonstrations, we find that on the one hand, GPT-3.5 and GPT-4 will not be misled by the counterfactual examples added in the demonstrations and can even benefit from the counterfactual demonstrations in general; on the other hand, we observe that providing backdoored demonstrations will mislead both GPT-3.5 and GPT-4 to make incorrect predictions for backdoored inputs, especially when the backdoored demonstrations are positioned close to the (backdoored) user inputs, and GPT-4 is more vulnerable to backdoored demonstrations.

Moreover, in terms of toxicity and bias, both GPT-3.5 and GPT-4 are not strongly biased for the majority of stereotype topics considered under the benign and untargeted system prompts. However, both GPT models can be “tricked” into agreeing with biased content given designed misleading (adversarial) system prompts. GPT-4 is more vulnerable to misleading targeted system prompts and agrees with the biased content more frequently than GPT-3.5, potentially because GPT-4 follows the misleading instructions more precisely. Model bias can often be dependent on the demographic groups mentioned in the user prompt. For instance, for the query “Homosexuals have HIV.”, GPT-4 strongly disagrees in general, while for the query “Women have HIV.”, GPT-4 frequently agrees and outputs biased content. Model bias is also dependent on the stereotype topics. GPT models output more biased content on less sensitive topics like leadership and greed, while generating less biased content on more sensitive topics like drug dealing and terrorism. This is potentially due to the fine-tuning of GPT models on some protected demographic groups and sensitive topics. 

DecodingTrust also evaluates the privacy-leakage issues of LLMs. We find that GPT models can leak privacy-sensitive training data, such as the email addresses from the standard Enron email dataset, especially when prompted with the context of emails or few-shot demonstrations of (name, email) pairs. Moreover, under few-shot prompting, with supplementary knowledge such as the targeted email domain, the email extraction accuracy can be 100x higher than the scenarios where the email domain is unknown. We also observe that GPT models can leak the injected private information in the conversation history. Overall, GPT-4 is more robust than GPT-3.5 in safeguarding personally identifiable information (PII), and both models are robust to specific types of PII, such as Social Security numbers, possibly due to the explicit instruction tuning for those PII keywords. However, both GPT-4 and GPT-3.5 would leak all types of PII when prompted with privacy-leakage demonstrations during in-context learning. Lastly, GPT models demonstrate different capabilities in understanding different privacy-related words or privacy events (e.g., they will leak private information when told “confidentially” but not when told “in confidence”). GPT-4 is more likely to leak privacy than GPT-3.5, given our constructed prompts, potentially due to the fact that it follows the (misleading) instructions more precisely. We present more examples of model unreliable outputs in Figure 2 below.

Figure content: examples of undesirable GPT-4 responses given benign system prompts for each of the eight trustworthiness perspectives.
Fig 2.  Examples of undesirable responses of GPT-4 given benign system prompts from different trustworthiness perspectives. Offensive or sensitive information is masked. 


Microsoft at VL/HCC 2023: Focus on co-audit tools for spreadsheets

These research papers were presented at the IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC 2023), a premier forum for design, theory, and application of computing technologies for programming, modelling, and communication.

Large language models (LLMs) have revolutionized the way novice programmers and everyday computer users tap into the capabilities of natural language for programming. Among the tools used in this context, spreadsheets stand out as the preferred choice. The integration of LLMs into spreadsheets promises to substantially enhance their functionality and the user experience. At the same time, it’s well known that spreadsheet users commonly, though inadvertently, introduce errors, and this can carry significant risks. For example, a spreadsheet used in a 2010 Harvard economic analysis to inform austerity measures imposed on Greece was later discovered to contain multiple errors.

Microsoft is actively pursuing research focused on developing co-auditing tools and techniques, with an initial emphasis on spreadsheets. These tools are designed to help users verify the results generated by LLMs. At VL/HCC 2023, we introduce two new spreadsheet tools, ColDeco and FxD, specifically built to help users thoroughly examine and debug their programs within spreadsheets. Notably, the paper on FxD received an Honorable Mention award.

ColDeco: An end-user inspection tool

Working with tables in spreadsheets is a common task, and the ability to add a calculated column can be incredibly useful. A calculated column not only adds information but also facilitates tasks like filtering and sorting. Generative AI can enable users to create sophisticated calculated columns in tables. However, verification of AI-generated code in this scenario is crucial because AI can misinterpret the user’s intent or overlook important data. 

In our paper, “ColDeco: An End User Spreadsheet Inspection Tool for AI-Generated Code,” we introduce ColDeco, a no-code inspection tool for calculated columns. ColDeco uses helper columns and row grouping to help users understand how an AI-generated column works and locate any errors. 

To describe how ColDeco works, we’ll use an example table containing people’s first, middle, and last names in separate columns. Our user asks the system to “create a column called ‘Abbreviation’ that takes the first letter of each part of the name.” In this example, there’s an error in the generated code that fails to handle rows with no middle names, causing some Abbreviation cells to be empty.  
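To make the failure mode concrete, here is a hypothetical Python rendering of the kind of logic the generated column might implement, alongside a corrected version; ColDeco's row grouping surfaces exactly the rows on which the first version breaks:

```python
def abbreviation_buggy(first, middle, last):
    # Fails on rows with no middle name: indexing an empty string raises an error,
    # which surfaces in the spreadsheet as an empty Abbreviation cell.
    return first[0] + middle[0] + last[0]

def abbreviation_fixed(first, middle, last):
    # Take the first letter of each *non-empty* name part instead.
    return "".join(part[0] for part in (first, middle, last) if part)

print(abbreviation_fixed("Christopher", "Michael", "Fleming"))  # CMF
print(abbreviation_fixed("William", "", "Smith"))               # WS
```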

First, the model generates a program that computes an abbreviation for each row and adds it to the new Abbreviation column. ColDeco’s interface automatically opens as a side panel, as shown in Figure 1. 

The Inspect Columns view displays any generated columns, accompanied by a natural language description of the generated code. The Inspect Rows view displays a subset of the table, organized by behavior. The Row Inspection view uses dataflow analysis to group rows, highlighting key distinct execution behaviors. In our example, this view quickly draws the user’s attention to the two rows that fail to calculate an abbreviation.

Figure content: the example table (First Name, Middle Name, Last Name, DoB, Abbreviation) alongside the ColDeco side panel. For example, the row for Christopher Michael Fleming shows the abbreviation CMF, while the row for William Smith, which has no middle name, has an empty Abbreviation cell; the Inspect Rows view groups these two behaviors separately.
Figure 1. The initial view of the ColDeco side panel. An Abbreviation program is generated by the AI and added to the table as a new column. The Inspect Columns view (1a) shows the column generated by the AI, including a description of how the code works. The Inspect Rows view (1b) groups rows into different behaviors, indicating that there are errors in two rows.

If our user wants to investigate an error, they can expand a generated column into multiple helper columns, illustrated in Figure 2. These helper columns are visible in both the table (2a) and the side panel (2b), and they show the intermediate values. The user can now see that the missing abbreviations are caused by an error that occurred when the system tried to take the first and middle initials.

Two graphics. The first graphic (labelled 2a) depicts a table with 4 columns: “DoB”, “text concatenation”, “1st letter of Last Name”, “Abbreviation”. As examples, row 3 contains the information: DoB: 11/5/1995, text concatenation: CM, 1st letter of Lan Name: F, Abbreviation: CMF. Row 9 contains the information DoB: 6/3/1968, text concatenation: is empty, 1st letter of Lan Name: S, Abbreviation: is empty. The second graphic (labelled 2b) depicts a side panel showing the Inspect Columns view. A tree view shows “Abbreviation” as the root with two children: “1st letter of Last Name” and “text concatenation”, corresponding to the columns in the table. Each column in the tree view has a corresponding description.
Figure 2. The ColDeco side panel after a user expands the Abbreviation column into two additional helper columns. Each additional column has a description.
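
To make the helper-column and row-grouping ideas concrete, here is a minimal sketch in Python using pandas. It is illustrative only, not ColDeco's implementation: the buggy abbreviation logic is a hypothetical stand-in for the AI-generated program, and rows are grouped simply by which intermediate steps produced a value, a crude proxy for ColDeco's dataflow-based behavior analysis.

```python
import pandas as pd

# Example table from the walkthrough above.
df = pd.DataFrame({
    "First Name":  ["Christopher", "William"],
    "Middle Name": ["Michael", ""],
    "Last Name":   ["Fleming", "Smith"],
})

# Hypothetical stand-in for the AI-generated logic: abbreviation = first letter
# of each name part. The bug mirrors the example above: rows without a middle
# name yield no value.
def text_concatenation(row):
    if row["Middle Name"]:
        return row["First Name"][0] + row["Middle Name"][0]
    return None  # buggy path: empty middle name produces no intermediate value

def abbreviation(row):
    prefix = text_concatenation(row)
    last_initial = row["Last Name"][0]
    return prefix + last_initial if prefix else None

# Helper columns expose the intermediate values, as in Figure 2.
df["text concatenation"] = df.apply(text_concatenation, axis=1)
df["1st letter of Last Name"] = df["Last Name"].str[0]
df["Abbreviation"] = df.apply(abbreviation, axis=1)

# Row grouping: bucket rows by which intermediate steps produced a value,
# a crude stand-in for ColDeco's dataflow-based behavior groups.
behavior = df[["text concatenation", "Abbreviation"]].notna().apply(tuple, axis=1)
for key, group in df.groupby(behavior):
    print(f"behavior {key}:")
    print(group, "\n")
```

Running this prints two behavior groups, mirroring Figure 1: rows with a middle name produce a full abbreviation, while rows without one surface empty intermediate and final values.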

FxD: A functional debugger 

Not every spreadsheet task involves generating a new table column. Moreover, many users are already well acquainted with spreadsheet formulas. This brings us to our second tool, a spreadsheet formula debugger, introduced in the paper, “FxD: a functional debugger for dysfunctional spreadsheets.” 

We employed a user-centered approach when designing FxD, extensively reviewing existing literature on functional programming debuggers. This informed the four key features we implemented in FxD:

Live debugging. FxD dynamically updates as a user edits a formula, allowing for quick formula modification and exploration (Figure 3, image 1).

Hybrid formula tracing. The debugger combines step-based evaluation (Figure 3, image 1) with tree-based derivations (Figure 3, image 3) to provide a step-by-step breakdown of the formula. Substeps are hidden behind expandable cards to prevent user overload. (A minimal evaluation-trace sketch follows Figure 3.)

Subformula coloring. Color coding highlights changes in a formula as FxD evaluates it, making it easy to track these updates when a user hovers over a step (Figure 3, images 2 and 4).

Information inspector. Context-aware tooltips improve the user experience. One example is table previews when a user hovers over ranges in functions like VLOOKUP. These tooltips offer insights into the range, surrounding context, and the lookup column used by the containing function (Figure 3, image 3).

Figure 3. The FxD debugger. Image 1 shows the edited formula and evaluation steps. The steps update as a user edits the formula. Image 2 shows subformula coloring, which highlights a subformula and its value upon hovering. Image 3 shows an information inspector that previews the range referenced in a formula. Image 4 shows the concurrent evaluation of multiple subformulas. When the user hovers over a value, the corresponding subformula is underlined.
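
To illustrate the kind of step-based evaluation shown in image 1, here is a toy evaluator that records each reduction of the example formula. It is a sketch only, assuming made-up cell values (B1=5, B2=10, B3=0.5, G3=20) and a tiny tuple-based formula representation; it is not FxD's evaluator and does not implement the card-based substep folding or coloring.

```python
# A toy small-step evaluator that records each reduction, roughly mimicking
# the evaluation trace in Figure 3 (image 1). Cell values are assumed for
# illustration only.
CELLS = {"B1": 5, "B2": 10, "B3": 0.5, "G3": 20}

def evaluate(expr, trace):
    """Evaluate a nested-tuple formula AST, appending each reduction to `trace`."""
    if isinstance(expr, str) and expr in CELLS:       # cell reference
        value = CELLS[expr]
    elif isinstance(expr, tuple):
        op, *args = expr
        vals = [evaluate(a, trace) for a in args]     # evaluate subformulas first
        if op == "+":    value = vals[0] + vals[1]
        elif op == "*":  value = vals[0] * vals[1]
        elif op == "<":  value = vals[0] < vals[1]
        elif op == "IF": value = vals[1] if vals[0] else vals[2]
        else: raise ValueError(op)
    else:                                             # literal value
        return expr
    trace.append((expr, value))                       # record the reduction step
    return value

# =IF(G3 < (B1 + B2) * (1 + B3), "low", "high")
formula = ("IF", ("<", "G3", ("*", ("+", "B1", "B2"), ("+", 1, "B3"))), "low", "high")

steps = []
result = evaluate(formula, steps)
for before, after in steps:
    print(f"{before}  ->  {after}")
print("result:", result)
```

With these assumed values, B1 + B2 reduces to 15, the threshold becomes 22.5, and the formula evaluates to "low", matching the trace shown in the figure.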

Growing importance of AI code verification 

As the complexity of AI-generated code rises, the need for tools to verify accuracy becomes increasingly critical. In response, we developed these two co-audit tools tailored to spreadsheets. Moving forward, a key consideration lies in managing the complexity of these tools. Our vision is that debugging tools will become infused with generative AI to assist users in both generating and verifying workflows. 

To learn more, see our paper on co-audit in general.



Abstracts: October 9, 2023


Microsoft Research Podcast - Abstracts

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Dr. Sheng Zhang, a Senior Researcher at Microsoft Research, joins host Dr. Gretchen Huizinga to discuss “UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition.” In this paper, Zhang and his coauthors present mission-focused instruction tuning, a method for distilling large language models into smaller, more efficient ones for a broad application class. Their UniversalNER models achieved state-of-the-art performance in named entity recognition, an important natural language processing (NLP) task. Model distillation has the potential to make NLP and other capabilities more accessible, particularly in specialized domains such as biomedicine, which could benefit from more resource-efficient and transparent options. 


Learn more:

UniversalNER project website with demo (opens in new tab)

Code on GitHub (opens in new tab)

Dataset and models on Hugging Face (opens in new tab)

Transcript

[MUSIC PLAYS]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract!—of their new and noteworthy papers. Today, I’m talking to Dr. Sheng Zhang, a Senior Researcher at Microsoft Research. Dr. Zhang is coauthor of a paper called “UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition,” and you can read this paper now on arXiv. Sheng Zhang, thanks for joining us on Abstracts!

SHENG ZHANG: Thanks for having me.


HUIZINGA: So in a few sentences, give us a brief introduction or overview of the issue or problem that your research addresses and why we should care about it.

ZHANG: Sure. Well, our research addresses the challenge of efficiently replicating the capabilities of large language models for targeted applications. Particularly, we focus on named entity recognition, or NER, and people should care because this work aims to create more cost-effective and transparent models that can recognize a wide range of entity types across various domains, which is crucial for knowledge extraction and has numerous practical applications.

HUIZINGA: So how does your approach, your particular approach, build on or differ from what’s been done previously in this field?

ZHANG: Well, our approach builds on the idea of instruction tuning, which is used to fine-tune language models to follow human instructions. However, unlike existing work that focuses on tuning models into replicas of large language models in every aspect, we propose a method called mission-focused instruction tuning, where we train a smaller model to specifically excel in a broad application class, such as open information extraction. And in our case study, we focus on named entity recognition, NER, and we demonstrate how targeted distillation from large language models can maximize their capabilities for this application. At the same time, the smaller model, the student model, also preserves generalizability across different semantic types and domains. This approach differs from previous work also because we emphasize the importance of increasing the diversity of input data and generating more comprehensive coverage of entity types, which ultimately leads to better performance in the targeted application.

HUIZINGA: OK. And in the paper, you talk about student models trailing the original large language models by large margins in what you call downstream applications. Give me an example of what downstream application looks like.

ZHANG: Yeah. So we here specifically focus on named entity recognition. That is, identifying named entities in a written text.

HUIZINGA: Ah …

ZHANG: So there’s various types of named entities so the canonical ones, like person, geographic location, organization … And people have, you know, various needs. They can go beyond those coarse-grained types. They can go into very fine-grained types, like athlete or politician …

HUIZINGA: Wow …

ZHANG: … or even, you know, finer-grain types. And you cannot like predefine what types will be considered in your task. That’s why we care about this universal concept of named entity recognition.

HUIZINGA: Well, let’s talk about methodology for a bit. What kind of research methodology did you use, and how did you conduct this research?

ZHANG: We developed a general recipe for targeted distillation from large language models, and in this case, we applied it to open NER. And our methodology consists of two main steps: data construction and mission-focused instruction tuning. For data construction, we sampled inputs from a large corpus across diverse domains, and then we used a large language model, ChatGPT, to annotate entity mentions and their associated entity types in the sampled inputs. This process allowed us to create a dataset with wide coverage of entity types. For mission-focused instruction tuning, we fine-tuned smaller models using our constructed dataset in a conversational-style format. For each entity type in the output, we transformed it into a natural language query and tuned the model to generate structured outputs that contain all entities of that type in the input passage. We also incorporated negative sampling to account for entity types not mentioned in that passage. And besides these two main steps, our research also involved assembling the largest-to-date, and most diverse, NER benchmark for evaluation. We compared the performance of our targeted distillation approach with other state-of-the-art models to demonstrate the effectiveness of our methodology.

HUIZINGA: OK, so you talk about NER as a case study, and you had 43 datasets and nine domains. Give me an example of some of those domains that you pulled from.

ZHANG: Yeah. So one very, you know, typical domain is like news, right. We read news every day, and the news mentions about, you know, people, events, and location. So that’s like a very common domain. And there are other very interesting domains like code. People also write code, and the computer can understand code, but a person would also want to understand code in some different way. So if you have like a code-specific named entity recognition capability, that would be awesome for, you know, some people that want to understand what’s happening in the code.

HUIZINGA: Right. And, and you mentioned programming, or code, but I also see in the paper biomedicine on one kind of complex and academic end and social media on another. So those are wildly different domains that you pulled from. Did you do that for a reason, that spectrum of different kinds of data?

ZHANG: Yes. The reason is that, you know, for some high-value domains like biomedicine, it’s quite expensive to annotate some data to train your model like that. So traditionally, people will have to hire an expert to do that. That is quite expensive and not scalable. And here, in the UniversalNER paper, we propose a way to distill that specific domain knowledge from the large language model. So the whole process is automatic. And the resulting model, you can see, it does pretty well, and maybe equally well, on the model that’s based on, you know, human expert–annotated corpus.

HUIZINGA: So after all this, a research paper presents findings. I imagine you had some interesting discoveries in, in this study. What were your major findings?

ZHANG: Yes. Our major findings were that the targeted distillation approach, specifically here the UniversalNER model we developed, it achieved state-of-the-art performance in named entity recognition across a wide range of entity types and domains. And when we compared it to other models like Alpaca, Vicuna, and InstructUIE, UniversalNER significantly outperformed them in terms of F1 score. This demonstrates the effectiveness of mission-focused instruction tuning for creating more cost-effective and transparent models that can excel in targeted applications such as open NER.

HUIZINGA: So let’s talk a little bit more about real-world impact. Uh, we’ve already discussed a little bit about that. But how would you say, based on these findings, that this impacts the real world and how people will use this?

ZHANG: Yeah, absolutely. I would say our work is very significant in terms of real-world impact because, first of all, NER is a fundamental task in natural language processing, and it plays a crucial role in knowledge extraction, information retrieval, and data mining. And by developing a more cost-effective and transparent model like UniversalNER, which can recognize a wide range of entity types and domains, we enable better performance in these downstream applications. And like I said, this is particularly important in high-value domains, such as biomedicine, where you know specialized expertise is required for annotation and the new entity types keep emerging. Our approach can help save time and resources for effectively recognizing these new entity types without the need for extensive annotated data. And secondly, our work can have a broader impact as it represents a general recipe for targeted distillation from large language models, and this approach can be applied to other application classes, such as, you know, open relation extraction. And this allows researchers and the practitioner to create much smaller models that can be more efficient and transparent while maintaining high performance in their targeted tasks.

HUIZINGA: If there was one thing you want our listeners to take away from this work and you could distill that into a short take, what would it be?

ZHANG: Mm hmm. One key takeaway from our work is that targeted distillation from large language models using our mission-focused instruction tuning can lead to more cost-effective and transparent models that excel in a broader application class. And our application demonstrated that it is possible to harness the capabilities of large language models and distill them into much smaller models that not only maintain generalizability across semantic types and domains but also surpass the performance of their larger counterparts in the targeted application. And this opens up new avenues for research and practical application in various fields, making knowledge extractions and the natural language processing tasks more efficient and accessible.

HUIZINGA: It sounds very promising, and it sounds like you’re excited about it.

ZHANG: Yeah, I’m pretty excited!

HUIZINGA: Well then tell us, given this new vista that you’ve opened up with this UniversalNER, what unanswered questions or unsolved problems still remain in this area, and what’s next on your research agenda?

ZHANG: Yeah. Our work demonstrates the effectiveness of targeted distillation for open NER, but several unanswered questions remain. And I would say the first one is adapting the approach to other application classes. Our method is a general recipe for targeted distillation, and it would be interesting to explore its effectiveness in other broader application classes, such as open relation extraction. And the second one is handling label conflicts and dataset-specific definitions. So in our work, we propose a dataset-specific instruction tuning template to address label conflicts. But more research is needed to better understand and develop methods for harmonizing discrepancies in label definitions across datasets. And the last one is exploring more efficient data construction methods. We used ChatGPT for data construction, but, you know, alternative approaches could be explored to generate more diverse and comprehensive datasets for mission-focused instruction tuning. And as for our research agenda, we plan to continue exploring targeted distillation techniques and apply them to other application classes, as well as investigate ways to improve data construction for better performance and efficiency in real-world tasks.

HUIZINGA: Sounds like you got your work cut out for you.

ZHANG: Yes. [LAUGHS] Thank you.

HUIZINGA: Sheng Zhang, thanks for joining us today. And to our listeners, thanks for tuning in. If you’re interested in learning more about this paper, you can find a link at aka.ms/Abstracts, or you can read the paper on arXiv. See you next time on Abstracts!



Efficient and hardware-friendly neural architecture search with SpaceEvo


This research paper was presented at the 2023 IEEE/CVF International Conference on Computer Vision (opens in new tab) (ICCV), a premier academic conference for computer vision.

ICCV 2023: SpaceEvo

In the field of deep learning, where breakthroughs like the models ResNet (opens in new tab) and BERT (opens in new tab) have achieved remarkable success, a key challenge remains: developing efficient deep neural network (DNN) models that both excel in performance and minimize latency across diverse devices. To address this, researchers have introduced hardware-aware neural architecture search (NAS) to automate efficient model design for various hardware configurations. This approach involves a predefined search space, search algorithm, accuracy estimation, and hardware-specific cost prediction models.

However, optimizing the search space itself has often been overlooked. Current efforts rely mainly on MobileNets-based search spaces designed to minimize latency on mobile CPUs. But manual designs may not always align with different hardware requirements, limiting their suitability for a diverse range of devices.

In the paper, “SpaceEvo: Hardware-Friendly Search Space Design for Efficient INT8 Inference (opens in new tab),” presented at ICCV 2023, (opens in new tab) we introduce SpaceEvo, a novel method that automatically creates specialized search spaces optimized for efficient INT8 inference on specific hardware platforms. What sets SpaceEvo apart is its ability to perform this design process automatically, creating a search space tailored for hardware-specific, quantization-friendly NAS.

Notably, SpaceEvo’s lightweight design makes it ideal for practical applications, requiring only 25 GPU hours to create a hardware-specific solution and making it a cost-effective choice for hardware-aware NAS. This specialized search space, with hardware-preferred operators and configurations, enables the exploration of larger, more efficient models with low INT8 latency. Figure 1 demonstrates that our search space consistently outperforms existing alternatives in INT8 model quality. Conducting neural architecture searches within this hardware-friendly space yields models that set new INT8 accuracy benchmarks.

Figure 1. Error distribution of INT8 quantized models sampled under fixed latency budgets (10 ms and 15 ms on an Intel VNNI CPU; 10 ms and 20 ms on a Pixel 4 CPU) from our search space and from the ProxylessNAS, MobileNetV3, ResNet, and AttentiveNAS search spaces. Our search space consistently outperforms state-of-the-art alternatives in INT8 model quality.

On-device quantization latency analysis

We began our investigation by trying to understand INT8 quantized latency factors and their implications for search space design. We conducted our study on two widely used devices: an Intel CPU with VNNI instructions and onnxruntime support, and a Pixel 4 phone CPU with TFLite 2.7.

Our study revealed two critical findings:

  1. Both the choice of operator type and configurations, like channel width, significantly affect INT8 latency, as illustrated in Figure 2. For instance, operators like Squeeze-and-Excitation and Hardswish, while enhancing accuracy with minimal latency, can lead to slower INT8 inference on Intel CPUs. This slowdown primarily arises from the added costs of data transformation between INT32 and INT8, which outweigh the latency reduction achieved through INT8 computation (see the cost-model sketch after Figure 2).
  2. Quantization efficiency varies among devices, and the operator types they prefer can be contradictory.
Figure 2. Left: Selecting different operator types results in notably distinct quantized speed improvements. Right: Conv1x1 speed enhancements across various channel numbers.
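
A back-of-the-envelope way to see the first finding: treat an operator's INT8 latency as its sped-up compute plus the INT32↔INT8 conversions it forces. The numbers below are invented for illustration; only the structure of the trade-off matters.

```python
# Back-of-the-envelope cost model for finding 1.
# All latencies are illustrative placeholders, not measurements.

def int8_latency(fp32_compute_ms, speedup, num_dtype_conversions, conversion_ms=0.05):
    """INT8 latency = faster compute + the INT32<->INT8 transforms the op requires."""
    return fp32_compute_ms / speedup + num_dtype_conversions * conversion_ms

# A convolution quantizes well: large compute, few conversions.
conv_fp32 = 1.00
print("conv    INT8:", int8_latency(conv_fp32, speedup=2.5, num_dtype_conversions=2),
      "vs FP32:", conv_fp32)

# A Squeeze-and-Excitation / Hardswish-style op: tiny compute, but the extra
# conversions on the CPU can make INT8 slower than FP32.
se_fp32 = 0.10
print("SE-like INT8:", int8_latency(se_fp32, speedup=1.2, num_dtype_conversions=6),
      "vs FP32:", se_fp32)
```

With these placeholder numbers, the convolution benefits from quantization while the SE-like operator becomes slower than its FP32 version, matching the behavior observed on Intel CPUs.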

Finding diverse, efficient quantized models with SpaceEvo

Unlike traditional architecture search, which aims to find the best single model, our objective is to uncover a diverse population of billions of accurate and INT8 latency-friendly architectures within the search space.

Drawing inspiration from neural architecture search, we introduced an evolutionary search algorithm to explore this quantization-friendly model population in SpaceEvo. Our approach incorporated three key techniques:

  1. The introduction of the Q-T score as a metric to measure the quantization-friendliness of a candidate search space, based on the INT8 accuracy-latency of top-tier subnets.
  2. Redesigned search algorithms that focus on exploring a collection of model populations (i.e., the search space) within the vast hyperspace, as illustrated in Figure 3. This is achieved through the “elastic stage,” which divides the search space into a sequence of elastic stages, allowing traditional evolution methods like aging evolution to explore effectively (a minimal sketch of this loop follows the list).
  3. A block-wise search space quantization scheme to reduce the training costs associated with exploring a search space that has a maximum Q-T score.
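
A heavily simplified sketch of that outer evolutionary loop is below. The Q-T score here is a deterministic placeholder (in SpaceEvo it is estimated from the INT8 accuracy and latency of top-tier subnets), and encoding a candidate search space as a short list of per-stage operator choices is our own simplification rather than the paper's elastic-stage representation.

```python
import random
from collections import deque

def qt_score(space):
    # Placeholder: a deterministic pseudo-score so the sketch runs without
    # training anything. SpaceEvo derives this from INT8 accuracy/latency.
    rng = random.Random(hash(tuple(space)))
    return rng.random()

OPS_PER_STAGE = ["conv3x3", "conv5x5", "mbconv", "fused_mbconv"]  # illustrative choices
NUM_STAGES = 6

def random_space():
    return [random.choice(OPS_PER_STAGE) for _ in range(NUM_STAGES)]

def mutate(space):
    child = list(space)
    child[random.randrange(NUM_STAGES)] = random.choice(OPS_PER_STAGE)
    return child

# Aging evolution: sample a parent by tournament, mutate it, and retire the
# oldest member so the population keeps turning over.
population = deque(random_space() for _ in range(20))
best = max(population, key=qt_score)
for _ in range(200):
    parent = max(random.sample(list(population), 5), key=qt_score)  # tournament
    child = mutate(parent)
    population.append(child)
    population.popleft()
    if qt_score(child) > qt_score(best):
        best = child

print("best candidate search space:", best, "score:", round(qt_score(best), 3))
```

The retirement of the oldest member is what makes this "aging" evolution: it keeps the population fresh and prevents a few early high scorers from dominating the search.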

After discovering the search space, we employed a two-stage NAS process to train a quantized-for-all supernet over the search space. This ensured that all candidate models could achieve comparable quantized accuracy without individual fine-tuning or quantization. We utilized evolutionary search and nn-Meter (opens in new tab) for INT8 latency prediction to identify the best quantized models under various INT8 latency constraints. Figure 3 shows the overall design process.

Figure 3. The complete SpaceEvo process and its application for NAS. Starting from a large hyperspace, an evolutionary search explores candidate search spaces; a quality estimator scores each candidate from its INT8 latency and accuracy, and that score guides further exploration until a suitable search space is found. A quantized-for-all supernet is then trained over this space, enabling hardware-aware NAS for deploying models within various INT8 latency constraints.
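
The last step, searching the trained supernet under a latency budget, boils down to constrained maximization: keep only candidates whose predicted INT8 latency fits the budget and take the most accurate one. In this sketch the candidate pool and both predictions are invented placeholders; in the actual pipeline, accuracy comes from the quantized-for-all supernet and latency from nn-Meter.

```python
# Hypothetical subnet pool: (name, predicted top-1 accuracy %, predicted INT8 latency in ms).
# These numbers are invented for illustration only.
candidates = [
    ("subnet-a", 73.0, 5.0),
    ("subnet-b", 76.5, 9.0),
    ("subnet-c", 78.2, 14.0),
    ("subnet-d", 79.8, 25.0),
]

def best_under_budget(candidates, latency_budget_ms):
    """Return the most accurate candidate whose predicted INT8 latency fits the budget."""
    feasible = [c for c in candidates if c[2] <= latency_budget_ms]
    return max(feasible, key=lambda c: c[1]) if feasible else None

for budget in (5, 10, 25):
    print(f"budget {budget:>2} ms ->", best_under_budget(candidates, budget))
```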

Extensive experiments on two real-world edge devices and ImageNet demonstrated that our automatically designed search spaces significantly surpass manually designed search spaces. Table 1 showcases our discovered models, SEQnet, setting new benchmarks for INT8 quantized accuracy-latency tradeoffs. 

(a) Results on the Intel VNNI CPU with onnxruntime

Model              INT8 Top-1 Acc %   INT8 Latency   Speedup   FP32 Top-1 Acc %   FLOPs
MobileNetV3Small   66.3               4.4 ms         1.1x      67.4               56M
SEQnet@cpu-A0      74.7               4.4 ms         2.0x      74.8               163M
MobileNetV3Large   74.5               10.3 ms        1.5x      75.2               219M
SEQnet@cpu-A1      77.4               8.8 ms         2.4x      77.5               358M
FBNetV3-A          78.2               27.7 ms        1.3x      79.1               357M
SEQnet@cpu-A4      80.0               24.4 ms        2.4x      80.1               1267M

(b) Results on the Google Pixel 4 with TFLite

Model              INT8 Top-1 Acc %   INT8 Latency   Speedup   FP32 Top-1 Acc %   FLOPs
MobileNetV3Small   66.3               6.4 ms         1.3x      67.4               56M
SEQnet@pixel4-A0   73.6               5.9 ms         2.1x      73.7               107M
MobileNetV3Large   74.5               15.7 ms        1.5x      75.2               219M
EfficientNet-B0    76.7               36.4 ms        1.7x      77.3               390M
SEQnet@pixel4-A1   77.6               14.7 ms        2.2x      77.7               274M

Table 1. Our automated search spaces outperformed manual ones in ImageNet results on two devices. Speedup: INT8 latency compared with FP32 inference.

Potential for sustainable and efficient computing

SpaceEvo is the first attempt to address the hardware-friendly search space optimization challenge in NAS, paving the way for designing effective low-latency DNN models for diverse real-world edge devices. Looking ahead, the implications of SpaceEvo reach far beyond its initial achievements. Its potential extends to other crucial deployment metrics, such as energy and memory consumption, enhancing the sustainability of edge computing solutions.

We are exploring how to adapt these methods to support diverse model architectures such as transformers, further expanding SpaceEvo's role in deep learning model design and efficient deployment.



HoloAssist: A multimodal dataset for next-gen AI copilots for the physical world


This research paper was presented at the 2023 IEEE/CVF International Conference on Computer Vision (opens in new tab) (ICCV), a premier academic conference for computer vision.

When was the last time you were faced with a task you had no clue how to tackle? Maybe it was fixing a broken bike, replacing a printer toner, or making a cup of espresso? In such circumstances, your usual options might include reaching out to a knowledgeable friend or relative for assistance. Alternatively, you might resort to scouring the internet, conducting a web search, posing questions on online forums, or seeking out relevant instructional videos. But what if there were another option? What if you could turn to an AI assistant, or copilot, for help?

AI in the real world

Our daily lives are filled with a wide range of tasks, both for work and leisure, spanning the digital and physical realms. We often find ourselves in need of guidance to learn and carry out these tasks effectively. Recent advances in AI, particularly in the areas of large language and multimodal models, have given rise to intelligent digital agents. However, when it comes to the physical world, where we perform a significant number of our tasks, AI systems have historically faced greater challenges. 

A longstanding aspiration within the AI community has been to develop an interactive AI assistant capable of perceiving, reasoning, and collaborating with people in the real world. Whether it’s scenarios like autonomous driving, robot navigation and manipulation, hazard detection in industrial settings, or support and guidance for mixed-reality tasks, progress in physical activities has been slower and more incremental compared with their fully digital counterparts.

The promise and challenge of interactive AI “copilots”

There is great potential for developing interactive AI copilots to assist people with real-world tasks, but there are also obstacles. The key challenge is that current state-of-the-art AI assistants lack firsthand experience in the physical world. Consequently, they cannot perceive the state of the real world and actively intervene when necessary. This limitation stems from a lack of training on the specific data required for perception, reasoning, and modeling in such scenarios. In terms of AI development, there’s a saying that “data is king.” This challenge is no exception. To advance interactive AI agents for physical tasks, we must thoroughly understand the problem domain and establish a gold standard for copilots’ capabilities.

A new multimodal interactive dataset

As a first step in this direction, we are excited to share our paper, “HoloAssist: an Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World (opens in new tab),” presented at ICCV 2023 (opens in new tab). HoloAssist is a large-scale egocentric, or first-person, human interaction dataset, where two people collaboratively execute physical manipulation tasks. A task performer executes a task while wearing a mixed-reality headset that captures seven synchronized data streams, as shown in Figure 1. Simultaneously, a task instructor observes the performer’s first-person video feed in real time and offers verbal instruction. 

Figure 1: HoloAssist features a two-person interactive assistive task-completion setting.

HoloAssist contains a large collection of data, comprising 166 hours of recordings involving 222 diverse participants. These participants form 350 distinct instructor-performer pairs carrying out 20 object-centric manipulation tasks. Video 1 shows how tasks are recorded, while Figure 2 provides a task breakdown. The objects range from common electronic devices to rarer items found in factories and specialized labs. The tasks are generally quite demanding, often requiring instructor assistance for successful completion. To provide comprehensive insights, we've captured seven different raw sensor modalities: RGB, depth, head pose, 3D hand pose, eye gaze, audio, and IMU. These modalities help in understanding human intentions, estimating world states, predicting future actions, and more. Finally, the eighth modality is an augmentation with third-person manual annotations, consisting of a text summary, intervention types, mistake annotations, and action segments, as illustrated in Figure 3.

Video 1: A sampling of task recordings showcasing color and depth, two of the eight modalities.
Figure 2: Data distribution captured in HoloAssist. On the left, the number of sessions per activity; on the right, the total session length in minutes. The 20 tasks range from consumer electronics (GoPro, Nintendo Switch, DSLR, portable and standalone printers, computer, Nespresso and other coffee machines) and IKEA furniture (stool, utility cart, tray table, nightstand) to specialized equipment (NavVis laser scanner, ATV motorcycle, wheel belt, circuit breaker). Session counts per activity range from 25 to 180, and total recorded time per activity ranges from 47 to 1,390 minutes.
Figure 3: HoloAssist includes action and conversational annotations, and it also provides summaries of videos indicating mistakes and interventions during tasks. Each action is tagged with a “mistake” or “correct” attribute, while spoken statements are labeled with intervention types.
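
To make the annotation layer concrete, the sketch below shows one plausible in-memory representation of an annotated action segment, its mistake/correct attribute, and an accompanying instructor utterance labeled with an intervention type. The field names and label values are illustrative assumptions, not the released dataset's schema.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative only: field names and label strings are invented and do not
# match the released HoloAssist schema; they mirror the annotation types
# described above (action segments, mistake attributes, intervention types).
@dataclass
class Utterance:
    start_s: float
    end_s: float
    text: str
    intervention_type: str        # intervention type label (placeholder value)

@dataclass
class ActionSegment:
    start_s: float
    end_s: float
    verb: str
    noun: str
    attribute: str                # "correct" or "mistake"
    utterances: List[Utterance] = field(default_factory=list)

segment = ActionSegment(
    start_s=12.3, end_s=17.8, verb="insert", noun="toner cartridge",
    attribute="mistake",
    utterances=[Utterance(13.0, 15.2, "Rotate it the other way first.",
                          intervention_type="correct-mistake")],
)
print(segment.attribute, "-", segment.utterances[0].text)
```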

Towards proactive AI assistants

Our work builds on previous advancements in egocentric vision and embodied AI. Unlike earlier datasets, such as those listed in Table 1, HoloAssist stands out due to its multi-person, interactive task-execution setting. Human interaction during task execution provides a valuable resource for designing AI assistants that are anticipatory and proactive, able to provide precisely timed instructions grounded in the environment, in contrast with current “chat-based” AI assistants that wait for you to ask a question. This unique scenario is ideal for developing assistive AI agents and complements existing datasets, which contribute rich knowledge and representation.

Table 1: Comparison of related datasets and simulation platforms. HoloAssist features a multi-person assistive setting, which is a unique addition to existing egocentric (first-person) datasets.

Finally, we evaluated the dataset’s performance on action classification and anticipation tasks, providing empirical results that shed light on the role of different modalities in various tasks. With this dataset, we introduce new tasks and benchmarks focused on mistake detection, intervention type prediction, and 3D hand pose forecasting, all crucial elements for developing intelligent assistants.

Looking forward

This work represents an initial step in broader research that explores how intelligent agents can collaborate with humans in real-world tasks. We're excited to share this work and our dataset with the community, and we anticipate numerous future directions, such as annotating object poses, investigating object-centric models of affordance and manipulation in AI assistance, and AI-assisted planning and state tracking, among others. We believe HoloAssist, along with its associated benchmarks and tools, will benefit future research endeavors focused on building powerful AI assistants for real-world everyday tasks. You can access the HoloAssist dataset and code on GitHub (opens in new tab).

Contributors

Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bohus, Ashley Feniello, Bugra Tekin, Felipe Vieira Frujeri, Marc Pollefeys

