LST-Bench: A new benchmark tool for open table formats in the data lake


This paper was presented at the ACM SIGMOD/Principles of Database Systems Conference (SIGMOD/PODS 2024), the premier forum on large-scale data management and databases.


As organizations grapple with ever-expanding datasets, data lakes have become a vital strategy for scalable and cost-effective data management. The success of these systems depends largely on the file formats used to store the data. Traditional formats compress and organize data efficiently but falter under frequent updates. Open table formats like Delta Lake, Apache Iceberg, and Apache Hudi address this with easier data modification and historical tracking, yet their effectiveness hinges on how well they handle continuous updates, a question that demands extensive and thorough evaluation.

Our paper, “LST-Bench: Benchmarking Log-Structured Tables in the Cloud,” presented at SIGMOD 2024, introduces an innovative tool designed to evaluate the performance of different table formats in the cloud. LST-Bench builds on the well-established TPC-DS benchmark—which measures how efficiently systems handle large datasets and complex queries—and includes features specifically designed for table formats, simplifying the process of testing them under real-world conditions. Additionally, it automatically conducts tests and collects essential data from both the computational engine and various cloud services, enabling accurate performance evaluation.

Flexible and adaptive testing

Designed for flexibility, LST-Bench adapts to a broad range of scenarios, as illustrated in Figure 1. The framework incorporates insights from engineers, making it easy to integrate existing workloads like TPC-DS while promoting reusability. For example, each test session establishes a new connection to the data-processing engine and organizes tasks as a series of statements. This setup lets developers run multiple tasks either sequentially within a single session or concurrently across several sessions, reflecting real-world application patterns.

Figure 1. Workload components in LST-Bench and their relationships. A task is a sequence of SQL statements, while a session is a sequence of tasks that represents a logical unit of work or a user session. A phase is a group of concurrent sessions that must be completed before the next phase can start. Lastly, a workload is a sequence of phases.
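
To make this hierarchy concrete, here is a minimal Python sketch of the same structure. The class and field names are illustrative only; LST-Bench defines workloads in its own configuration files rather than through this API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Task:
    """A sequence of SQL statements, e.g., the TPC-DS Load task."""
    statements: List[str]

@dataclass
class Session:
    """A sequence of tasks executed over a single engine connection."""
    tasks: List[Task]

@dataclass
class Phase:
    """Concurrent sessions; all must finish before the next phase starts."""
    sessions: List[Session]

@dataclass
class Workload:
    """An ordered sequence of phases."""
    phases: List[Phase]

# Example: a load phase, then two concurrent sessions of reads and writes.
load = Task(statements=["CREATE TABLE store_sales (...)", "INSERT INTO store_sales ..."])
read = Task(statements=["SELECT ... FROM store_sales ..."])
write = Task(statements=["DELETE FROM store_sales ...", "INSERT INTO store_sales ..."])
workload = Workload(phases=[
    Phase(sessions=[Session(tasks=[load])]),
    Phase(sessions=[Session(tasks=[read]), Session(tasks=[write])]),
])
```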

The TPC-DS workload comprises the following foundational tasks:

  • Load task: Loads data into tables for experimentation.
  • Single User task: Executes complex queries to test the engine’s upper performance limit.
  • Data Maintenance task: Handles data insertions and deletions.

LST-Bench introduces the following tasks specific to table formats:

  • Optimize task: Compacts the data files within a table.
  • Time Travel task: Enables querying data as it appeared at a specified point in the past.
  • Parameterized Custom task: Allows for the integration of user-defined code to create dynamic workflows.

These features enable LST-Bench to evaluate aspects of table formats that are not covered by TPC-DS, providing deeper insights into their performance, as shown in Figure 2.

Figure 2. LST-Bench expands on TPC-DS by introducing a flexible workload representation and incorporating extensions that help users gain insights into table formats previously overlooked by the original benchmark. Example workloads assess the handling of frequent data modifications over time, table optimization under modifications of varying sizes, concurrent reading and writing sessions, queries across different points in time, and the impact of batch-size variations on read query performance.
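
For a flavor of what the format-specific tasks execute, the snippet below sketches representative statements as Python string constants. These are illustrative Spark SQL forms for Delta Lake and Apache Iceberg, not LST-Bench’s actual task templates; the exact syntax varies by engine and table format.

```python
# Illustrative only: syntax varies by engine and table format.

# Optimize task: compact the small data files in a table.
OPTIMIZE_DELTA = "OPTIMIZE store_sales"
OPTIMIZE_ICEBERG = "CALL catalog.system.rewrite_data_files('db.store_sales')"

# Time Travel task: query the table as it appeared at a past point in time.
TIME_TRAVEL_DELTA = (
    "SELECT COUNT(*) FROM store_sales TIMESTAMP AS OF '2024-01-01 00:00:00'"
)
TIME_TRAVEL_ICEBERG = (
    "SELECT COUNT(*) FROM db.store_sales "
    "FOR SYSTEM_TIME AS OF TIMESTAMP '2024-01-01 00:00:00'"
)
```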

A degradation rate metric to measure stability

In addition to these workload extensions, LST-Bench introduces new metrics to evaluate table formats both comprehensively and fairly. It retains the traditional metric categories like performance, storage, and compute efficiency, and it adds a new stability metric called degradation rate. This new metric specifically addresses the impact of accumulating small files in the data lake—a common issue arising from frequent, small updates—providing an assessment of the system’s efficiency over time.

The degradation rate is computed by dividing a workload into phases and repeating those phases over multiple iterations. The degradation rate \(S_{DR}\) is defined as follows:

\[
S_{DR} = \frac{1}{n} \sum_{i=1}^{n} \frac{M_i - M_{i-1}}{M_{i-1}}
\]

Here, \(M_i\) represents the performance or efficiency metric value of the \(i^{th}\) iteration of a workload phase, and \(n\) is the total number of iterations of that phase. Intuitively, \(S_{DR}\) is the rate at which a metric grows or shrinks, reflecting the cumulative effects of changes in the underlying system’s state. This rate provides insight into how quickly a system degrades over time: a stable system exhibits a low \(S_{DR}\), indicating minimal degradation.
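
The definition translates directly into code. Below is a small, self-contained Python helper (not part of LST-Bench itself) that computes \(S_{DR}\) from a series of per-iteration measurements:

```python
def degradation_rate(metric_values):
    """Mean relative change of a metric across consecutive iterations.

    metric_values: [M_0, M_1, ..., M_n], one measurement (e.g., phase
    latency in seconds) per iteration of a workload phase.
    """
    n = len(metric_values) - 1
    if n < 1:
        raise ValueError("need at least two measurements")
    return sum(
        (metric_values[i] - metric_values[i - 1]) / metric_values[i - 1]
        for i in range(1, n + 1)
    ) / n

# Latency grows 10% each iteration -> S_DR = 0.10, signaling instability.
assert abs(degradation_rate([100.0, 110.0, 121.0]) - 0.10) < 1e-9
```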

LST-Bench implementation

LST-Bench features a Java-based client application that runs SQL workloads on various engines. Users can define libraries of tasks, sessions, and phases; reference those libraries in their workload definitions; add new task templates; or create entirely new task libraries to model specific scenarios.

LST-Bench also includes a processing module that consolidates experimental results and calculates metrics to provide insights into table formats and engines. It uses both internal telemetry from LST-Bench and external telemetry from cloud services, such as resource utilization, storage API calls, and network I/O volume. The metrics processor offers multiple visualization options, including notebooks and a web app, to help users analyze performance data effectively.

Figure 3. The LST-Bench tool components and execution model. The client application connects to engines via dedicated drivers, while the metrics processor gathers telemetry from the client application, the engines, and other cloud services, then aggregates and visualizes it in a notebook or web app.

Implications and looking ahead

LST-Bench integrates seamlessly into the testing workflows of the Microsoft Fabric warehouse, allowing the team to rigorously assess engine performance, evaluate releases, and identify issues. This leads to a more reliable and optimized user experience on the Microsoft Fabric data analytics platform. Additionally, LST-Bench holds promise as a foundational tool for various Microsoft initiatives. It is currently instrumental in research projects focused on improving data organization for table formats, with the goal of increasing the performance of customer workloads on Microsoft Fabric. LST-Bench is also being used to evaluate the performance of table formats converted using Apache XTable (Incubating), an open-source tool designed to prevent data silos within data lakes.

LST-Bench is open source, and we welcome contributors to help expand the tool and make it even more effective for organizations evaluating their table formats.


Acknowledgements

We would like to thank Joyce Cahoon and Yiwen Zhu for their valuable discussions on the stability metric, and Jose Medrano and Emma Rose Wirshing for their feedback on LST-Bench and their work on integrating it with the Microsoft Fabric Warehouse.


Microsoft Research Forum Episode 3: Globally inclusive and equitable AI, new use cases for AI, and more

In the latest episode of Microsoft Research Forum, researchers explored the importance of globally inclusive and equitable AI, shared updates on AutoGen and MatterGen, and presented novel use cases for AI, including industrial applications and the potential of multimodal models to improve assistive technologies.

Below is a brief recap of the event, including select quotes from the presentations. Full replays of each session and presentation will be available soon. 

Keynote: Building Globally Equitable AI


Jacki O’Neill, Lab Director, Microsoft Research Africa, Nairobi 

Jacki O’Neill discussed the importance of creating globally equitable generative AI. She addressed the technical and sociotechnical challenges that must be tackled to positively transform the future of work worldwide.

“We’re at the very early stage of generative AI and the impacts it will have on work. This is a fast-moving field, and there’s an immense opportunity to take control of the agenda and build truly globally equitable AI systems. This requires ensuring that diverse contexts and applications, with their diverse datasets, drive the development of generative AI.”

Panel discussion: Generative AI for Global Impact: Challenges and Opportunities

Jacki O’Neill, Lab Director, Microsoft Research Africa, Nairobi (host)
Sunayana Sitaram, Principal Researcher, Microsoft Research India
Daniela Massiceti, Senior Researcher, Microsoft Research Cambridge
Tanuja Ganu, Principal Research SDE Manager, Microsoft Research India

Microsoft researchers discussed the challenges and opportunities of making AI more inclusive and impactful for everyone—from data that represents a broader range of communities and cultures to novel use cases for AI that are globally relevant.

“How can we take this power of generative AI and empower every individual, every individual across the globe—the people who are coming from different nationalities, different ethnicities, cultures, as well as with varied technology access and financial affordability?”

—Tanuja Ganu, Principal Research SDE Manager, Microsoft Research India

“One of the solutions that we’ve been using is to actually design with ‘human in the loop’ in mind because we know that these technologies are not perfect. And so, we really want to figure out ways in which humans and AI systems can work together in order to create the most effective outcome.”

—Sunayana Sitaram, Principal Researcher, Microsoft Research India

“We really need multidisciplinary research that goes beyond anything that we’ve done before, involving researchers and practitioners and community members. And it’s important to remember that machine learning engineers and researchers on their own can’t solve the problem of building globally equitable generative AI. This is something that we really need to do in a large scale.”

—Jacki O’Neill, Lab Director, Microsoft Research Africa, Nairobi 

“An estimated 1.3 billion people—around 16 percent of the global population—live with some level of disability today. So, I think it’s really exciting to see these generative AI applications coming online for these communities.” 

“As we look to this next decade of generative AI solutions, I really hope to see that we’re going to see more personalized AI models and solutions come through much more strongly, solutions where you as the user have much more control, much more agency, around how your model works for you.” 

—Daniela Massiceti, Senior Researcher, Microsoft Research Cambridge

Lightning talk: Insights into the Challenges and Opportunities of Large Multi-Modal Models for Blind and Low Vision Users: A Case Study on CLIP


Daniela Massiceti, Senior Researcher, Microsoft Research Cambridge

Daniela Massiceti explored the transformative potential of multimodal models such as CLIP for assistive technologies. Specifically focusing on the blind/low-vision community, the talk explored the current distance from realizing this potential and the advancements needed to bridge this gap.

“Today’s AI models hold incredible potential for assisting the Blind community—from text recognition to object identification to question answering. Apps like Seeing AI are already deploying some of these AI features. But there is potential for much more.”

Lightning talk: Driving Industry Evolution: Exploring the Impact of Generative AI on Sector Transformation


Jiang Bian, Senior Principal Research Manager, Microsoft Research Asia

Jiang Bian discussed how generative AI transforms industries by bridging gaps between AI capabilities and industrial needs.

“In our dialogues with strategic partners, we have identified crucial gaps in current generative AI capabilities versus the specific needs of industry applications. These include a too-narrow focus on human-like AI but not critical industry applications, limitations in processing complex and noisy data, and concerns about reliability in complex decision-making scenarios. Our research is crucial in addressing these limitations and amplifying the underappreciated potential of generative AI in high-value sectors.” 

Lightning talk: MatterGen: A Generative Model for Materials Design


Tian Xie, Principal Research Manager, Microsoft Research

Tian Xie described MatterGen, a generative model that enables the design of new inorganic materials based on a broad range of property conditions required by the application, aiming to shift the traditional paradigm of materials design with generative AI.

“Traditionally, materials design is conducted by search-based methods. We search through a list of candidates and gradually filter them using a list of design criteria for the application. Like for batteries, we need the materials to contain lithium, to be stable, to have a high lithium-ion conductivity, and each filtering step can be conducted using simulation-based methods or AI emulators. At the end, we get five to 10 candidates that we’re sending to the lab for experimental synthesis.” 

“In MatterGen, we hope to rethink this process with generative AI. We’re aiming to directly generate materials given the design requirements for the target application, bypassing the process of searching through candidates. You can think of it as using text-to-image generative models like DALL-E to generate the images given a prompt rather than needing to search through the entire internet for images via a search engine.” 

Lightning talk: AutoGen Update: Complex Tasks and Agents


Adam Fourney, Principal Researcher, Microsoft Research AI Frontiers 

Adam Fourney discussed the effectiveness of using multiple agents, working together, to complete complex multi-step tasks. He showcased their capability to outperform previous single-agent solutions on benchmarks like GAIA, utilizing customizable arrangements of agents that collaborate, reason, and utilize tools to achieve complex outcomes.

“We’re starting to tackle increasingly more complex benchmarks and real-world scenarios with this configuration. And we’re really excited about opportunities to introduce new agents that, for example, learn and self-improve with experience; that understand images and screenshots a little better for maybe more effective web surfing or use of interfaces; and that are maybe a bit more systematic about exploring that solution space. So rather than just updating that ledger and then restarting when they get stuck, they can be a bit more pragmatic about the strategies that they’re employing.”


Microsoft at FAccT 2024: Advancing responsible AI research and practice


The integration of AI and other computational technologies is becoming increasingly common in high-stakes sectors such as finance, healthcare, and government, where their capacity to influence critical decisions is growing. While these systems offer numerous benefits, they also introduce risks, such as entrenching systemic biases and reducing accountability. The ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT 2024) tackles these issues, bringing together experts from a wide range of disciplines who are committed to the responsible development of computational systems.

Microsoft is proud to return as a sponsor of ACM FAccT 2024, underscoring our commitment to supporting research on responsible AI. We’re pleased to share that members of our team have taken on key roles in organizing the event, contributing to the program committee and serving as a program co-chair. Additionally, seven papers by Microsoft researchers and their collaborators have been accepted to the program, with “Akal badi ya bias: An exploratory study of gender bias in Hindi language technology” receiving the Best Paper Award.

Collectively, these research projects emphasize the need for AI technologies that reflect the Microsoft Responsible AI principles of accountability, inclusiveness, reliability and safety, fairness, transparency, and privacy and security. They underscore the importance of addressing potential risks and harms associated with deployment and usage. This post highlights these advances.


Paper highlights

A framework for exploring the consequences of AI-mediated enterprise knowledge access and identifying risks to workers

Anna Gausen, Bhaskar Mitra, Siân Lindley

Recent AI developments, especially LLMs, are significantly impacting organizational knowledge access and reshaping workplaces. These AI systems pose risks through their interaction with organizational power dynamics. This paper introduces the Consequence-Mechanism-Risk framework to help identify risks to workers, categorizing them into issues related to value, power, and wellbeing. The framework is intended to help practitioners mitigate these risks, and it can be applied to other technologies as well, enabling better protection for workers.

A structured regression approach for evaluating model performance across intersectional subgroups

Christine Herlihy, Kimberly Truong, Alex Chouldechova, Miro Dudík

Disaggregated evaluation is a process used in AI fairness assessment that measures AI system performance across different subgroups. These subgroups are defined by a mix of demographic or other sensitive attributes. However, the sample size for intersectional subgroups is often very small, leading to their exclusion from analysis. This work introduces a structured regression approach for more reliable system performance estimates in these subgroups. Tested on two publicly available datasets and several variants of semi-synthetic data, this method not only yielded more accurate results but also helped to identify key factors driving performance differences. 

Akal badi ya bias: An exploratory study of gender bias in Hindi language technology

Best Paper Award

Rishav Hada, Safiya Husain, Varun Gumma, Harshita Diddee, Aditya Yadavalli, Agrima Seth, Nidhi Kulkarni, Ujwal Gadiraju, Aditya Vashistha, Vivek Seshadri, Kalika Bali

Existing research on gender bias in language technologies primarily focuses on English, often overlooking non-English languages. This paper introduces the first comprehensive study on gender bias in Hindi, the third most spoken language globally. Employing diverse techniques and field studies, the authors expose the limitations in current methodologies and emphasize the need for more context-specific and community-centered research. The findings deepen the understanding of gender bias in language technologies in Hindi and lay the groundwork for expanded research into other Indic languages.

“I’m not sure, but…”: Examining the impact of large language models’ uncertainty expression on user reliance and trust

Sunnie S. Y. Kim, Q. Vera Liao, Mihaela Vorvoreanu, Stephanie Ballard, Jennifer Wortman Vaughan

LLMs can produce convincing yet incorrect responses, potentially misleading users who rely on them for accuracy. To mitigate this issue, there have been recommendations for LLMs to communicate uncertainty in their responses. In a large-scale study on how users perceive and act on LLMs’ expressions of uncertainty, participants were asked medical questions. The authors found that first-person uncertainty expressions (e.g., “I’m not sure, but…”) decreased participants’ confidence in the system and their tendency to agree with the system’s answers, while increasing the accuracy of their own answers. In contrast, more general uncertainty expressions (e.g., “It’s unclear, but…”) were less effective. The findings stress the importance of more thorough user testing before deploying LLMs.

Investigating and designing for trust in AI-powered code generation tools

Ruotong Wang, Ruijia Cheng, Denae Ford, Tom Zimmermann

As tools like GitHub Copilot gain popularity, understanding the trust software developers place in these applications becomes crucial for their adoption and responsible use. In a two-stage qualitative study, the authors interviewed 17 developers to understand the challenges they face in building trust in AI code-generation tools. Challenges identified include setting expectations, configuring tools, and validating suggestions. The authors also explore several design concepts to help developers establish appropriate trust and provide design recommendations for AI-powered code-generation tools.

Less discriminatory algorithms

Emily Black, Logan Koepke, Pauline Kim, Solon Barocas, Mingwei Hsu

In fields such as housing, employment, and credit, organizations using algorithmic systems should seek to use less discriminatory alternatives. Research in computer science has shown that for any prediction problem, multiple algorithms can deliver the same level of accuracy but differ in their impacts across demographic groups. This phenomenon, known as model multiplicity, suggests that developers might be able to find an equally performant yet potentially less discriminatory alternative.

Participation in the age of foundation models

Harini Suresh, Emily Tseng, Meg Young, Mary Gray, Emma Pierson, Karen Levy

The rise of foundation models in public services brings both potential benefits and risks, including reinforcing power imbalances and harming marginalized groups. This paper explores how participatory AI/ML methods, typically context-specific, can be adapted to these context-agnostic models to empower those most affected.

Conference organizers from Microsoft

Program Co-Chair

Alexandra Olteanu 

Program Committee

Steph Ballard 
Solon Barocas 
Su Lin Blodgett*
Kate Crawford 
Shipi Dhanorkar 
Amy Heger
Jake Hofman*
Emre Kiciman*
Vera Liao*
Daniela Massiceti 
Bhaskar Mitra 
Besmira Nushi*
Alexandra Olteanu 
Amifa Raj
Emily Sheng 
Jennifer Wortman Vaughan*
Mihaela Vorvoreanu*
Daricia Wilkinson

*Area Chairs

Career opportunities

Microsoft welcomes talented individuals across various roles at Microsoft Research, Azure Research, and other departments. We are always pushing the boundaries of computer systems to improve the scale, efficiency, and security of all our offerings. You can review our open research-related positions here.


Introducing Aurora: The first large-scale foundation model of the atmosphere


When Storm Ciarán battered northwestern Europe in November 2023, it left a trail of destruction. The low-pressure system associated with Storm Ciarán set new records for England, marking it as an exceptionally rare meteorological event. The storm’s intensity caught many off guard, exposing the limitations of current weather-prediction models and highlighting the need for more accurate forecasting in the face of climate change. As communities grappled with the aftermath, the urgent question arose: How can we better anticipate and prepare for such extreme weather events? 

A recent study by Charlton-Perez et al. (2024) underscored the challenges faced by even the most advanced AI weather-prediction models in capturing the rapid intensification and peak wind speeds of Storm Ciarán. To help address those challenges, a team of Microsoft researchers developed Aurora, a cutting-edge AI foundation model that can extract valuable insights from vast amounts of atmospheric data. Aurora presents a new approach to weather forecasting that could transform our ability to predict and mitigate the impacts of extreme events—including being able to anticipate the dramatic escalation of an event like Storm Ciarán.  

A flexible 3D foundation model of the atmosphere

Figure 1: Aurora is a 1.3 billion parameter foundation model for high-resolution forecasting of weather and atmospheric processes. Aurora is a flexible 3D Swin Transformer with 3D Perceiver-based encoders and decoders. At pretraining time, Aurora is optimized to minimize a loss on multiple heterogeneous datasets with different resolutions, variables, and pressure levels. The model is then fine-tuned in two stages: (1) short-lead time fine-tuning of the pretrained weights and (2) long-lead time (rollout) fine-tuning using Low Rank Adaptation (LoRA). The fine-tuned models are then deployed to tackle a diverse collection of operational forecasting scenarios at different resolutions.

Aurora’s effectiveness lies in its training on more than a million hours of diverse weather and climate simulations, which enables it to develop a comprehensive understanding of atmospheric dynamics. This allows the model to excel at a wide range of prediction tasks, even in data-sparse regions or extreme weather scenarios. By operating at a high spatial resolution of 0.1° (roughly 11 km at the equator), Aurora captures intricate details of atmospheric processes, providing more accurate operational forecasts than ever before—and at a fraction of the computational cost of traditional numerical weather-prediction systems. We estimate that Aurora offers a computational speed-up of roughly 5,000x over the Integrated Forecasting System (IFS), the state-of-the-art numerical forecasting system.

Beyond its impressive accuracy and efficiency, Aurora stands out for its versatility. The model can forecast a broad range of atmospheric variables, from temperature and wind speed to air-pollution levels and concentrations of greenhouse gases. Aurora’s architecture is designed to handle heterogeneous gold-standard inputs and generate predictions at different resolutions and levels of fidelity. The model consists of a flexible 3D Swin Transformer with Perceiver-based encoders and decoders, enabling it to process and predict a range of atmospheric variables across space and pressure levels. By pretraining on a vast corpus of diverse data and fine-tuning on specific tasks, Aurora learns to capture intricate patterns and structures in the atmosphere, allowing it to excel even with limited training data when fine-tuned for a specific task.
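
The second fine-tuning stage relies on Low Rank Adaptation (LoRA), which freezes the pretrained weights and learns only a small low-rank update. As a rough illustration of the general technique (not Aurora’s actual code; the names below are hypothetical), a LoRA-wrapped linear layer looks like this in PyTorch:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a pretrained linear layer with a trainable low-rank update.

    Effective weight: W + (alpha / r) * B @ A, where W is frozen and
    A (r x in_features) and B (out_features x r) are the only trainable
    parameters, keeping fine-tuning cheap.
    """
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Example: adapt a frozen 512 -> 512 projection with ~8k trainable parameters.
layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(4, 512))  # shape (4, 512)
```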

Fast prediction of atmospheric chemistry and air pollution

Figure 2: Aurora outperforms operational CAMS across many targets. (a) Sample predictions for total column nitrogen dioxide by Aurora compared to CAMS analysis. Aurora was initialized with CAMS analysis at 1 Sep 2022 00 UTC. Predicting atmospheric gases correctly is extremely challenging due to their spatially heterogeneous nature. In particular, nitrogen dioxide, like most variables in CAMS, is skewed toward high values in areas with large anthropogenic emissions, such as densely populated areas in East Asia. In addition, it exhibits a strong diurnal cycle; e.g., sunlight reduces background levels via a process called photolysis. Aurora accurately captures both the extremes and background levels. (b) Latitude-weighted root mean square error (RMSE) of Aurora relative to CAMS, where negative values (blue) mean that Aurora is better. The RMSEs are computed over the period Jun 2022 to Nov 2022 inclusive. Aurora matches or outperforms CAMS on 74% of the targets.

A prime example of Aurora’s versatility is its ability to forecast air-pollution levels using data from the Copernicus Atmosphere Monitoring Service (CAMS), a notoriously difficult task due to the complex interplay of atmospheric chemistry, weather patterns, and human activities, as well as the highly heterogeneous nature of CAMS data. By leveraging its flexible encoder-decoder architecture and attention mechanisms, Aurora effectively processes and learns from this challenging data, capturing the unique characteristics of air pollutants and their relationships with meteorological variables. As a result, Aurora produces accurate five-day global air-pollution forecasts at 0.4° spatial resolution, outperforming state-of-the-art atmospheric chemistry simulations on 74% of all targets and demonstrating its adaptability to a wide range of environmental prediction problems, even in data-sparse or highly complex scenarios.

Data diversity and model scaling improve atmospheric forecasting

One of the key findings of this study is that pretraining on diverse datasets significantly improves Aurora’s performance compared to training on a single dataset. By incorporating data from climate simulations, reanalysis products, and operational forecasts, Aurora learns a more robust and generalizable representation of atmospheric dynamics. It is thanks to its scale and diverse pretraining corpus that Aurora is able to outperform state-of-the-art numerical weather-prediction models and specialized deep-learning approaches across a wide range of tasks and resolutions.

Figure 3: Pretraining on diverse data and increasing model size improves performance. (a) Performance versus ERA5 2021 at 6h lead time for models pretrained on different dataset configurations (i.e., no fine-tuning) labeled C1-C4. The root mean square errors (RMSEs) are normalized by the performance of the ERA5-pretrained model (C1). Adding low-fidelity simulation data from CMIP6 (i.e., CMCC and IFS-HR) improves performance almost uniformly (C2). Adding even more simulation data improves performance further on most surface variables and for the atmospheric levels present in this newly added data (C3). Finally, configuration C4, which contains good coverage of the entire atmosphere and also contains analysis data from GFS, achieves the best overall performance with improvements across the board. (b) Pretraining on many diverse data sources improves the forecasting of extreme values at 6h lead time across all surface variables of IFS-HRES 2022. Additionally, the results also hold on wind speed, which is a nonlinear function of 10U and 10V. (c) Bigger models obtain lower validation loss for the same amount of GPU hours. We fit a power law that indicates roughly a 5% reduction in the validation loss for every doubling of the model size.

A direct consequence of Aurora’s scale, both in terms of architecture design and training data corpus, as well as its pretraining and fine-tuning protocols, is its superior performance over the best specialized deep learning models. As an additional validation of the benefits of fine-tuning a large model pretrained on many datasets, we compare Aurora against GraphCast — pretrained only on ERA5 and currently considered the most skillful AI model at 0.25-degree resolution and lead times up to five days. Additionally, we include IFS HRES in this comparison, the gold standard in numerical weather prediction. We show that Aurora outperforms both when measured against analysis, weather station observations, and extreme values. 

Figure 4: Aurora outperforms operational GraphCast across the vast majority of targets. (a) Scorecard versus GraphCast at 0.25-degrees resolution. Aurora matches or outperforms GraphCast on 94% of targets. Aurora obtains the biggest gains (40%) over GraphCast in the upper atmosphere, where GraphCast performance is known to be poor. Large improvements up to 10%-15% are observed at short and long lead times. The two models are closest to each other in the lower atmosphere at the 2-3 day lead time, which corresponds to the lead time GraphCast was rollout-finetuned on. At the same time, GraphCast shows slightly better performance up to five days and at most levels on specific humidity (Q). (b) Root mean square error (RMSE) and mean absolute error (MAE) for Aurora, GraphCast, and IFS-HRES as measured by global weather stations during 2022 for wind speed (left two panels) and surface temperature (right two panels). (c) Thresholded RMSE for Aurora, GraphCast and IFS-HRES normalized by IFS-HRES performance. Aurora demonstrates improved prediction for the extreme values, or tails, of the surface variable distributions. In each plot values to the right of the center line are cumulative RMSEs for targets found to sit above the threshold, and those to the left represent target values sitting below the threshold.

A paradigm shift in Earth system modeling 

The implications of Aurora extend far beyond atmospheric forecasting. By demonstrating the power of foundation models in the Earth sciences, this research paves the way for the development of comprehensive models that encompass the entire Earth system. The ability of foundation models to excel at downstream tasks with scarce data could democratize access to accurate weather and climate information in data-sparse regions, such as the developing world and polar regions. This could have far-reaching impacts on sectors like agriculture, transportation, energy harvesting, and disaster preparedness, enabling communities to better adapt to the challenges posed by climate change. 

As the field of AI-based environmental prediction evolves, we hope Aurora will serve as a blueprint for future research and development. The study highlights the importance of diverse pretraining data, model scaling, and flexible architectures in building powerful foundation models for the Earth sciences. With continued advancements in computational resources and data availability, we can envision a future where foundation models like Aurora become the backbone of operational weather and climate prediction systems, providing timely, accurate, and actionable insights to decision-makers and the public worldwide. 

Acknowledgements

We are grateful for the contributions of Cristian Bodnar, a core contributor to this project.


What’s Your Story: Weishung Liu


In the Microsoft Research Podcast series What’s Your Story, Johannes Gehrke explores the who behind the technical and scientific advancements helping to reshape the world. A systems expert whose 10 years with Microsoft spans research and product, Gehrke talks to members of the company’s research community about what motivates their work and how they got where they are today.

In this episode, Gehrke is joined by Principal PM Manager Weishung Liu. Liu brings product development and management expertise honed at companies such as Disney, Fluke, and SpaceX to her role at Microsoft, where she helped develop the real-time video analytics platform Watch For and today empowers teams within Microsoft Research to maximize their reach. She talks about how being more homebound as a child cultivated the love of people and stories that underlies her professional pursuits and how she landed in tech despite efforts to “rebel” against the expectations that come with growing up in Silicon Valley.


Transcript

[SPOT]

WEISHUNG LIU: Hey, listeners. I’m Weishung Liu, principal PM manager with Microsoft Research and today’s podcast guest. Before we get started, I want to tell you about Microsoft Research Forum. It’s a series of discussions and talks examining how the rapid advances in AI are impacting science and technology research. The next episode is June 4, and colleagues of mine from around Microsoft Research are participating. I highly recommend checking it out. You can learn more and register now at aka.ms/MyResearchForum. All right, here’s today’s show …

[END OF SPOT] [TEASER] 

[MUSIC PLAYS UNDER DIALOGUE] 

WEISHUNG LIU: I’ve always felt like I want the things that I work on to create joy in people. … The fact that I can still be here and create impact and do meaningful work and, you know, work on things that create joy and positively impact society, it speaks to me like stories speak to me.

[TEASER ENDS]

JOHANNES GEHRKE: Microsoft Research works at the cutting edge. But how much do we know about the people behind the science and technology that we create? This is What’s Your Story, and I’m Johannes Gehrke. In my 10 years with Microsoft, across product and research, I’ve been continuously excited and inspired by the people I work with, and I’m curious about how they became the talented and passionate people they are today. So I sat down with some of them. Now, I’m sharing their stories with you. In this podcast series, you’ll hear from them about how they grew up, the critical choices that shaped their lives, and their advice to others looking to carve a similar path.

[MUSIC FADES]


In this episode, I’m talking with Principal PM Manager Weishung Liu. Wei has used her love of storytelling and interest in people and their motivations to deliver meaningful products and customer experiences. This includes the creation of a successful line of Disney plush toys and contributions to the satellite internet system Starlink. With Microsoft, she helped develop Watch For, a real-time video analytics platform that has gone on to enhance gaming via streaming highlights and to support content moderation in products such as Xbox. Today, she’s facilitating connections and devising strategies to empower teams within Microsoft Research to maximize their reach. Here’s my conversation with Wei, beginning with her childhood in Silicon Valley.

JOHANNES GEHRKE: Hi, Wei. Welcome to What’s Your Story. You’re our principal PM manager here in the lab, and we’ll talk in a little while about, you know, what you’re doing here right now, but maybe let’s start with, how did you actually end up in tech? Where did you grow up?

WEISHUNG LIU: Oh, wow. OK. So this is a very long, long and, like, nonlinear story about how I got into tech. So I grew up in Silicon Valley, which one would assume means just, like, oh, yes, you grew up in Silicon Valley; therefore, you must be in the STEM field, and therefore, you will be in tech for the rest of your life.

GEHRKE: Yep, that’s, sort of, a too familiar a story.

LIU: That’s a very linear story. And I totally actually wanted to rebel against that whole notion of going into tech. So I grew up in Silicon Valley and thought, like, man, I want to not do STEM.

GEHRKE: So did your parents want you to be either a doctor or engineer? Is that the … ?

LIU: Absolutely. It was either a doctor, engineer, or lawyer. So thankfully my sister went the PhD in psychology route, so she, kind of, checked that box for us. And so I was a little bit more free to pursue my very, very, very wide variety of interests. So a little bit of personal information about me. So I grew up a very sick child, and so I was hospitalized a lot. I was in the ER a lot. But that actually afforded me a lot of opportunities to be, sort of, an indoor-only child of reading and playing video games and all sorts of things that I would say, like, expanded my worldview. Like, it was just all sorts of different stories. Like, reading has stories; video games have stories.

GEHRKE: Tell us a story about reading and a story about video games. What …

LIU: Oh my goodness …

GEHRKE: … were your favorite set of books?

LIU: I was really interested in, like, historical fiction at the time. One book that I remember reading about—oh my gosh, it’s a very famous book, and I don’t remember the name anymore. However, it was about a young girl’s perspective of being, living in an internment camp, the Japanese internment camps, back during World War II, I believe, after Pearl Harbor.[1] And it was just kind of her diary and her perspective. It was almost like Diary of Anne Frank but from a Japanese American girl’s perspective instead. And I just loved, kind of, reading about different viewpoints and different eras and trying to understand, like, where do we overlap, how do things change over time, how does history repeat itself in some ways? And, and I love that. And then video games. So I was really into Japanese RPGs back in the day. So it’s funny. I started … my first console was a Mattel Intellivision II, and then it gradually went up to like Nintendo, Super Nintendo, all those, all those consoles. But I had a friend who I used to play RPGs with …

GEHRKE: So these were network RPGs or individual RPGs?

LIU: These were individual RPGs. This is, you know, when I was around 10, the internet appeared, so it probably dates me a little bit. Every time a new RPG came out like by—the company is now called Square Enix but back then it was called SquareSoft—or Nintendo like Zelda, he and I would immediately go out and buy the game or, you know, convince our parents at the time to buy the game, and then we would compete. So, like, this is not couch co-op; he was actually in Texas.

GEHRKE: Like long-distance co-op?

LIU: This is long-distance, long-distance gaming where we would compete to see who would beat the game first.

GEHRKE: Wow.

LIU: No, you’re not allowed to use walkthroughs. And he almost always beat me.

GEHRKE: But these games are like 60-hour, 80-hour games?

LIU: Yeah, like 60- or 80-hour games, but, like, you know, we got so good at them that, well, you had to figure out like how do you, kind of, bypass and get through the main quest as fast as possible. So that was always—

GEHRKE: So any of the side quests and things like that just … ?

LIU: Yeah, oh, yeah, no. So I’m actually a huge completionist, though, so I’d always go back after and do all the side quests to get, you know, we’ll just say “100 percent” achievement. I’m a little bit of an achievement machine that way. But so, like, that kind of stuff was always super fun for me. And so I spent so much of my time then—because I was, kind of, more homebound a lot—just exploring and being curious about things. And, and that got me into art and into design, and I thought, man, I’m going to be an architect someday because I love designing experiences, like spaces for people.

GEHRKE: You thought at that point in time like a real, like a building architect or an architect for like virtual worlds or so … ?

LIU: No, real, like a real physical space that people inhabit and experience. And so, like, I avoided as much STEM as I could in school. I couldn’t, just due to where I lived and grew up and the high school requirements that I had. But the minute I went to college, which happened to be at the University of Washington, which has a great architecture program, I was like, I’m never going to take another STEM class in my life.

GEHRKE: So you enrolled as an architecture major?

LIU: I enrolled as an architecture major, and I was like, I will do what we would call the “natural world” credits, which is kind of the STEM-like things. But I would intentionally find things that were not, like, hard science because I’m like, I’m never going to do this again. I’m never going to be in tech. All these people that are so obsessed with tech who, you know, went to MIT and Stanford, and I’m like, no, no, no, I’m going to be an architecture major.

GEHRKE: So you took, like, the physics for poets class or so …?

LIU: Stuff like that, right. [LAUGHS] Very, very similar. But I ended up just loving learning at school, which is very unsurprising. You know, I took, like, an Arabic poetry class. I took a French fairy tales class. And I just, kind of, explored college and all the things that it had to offer in terms of academics so much that I actually ended up deciding to get two degrees: one in industrial design, which is not too far away from architecture. Architecture is like with large spaces, like you build one building or design one building that lasts maybe 100 years. Industrial design, I, kind of, joke about it. It’s, you know, you design smaller form factors that sometimes, if they’re manufactured with plastics, last millions of years, [LAUGHS] and you build millions of them. But then I also ended up getting a degree in comparative religion, as well. Which it meant that, like, my schooling and my class schedules are always a little bit odd because I’d go from, you know, like, the industrial design shop down in our design building and like making things with my hands and working at the bandsaw, and then I’d, you know, rush to this other class where we have like very fascinating philosophical debates about various things in, sort of, the comparative religion space. And I’d write, you know, 10-page essays and … about all sorts of things. And, you know, there’s, like, the study of death is a great example and how different cultures react to death. But, you know, that was as far away from STEM [LAUGHS] as I could have possibly gone.

GEHRKE: Right. I was just thinking, can you maybe explain to our listeners a little bit who may come a little bit more from the STEM field traditionally, what do you study in comparative [religion], and what is the field like?

LIU: So for me, it was really just, like, I took a lot of classes just trying to understand people. I really … and it sounds, kind of, silly to say it that way, but religion is really formed and shaped by people. And so for me, like, the types of classes that I took were, sort of, like studying Western religion, studying Eastern religion, studying the philosophy of religion, like or even—and this still, I still think about it from time to time—how do you define religion? And just even … there’s still so many scholarly debates about how to define, like, what is a “pure” definition of religion, and nobody can really still identify that yet. Is it, you know, because then there’s this distinction of spiritualism and being religious versus something else or just completely made-up, you know, pseudoscience, whatever, right. People have this wide spectrum of things that they describe. But it’s really around learning about the different foundations of religion. And then people tend to specialize. You know, they might specialize in a particular area like Hinduism or, you know, broadly speaking, Eastern religions, or people will, you know, start focusing on Western religions. Or sometimes I think about a specific topic like the intersection of, for example, religion and death or religion and art or even, you know, religion and violence. And there’s a broad spectrum of things that people start specializing in. And it’s very, it’s, sort of, very much in the mind but very much in the heart of how you understand that.

GEHRKE: Yeah, I can see how it even connects to industrial design because there you also want to capture the heart …

LIU: Yes.

GEHRKE: … the hearts of people, right.

LIU: Yep. And that’s kind of how I, how I describe, you know, when people are like, why did you major in that? Like, what do you even do with that? Did you even think about what career you would have with that? I’m like, no, I just really wanted to learn, and I really wanted to understand people. And I felt like religion is one way to understand, sort of, like, sociologically how people think and get into that deep, like, that deep feeling of faith and where does it come from and how does it manifest and how does it motivate people to do things in life. And to your point, it’s very similar to industrial design because you’re, you know, we talk about design thinking and you have to really deeply understand the user and the people that you’re designing for in order to create something that really lasts, that matters to them. So that’s, kind of, my, at least my undergrad experience. And in a very, very brief way, I’ll just kind of walk through or at least tell you the very nonlinear path that I took to get to where I am here now at Microsoft Research. So like the day after I graduated from the University of Washington, I moved to Florida.

GEHRKE: And just as a question: so you graduated from the University of Washington—did you have like a plan, you know, this is like the career I want to have?

LIU: Oh no! So here’s the funny thing about design, and I hope that, you know, my other, the designers who might be watching or listening [LAUGHS] to this might not get upset—hopefully don’t get upset with me about this—is I love the design thinking aspect of design, like understanding why people do the things they do, what types of habits can you build with the products—physical products? I was very obsessed with physical, tangible things at the time. And then I learned through, like, internships and talking to other designers who were, you know, already in the field that that’s not what they do. That they don’t go and like, oh, let’s go talk to people and understand deeply what they do. Like, there’s other people that do that. OK, well, what do you do? Well, I work in, you know, CAD, or I work on SolidWorks, or I do Rhino, and I do surfacing. I’m like, OK, what else? Who decides what gets made? Oh, that’s like, you know, a product manager or product—oh, what’s that? Who? What? What does that even mean? Like, tell me more about that.

GEHRKE: So it’s like the dichotomy that you see even here in the company where the engineers have to, sort of, build the things, but the product managers are …

LIU: But someone else is …

GEHRKE: … in the middle

LIU: … someone else is, kind of, interpreting what the market and the users are saying, what the business is saying. And I was like, I like doing that because that’s more about understanding people and the business and the reason—the why. And so …

GEHRKE: Just before you go to your career, I mean, I must … I have to ask, what are some of the favorite things that you built during your undergrad? Because you said you really like to build physical things.

LIU: Oh my gosh!

GEHRKE: Maybe one or two things that you actually built …

LIU: Yeah …

GEHRKE: … that was, sort of, so fun.

LIU: So one of my projects was actually a Microsoft-sponsored project for one quarter, and all they showed up with—his name’s Steve Kaneko. He retired not too long ago from here. Steve showed up and said, I want you all to design a memory-sharing device.

GEHRKE: Interesting …

LIU: And that was it.

GEHRKE: So what is memory sharing? He didn’t define what that means?

LIU: He didn’t define it because as designers, that was our way of interpret—we had to interpret and understand what that meant for ourselves. And it was a very, very free-form exploration. And I thought … the place that I started from was … at the time, I was like, there’s like 6 or 7 billion people in the world. How many of them do I actually know? And then how many of them do I actually want to know or maybe I want to know better?

GEHRKE: To share a memory with …

LIU: To share my memories with, to share a part of me. Like, memories are …

GEHRKE: Pretty personal.

LIU: … who we are—or not who we are but parts of who we are—and drive who we become in some ways. And so I thought, you know, what would be cool is if you had a bracelet, and the bracelet were individual links, and each individual link was a photo, like a digital photo, very tiny digital photo, of something that you chose to share. And so, you know, I designed something at the time … like, the story I told was, like, well, you know, this woman who’s young decided to go to, you know, she’s taking the bus, and she put on her, like, “I wish to go to Paris” kind of theme, right. So she had a bunch of Parisian-looking things or something in that vein, right. And, you know, she gets on the bus and her bracelet vibrates. There’s, like, a haptic reaction from this bracelet. And that means that there’s someone else on the bus with this, you know, with a bracelet with their memories. It’s kind of an indicator that people want to share their stories with someone else. And, you know, wouldn’t it be great if, you know, this woman now sits down on the bus, because she sits next to the person who’s wearing it. Turns out to be an elderly woman who’s wearing, coincidentally, you know, her Paris bracelet, but it’s of her honeymoon of her deceased husband from many years ago. And, you know, like, think of the power of the stories that they could share with each other. That, you know, this woman, elderly woman, can share with, you know, this younger woman, who has aspirations to go, and the memories and the relationship that they can build from that. And so that was, kind of, my memory-sharing device at the time.

GEHRKE: I mean, it’s super interesting because, I mean, the way I think about this is that we have memory-sharing applications now like Facebook and Instagram and TikTok and so on, but they, the algorithm decides really …

LIU: Yes …

GEHRKE: … who to share it with and where and why to share it. Whereas here, it’s proximity, right? It somehow leads to this physical and personal connection afterwards, right? The connection is not like, OK, suddenly on my bracelet, her stories show up …

LIU: Yes …

GEHRKE: … but, you know, maybe we sit next to each other on the bus, and it vibrates, and then we start a conversation.

LIU: Exactly. It’s you own, you know, whatever content is on that you choose to have on your physical person, but you’re sharing yourself in a different way, and you’re sharing your memories and you’re sharing a moment. And it might just be a moment in time, right. It doesn’t have to be a long-lasting thing. That, you know, this elderly woman can say, hey, there’s this really great bistro that we tried on, you know, this particular street, and I hope it’s still there, because if you go, ask for this person or try this thing out and, like, what an incredible opportunity it is for this other woman, who, you know, maybe she does someday go to Paris and she does find it. And she thinks of that time, like, how grateful she was to have met, you know, this woman on the bus. And just for that brief whatever bus … however long that bus ride was, to have that connection, to learn something new about someone else, to share and receive a part of somebody else who you may never have known otherwise. And then that was, that was what I was thinking of, you know, in terms of a memory-sharing device was memory creates connections or it reinforces connections. So I guess very similarly to my people thing and being fascinated by people, like, this was my way of trying to connect people in a different way, in the space that they inhabit and not necessarily on their devices.

GEHRKE: And then what did Microsoft say to that? Was there like an end-of-quarter presentation?

LIU: Oh, yeah! There was a, there was a, you know, big old presentation. I can’t even remember which building we were at, but I think everybody was just like, wow, this is great. And that was it. [LAUGHTER]

GEHRKE: And that was it. It sounds like a really fascinating device.

LIU: Yeah, it was. And lots of people came up with all sorts of really cool things because everybody interpreted the, I’ll just say, the prompt differently, right.

GEHRKE: Right …

LIU: … And that was my interpretation of the prompt at the time.

GEHRKE: Well, super interesting.

LIU: Yeah.

GEHRKE: Coming back to, so OK, so you’ve done just a bunch of really amazing projects. You, sort of, it seems like you literally lived the notion of liberal education.

LIU: I did. I, like, even now I just love learning. I get my hands on all sorts of weird things. I picked up whittling as a random example.

GEHRKE: What is whittling? Do I even know what that is? [LAUGHS]

LIU: So whittling is basically carving shapes into wood. So … I’m also very accident prone, so there’s, like, lots of gloves I had to wear to protect my hands. But, you know, it was like, oh, I really just want to pick up whittling. And I literally did, you know. You can grab a stick and you can actually buy balsa wood that’s in a, in decent shape. But you can just start carving away at whatever … whatever you would like to form that piece of wood into, it can become that. So I made a cat, and then I made what I jokingly refer to as my fidget toy at home. It’s just a very smooth object. [LAUGHS]

GEHRKE: That you can hold and …

LIU: I just made it very round and smooth and you can just, kind of, like, rub it, and yeah, it’s …

GEHRKE: Super interesting.

LIU: … it’s … I pick up a lot of random things because it’s just fascinating to me. I learned a bunch of languages when I was in school. I learned Coptic when I was in school for no other reason than, hey, that sounds cool; you can read the Dead Sea Scrolls [LAUGHS] when you learn Coptic—OK!

GEHRKE: Wow. And so much, so important in today’s world, right, which is moving so fast, is a love for learning. And then especially directed in some areas.

LIU: Yeah.

GEHRKE: You know, that’s just really an awesome skill.

LIU: Yeah.

GEHRKE: And so you just graduated. You said you moved to Florida.

LIU: Oh, yes, yes. Yes. So, so about a month before this happened, right—it didn’t just spontaneously happen. A month before, I had a good friend from the architecture program who had said, hey, Wei, you know, I’m applying for this role in guest services at Disney. I was like, really? You can do that? And she’s like, yeah, yeah, yeah. So I was like, that sounds really cool. And I, you know, went to, like, the Disney careers site. I’m like one month or two months away from graduating. Still, like, not sure what I’m totally going to do because at that point, I’m like, I don’t think I want to be a designer because I don’t—the part that I love about it, the part that I have passion about, is not in the actual design of the object, but it’s about the understanding of why it needs to exist.

GEHRKE: The interconnection between the people and the design.

LIU: The people and the design, exactly. And so when I found, I found this, like, product development internship opportunity, and I was like, what does that even mean? That sounds cool. I get to …

GEHRKE: At Disney?

LIU: At Disney. And it was, like—and Disney’s tagline, the theme park merchandise’s tagline, was “creating tangible memories.” I was like, oh boy, this just checks all the boxes. So I applied, I interviewed, did a phone interview, and they hired me within 24 hours. They were like, we would like you to come. And I was like, I would absolutely love to move to Florida and work there. So, yeah, the day after I graduated from U-Dub, I drove all the way across the country from Seattle.

GEHRKE: You drove?

LIU: From Seattle with two cats.

GEHRKE: That must have been an interesting adventure by itself.

LIU: Oh, yes. With two cats in the car, let me tell you, it was fascinating. All the way to Florida, Orlando, Florida. And the day that I got there or, no, two days after I got there, I found out that I was going to be working in the toys area. So plush and dolls, which is, like, you can imagine just absolutely amazing. Making, like, stuffed toys that then—because my office was a mile down the road from Disney’s Animal Kingdom and therefore a couple miles away from Magic Kingdom or Hollywood Studios or EPCOT—I could actually go see, I’ll just say, the “fruits of my labor” instantly and not only that. See it bring joy to children.

GEHRKE: So what is the path? So you would design something, and how quickly would it then actually end up in the park? Or how did you, I mean, how did you start the job?

LIU: What did I do there? Yeah, yeah …

GEHRKE: Well, what’s the interface between the people and the design here?

LIU: Yeah … so, so, really, I didn’t actually do any design. There was an entire group called Disney Design Group that does all the designing there. And so what I did was I understood, what do we need to make and why? What memories are we—what tangible memories do we want to create for people? Why does it matter to them? In many ways, it’s, sort of, like, it’s still a business, right. You’re creating tangible memories to generate revenue and increase the bottom line for the company. But … so my role was to understand what trends were happening: what were the opportunities? What were guests doing in the parks? What types of things are guests looking for? What are we missing in our SKU lineup, or stock-keeping-unit lineup, and then in which merchandising areas do they need to happen? And so I, actually, as part of my internship, my manager said, hey, I let every intern every time they’re here come up with any idea they want, and you just have to see it from start to execution—in addition to all the other stuff that I worked on. I was like, sounds good. And I came up with this idea that I was like, you know, it would be cool … Uglydolls was really popular at the time. Designer toys were getting really popular from Kidrobot, which was kind of, like, there was this vinyl thing and you can—it was just decorative of all different art styles on the same canvas. And I was like, you know, what if we did that with Mickey, and then, you know, what if the story that we’re telling is, you know, just for the parks—Walt Disney World and Disneyland—that there were aliens or monsters coming to visit the park, but they wanted to blend in and fit in? Well, how would they do that? Well, they clearly see Mickey heads everywhere, and Mickey is very popular here clearly, and so they try to dress up like Mickey, but they don’t do it quite well. So they got the shape right, but everything else about them is a little bit different, and they all have their own unique personalities and …

GEHRKE: You can tell a story around them …

LIU: You can tell a story—see, it’s all about stories. And then it … I got buy-in from everybody there, like, all the way up to the VP. I had to get brand because I was messing with the brand icon. But, you know, it became an entire line called Mickey Monsters at Disney. I still have them all. There were two—then it went from plush; it became consumables, which are like edible things. It went into key chains. It went, it was super … it was … I probably went a little bit too hard, or I took the, I think, I took the assignment very seriously. [LAUGHS]

GEHRKE: Yep, yep. Well, it seemed to be a huge success, as well.

LIU: Yeah. It did really well in the time that it was there. We did a test, and I was really, really proud of it. But you know, my—what I did though is, you know, very concretely was I started with an idea. I, you know, convinced and aligned with lots of people in various disciplines that this is something that we should try and experiment on. You know, worked with the designers to really design what this could look like. You know, scoped out what types of fabrics because there’s all sorts of different textures out there. Working with, kind of, our sourcing team to understand, like, which vendors do we want to work with. And then typically, in the plush industry, manufacturing back in the day could happen—and in terms of supply chain, manufacturing, and then delivery of product—could take about six months.

GEHRKE: OK … 

LIU: And so when I was there, anything I worked on would, kind of, appear in six months, which is actually very cool. I mean, it’s not like software, where anything you work on is, you’re like boop, compile—oh look [there] it is. It depends on how fast your computer is. You know, it’s pretty instantaneous compared to six months to see the fruits of your labor. But it was a really, just such a great experience. And then seeing, you know, then going to the parks and seeing children with …

GEHRKE: Yeah, the stuff that you …

LIU: … the thing that I worked on, the thing that I had the idea on, and, like, them going like, Mom, I really want this.

GEHRKE: Right …

LIU: You know, we’re not really selling to the kids; we’re, kind of, selling to the parents.

GEHRKE: It’s a bit like this feeling that we can have here at Microsoft, right, if any of our ideas makes it into products …

LIU: Yup …

GEHRKE: … that are then used by 100 million people and hopefully bring them joy and connection.

LIU: Exactly. And that’s why, like, I just think Microsoft is great, because our portfolio is so broad, and so much of our work touches different parts of our lives. And I’ll even pick on, you know, like I have, you know, in my family, my daughter goes to school—clearly, obviously, she would go to school—but she used Flipgrid, now known as Flip, for a while. And I was like, hey, that’s cool. Like, she uses something that, you know, I don’t directly work on, but my company works on.

GEHRKE: Well, and you were involved with it through Watch For, right …

LIU: Yes, I was …

GEHRKE: … which did become the motivation for Flip.

LIU: Yep. Watch For, you know, helps to detect inappropriate content on Flip. And, you know, that’s super cool because now I’m like, oh, the work that I’m doing actually is directly impacting and helping people like my daughter and making a difference and, you know, keeping users safe from content that maybe we don’t want them to see. You know, other areas like Microsoft Word, I’m like, wow, this is a thing. Like, I’m at the company that makes the thing that I’ve used forever, and, you know, like, it’s just fascinating to see the types of things that we can touch here at Microsoft Research, for example. And how, you know, I, you know, Marie Kondo popularized the term “joy,” like, “sparking joy,” but …

GEHRKE: If you look at an item and if it doesn’t sparkle joy …

LIU: If it doesn’t spark joy, right …

GEHRKE: … then you know on which side it goes.

LIU: Exactly. But, but, you know, like, I’ve always felt like I want the things that I work on to create joy in people. And it was very obvious when you make toys that you see the joy on children’s faces with it. It’s a little bit different, but it’s so much more nuanced and rewarding when you also see, sort of, the products that, the types of things that we work on in research create joy. It’s, you know, it’s funny because I mentioned software is instantaneous in many ways, and then, you know, toys takes a little bit longer. But then, you know, in the types of research that we do, sometimes it takes a little bit longer than, a little bit longer [LAUGHS] …

GEHRKE: It takes years sometimes!

LIU: … than six months. Years to pay off. But, like, that return on that investment is so worth it. And, you know, I see that in, kind of, the work that lots of folks around MSR [Microsoft Research] do today. And knowing that even, sort of, the circles that I hang out in now do such crazy, cool, impactful things that help benefit the world. And, you know, it’s funny, like, never say never. I’m in tech and I love it, and I don’t have a STEM background. I didn’t get a STEM background. I didn’t get it, well, I don’t have a STEM degree. Like, I did not go—like, I can’t code my way out of a paper bag. But the fact that I can still be here and create impact and do meaningful work and, you know, work on things that create joy and positively impact society is, like, it speaks to me like stories speak to me.

GEHRKE: I mean, there’s so many elements that come together in what you’re saying. I mean, research is not a game of the person sitting in the lowly corner on her whiteboard, right? But it’s a team sport.

LIU: Yep.

GEHRKE: It requires many different people with many different skills, right? It requires the spark of ingenuity. It requires, you know, the deep scientific insight. It requires then the scaling and engineering. It requires the PM, right, to make actually the connection to the value, and the execution then requires the designer to actually create that joy with the user interface to seeing how it actually fits.

LIU: Exactly. And it’s fascinating that we sometimes talk about research being like a lonely journey. It can be, but it can also be such an empowering collaborative journey that you can build such incredible cool things when you bring people together—cross-disciplinary people together—to dream bigger and dream about new ideas and new ways of thinking. And, like, that’s why I also love talking to researchers here because they all have such unique perspectives and inner worlds and lives that are frankly so different from my own. And I think when they encounter me, they’re like, she’s very different from us, too.

GEHRKE: But I think these differences are our superpower, right, because …

LIU: Exactly. And that’s what brings us together.

GEHRKE: … they have to be bridged and that brings us together. Exactly. So how, I mean, if you think about Microsoft Research as over here. You’re here in Disney in Florida?

LIU: Yes, yes, yes. So …

GEHRKE: You had quite a few stops along the way.

LIU: I did have a lot of stops along the way.

GEHRKE: And very nonlinear also?

LIU: It was also very nonlinear. So Disney took me to the third, at the time, the third-largest toy company in the US, called JAKKS Pacific, where I worked on again, sort of, Disney-licensed and Mattel-licensed products, so “dress up and role play” toys is what we refer to them as. “Dress up” meaning, like, if you go to your local Target or Walmart or whatever, kind of, large store, they will have in their toy sections like dresses for Disney princesses, for example, or Disney fairies. Like, I worked on stuff like that, which is also very cool because, you know, usually around Halloween time here in the US is when I’m like, hey, I know that. And then that, kind of, took me to a video game accessory organization here in Woodinville.

GEHRKE: There’s the connection to tech starting to appear.

LIU: There’s a little bit connection of tech where I was like, I love video games! And I got to work on audio products there, as well, like headphones. And it was the first time I started working on things that, I’ll just say, had electrons running through them. So I had already worked on things that were, like, both soft lines—we refer to a soft line as bags and things that require, like, fabrics and textiles—and then I worked on hard lines, which were things that are more, things that are more physically rigid, like plastics. And so I was like, OK, well, I’ve worked on hard-lines-like stuff, and now I’m going to work on hard lines with electrons running through them. That’s kind of neat. And I learned all sorts of things about electricity. I was like, oh, this is weird and fascinating and circuits and … . And then I was like, well, this is cool, but … what else is there? And it took me to not a very well-known company in some circles, but a company called Fluke Corporation. Fluke is best known for its digital multimeters, and I worked there on their thermal imaging cameras. So it’s, for people who don’t know, it’s kind of like Predator vision. You can see what’s hot; you can see what’s not. It’s very cool. And Fluke spoke to me because their, you know, not only is their tagline “they keep your world up and running”; a lot of the things that Fluke does, especially when I heard stories from, like, electricians and technicians who use Fluke products, are like, this Fluke saved my life. I’m like, it did? What? And they’re like, you know, I was in a high-voltage situation, and I just wasn’t paying attention. I, you know, didn’t ground properly. And then there was an incident. But, you know, my multimeter survived, and more importantly, I survived. And you’re like, wow, like, that’s, that’s really cool. And so while I was at Fluke, they asked me if I wanted to work on a new IoT project. And I was like, I don’t even know what IoT is. “Internet of Things” … like, OK, well, you said “things” to me, and I like things. I like tangible things. Tell me more. And so that was, kind of, my first foray into things that had … of products with electrons on them with user interfaces and then also with software, like pure software, that were running on devices like your smartphones or your tablets or your computers. And so I started learning more about like, oh, what does software development look like? Oh, it’s a lot faster than hardware development. It’s kind of neat. And then that took me to SpaceX, of all places. It was super weird. Like, SpaceX was like, hey, do you want to come work in software here? I was like, but I’m not a rocket scientist. They’re like, you don’t need to be. I was like, huh, OK. And so I worked on Starlink before Starlink was a real thing. I worked on, kind of, the back-office systems for the ISP. I also worked on what we would refer to as our enterprise resource planning system that powers all of SpaceX. It’s called Warp Drive.

GEHRKE: That’s where you got all your software experience.

LIU: That’s where I learned all about software and working on complex systems, also monoliths and older systems, and how do you think about, you know, sometimes zero-fault tolerance systems and also, that also remain flexible for its users so they can move fast. And then from SpaceX, that took me to a startup called Likewise. It’s here in Bellevue. And then from the startup, I was like, I really like those people in Microsoft. I really want to work in research because they come up with all these cool ideas, and then they could do stuff with it. And I’m such an idea person, and maybe I’m pretty good at execution, but I love the idea side of things. And I discovered that over the course of my career, and that’s actually what brought me here to begin with.

GEHRKE: And that’s, sort of, your superpower that you bring now here. So if I think about a typical day, right, what do you do throughout, throughout your day? What is it, what is it to be a PM manager here at MSR?

LIU: So it’s funny because when I was just a PM and not a manager, I was more, kind of, figuring out, how do I make this product go? How do I make this product ship? How do I move things forward and empower organizations with the products that I—people and organizations on the planet to achieve more [with] what I’m working on? And now as a PM manager, I’m more empowering the people in my team to do that and thinking about uniquely like, who are they, what are their motivations, and then how do I help them grow, and then how do I help their products ship, and how do I help their teams cohere? And so really my day-to-day is so much less, like, being involved in the nitty-gritty details of any project at any point in time, but it’s really meeting with different people around Microsoft Research and just understanding, like, what’s going on and making sure that we’re executing on the impactful work that we want to move forward. You know, it’s boring to say it’s—it doesn’t sound very interesting. Like, mostly, it’s emails and meetings and talking, and, you know, talking to people one-on-one, occasionally writing documents and creating artifacts that matter. But more importantly, I would say it’s creating connections, helping uplift people, and making sure that they are moving and being empowered in the way that they feel that—to help them achieve more.

GEHRKE: That’s super interesting. Maybe in closing, do you have one piece of career advice for everybody, you know, anybody who’s listening? Because you have such an interesting nonlinear career, yet when you are at Disney you couldn’t probably … didn’t imagine that you would end up here at MSR, and you don’t know what, like, we had a little pre-discussion. You said you don’t know where you’re going to go next. So what’s your career advice for any listener?

LIU: I would say, you know, if you’re not sure, it’s OK to not be sure, and, you know, instead of asking yourself why, ask yourself why not. If you look at something and you’re like, hey, that job looks really cool, but I am so unqualified to do it for whatever reason you want to tell yourself, ask yourself why not. Even if it’s, you know, you’re going from toys to something in STEM, or, you know, I’m not a rocket scientist, but somehow, I can create value at SpaceX? Like, if you want to do it, ask yourself why not and try and see what happens. Because if you stop yourself at the start, before you even start trying, then you’re never going to find out what happens next.

[MUSIC]

GEHRKE: It’s just such an amazing note to end on. So thank you very much for the great conversation, Wei.

LIU: Yeah. Thanks, Johannes.

GEHRKE: To learn more about Wei or to see photos of her work and of her childhood in Silicon Valley, visit aka.ms/ResearcherStories.

[MUSIC FADES]


[1] Liu notes the book was Journey to Topaz by Yoshiko Uchida and the subsequent book Journey Home.



The Crossroads of Innovation and Privacy: Private Synthetic Data for Generative AI



Introduction

In today’s data-driven world, organizations strive to leverage data to train and adapt AI models. However, this pursuit often faces an important challenge: balancing the value of data with the need to safeguard individuals’ right to privacy and comply with data privacy regulations like the General Data Protection Regulation (GDPR) and the EU AI Act.

Synthetic data has emerged as a powerful solution to privacy and compliance challenges. It allows organizations to create realistic and useful datasets, tailored to specific use cases, without compromising individual privacy. This enables organizations to:

  • Train and adapt AI models: Synthetic data can be used to train and adapt models to specific domains and industries, even when real-world data is limited, or privacy concerns exist.
  • Comply with regulations: Since it doesn’t require user data, synthetic data generation helps organizations adhere to data privacy regulations.
  • Unlock new possibilities: Synthetic data opens doors to innovative AI applications that were previously limited by data availability or privacy constraints.

Microsoft’s Phi-3 small language model (SLM) is a good example of how synthetic data can contribute to responsible AI development, enabling the creation of powerful language models without compromising privacy. Phi-3 leverages a combination of “textbook quality” web data and LLM-generated synthetic content, creating a strategic approach that doesn’t need real-world personal data.

However, synthetic data carries limitations. It can be difficult to artificially generate realistic data that anticipates a wide range of use cases and individual scenarios. Furthermore, synthetic data generated by pre-trained large language models (LLMs) can sometimes reduce accuracy and increase bias on downstream tasks. So, how could we generate synthetic data that accurately captures the diversity and specificity of private data while maintaining strict privacy protections for data contributors?

Differential privacy: A bridge between innovation and privacy

Differentially private (DP) synthetic data generation is a promising solution. It allows developers to pursue innovations in machine learning while prioritizing privacy. The goal of synthetic data generation is to produce data statistically similar to real-world data sources. However, when the data is too similar, replicating uniquely identifying details of the source data, the promise of preserving privacy is compromised. This is where DP can help. DP is a mathematical framework for providing a guarantee that a particular computation is relatively invariant to the addition or removal of a single data contributor. Using DP techniques, researchers can generate synthetic datasets that retain the statistical properties of the original data while ensuring that information that could help identify data contributors remains obscured. 
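For readers who want the formal statement, (ε, δ)-differential privacy can be written as follows, where M is a randomized computation, D and D′ are any two datasets that differ in a single contributor’s data, and S is any set of possible outputs (a minimal formulation; each paper below calibrates ε and δ to its own setting):

    \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S] + \delta

Smaller values of ε and δ mean the output distribution changes less when any one contributor’s data is added or removed, which translates to a stronger privacy guarantee.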

This blog post explores recent advancements in private synthetic data generation. We examine four recently published research papers that propose innovative techniques for generating synthetic data with strong privacy guarantees, while maintaining its usefulness for analytics, training AI models, and other tasks.

In the remainder of this blog post, we describe each approach in more detail, and present experimental results illustrating their value.

Technical deep dive: Differentially private synthetic data generation 

Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe

Generative LLMs offer the opportunity to produce synthetic text by sampling from LLM outputs. One avenue to generating realistic synthetic text is to fine-tune an LLM using representative data. For example, we could consider fine-tuning a pre-trained LLM on a corpus of scientific papers, enabling the model to more readily produce text that captures the knowledge and writing style used in scientific writing. Suppose, however, that we want to produce synthetic text based on a private corpus of documents. What steps can we take to protect the document authors and any sensitive information in their documents? For example, we may want to produce synthetic medical notes, or personal emails. LLMs have a well-known capacity to memorize training examples, and a model with the potential for reproducing samples from the training set might pose significant privacy risks.

In the paper Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe, researchers from Microsoft presented an approach to leveraging a private data corpus for synthetic generation, without compromising the privacy of the data subjects. This approach uses differentially private stochastic gradient descent (DP-SGD) to fine-tune an LLM on the private documents with a strong privacy guarantee. Differentially private model training provides a mathematical guarantee that the trained model parameters, and any subsequent model outputs, are relatively unaffected by the addition or removal of any single user’s training examples.
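As a rough illustration of this recipe, the sketch below fine-tunes a model with DP-SGD using the open-source Opacus library, which implements per-example gradient clipping and noise addition for PyTorch. The tiny classifier and random tensors stand in for an LLM and the private corpus, and the ε, δ, and clipping values are illustrative rather than the paper’s settings:

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset
    from opacus import PrivacyEngine

    # Toy stand-in for the model being fine-tuned; the same recipe applies to an LLM.
    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()

    # Stand-in "private corpus": 512 random feature vectors with binary labels.
    data = TensorDataset(torch.randn(512, 32), torch.randint(0, 2, (512,)))
    train_loader = DataLoader(data, batch_size=64)

    # Wrap the model, optimizer, and loader so that training satisfies (eps, delta)-DP.
    privacy_engine = PrivacyEngine()
    model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(
        module=model,
        optimizer=optimizer,
        data_loader=train_loader,
        target_epsilon=4.0,   # illustrative privacy budget
        target_delta=1e-5,
        epochs=3,
        max_grad_norm=1.0,    # per-example gradient clipping bound
    )

    for _ in range(3):
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()   # Opacus clips per-example gradients and adds noise
            optimizer.step()

    print(f"privacy spent: epsilon = {privacy_engine.get_epsilon(delta=1e-5):.2f}")

Once fine-tuned this way, the model can be prompted to generate synthetic examples, and the DP guarantee on the parameters carries over to anything the model produces.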

The synthetic generation approach described in this work was validated by training on restaurant reviews with varying levels of privacy protection, then prompting the model to generate novel reviews. These reviews were then used for downstream classification tasks, such as sentiment prediction and restaurant genre classification, and the results, which are shown in Table 1, demonstrated only small accuracy penalties compared to training on the raw private data. This approach unlocks a powerful way for realistic synthetic data to be generated from private data without compromising privacy or confidentiality.

Figure 1: By fine-tuning an LLM with differential privacy, the model can be used to generate synthetic examples that resemble the private corpus 
Table image: columns are data type, data generator, ε, rating, and category. The original data scores 0.733 (rating) and 0.775 (category); synthetic data from GPT-2, GPT-2-Medium, and GPT-2-Large scores a few percentage points lower, with ε = 4 rows 1–2 points below their ε = ∞ counterparts and larger models narrowing the gap at ε = 4.
Table 1: Various versions of GPT-2 were trained on restaurant reviews both with (ε=4) and without (ε =∞) a privacy guarantee. These models were used to produce synthetic training sets, which were used to train classification models for review rating and restaurant category, and subsequently evaluated for accuracy on a private hold-out set. The results show that models trained on the synthetic data can achieve accuracy competitive with models trained without a privacy guarantee. 

Differentially Private Synthetic Data via Foundation Model APIs

While the ACL paper demonstrated a robust approach to synthetic data generation, fine-tuning a large model can be impractical. Model training requires significant computing capacity, and some of the most powerful models available are proprietary and not accessible for DP training. Recognizing this challenge, researchers at Microsoft explored whether synthetic data can be generated directly through inference API access to a model, even when the model is untrusted and controlled by a third party. Crucially, the synthetic data should resemble a targeted private corpus and carry a DP guarantee similar to the one achieved in the previous work based on model training. In two separate papers, the authors demonstrate an approach to this problem using a differentially private sampling method called Private Evolution (PE).

Figure 2: Instead of fine-tuning pre-trained models with DP-SGD (top figure), Private Evolution (PE) only requires accessing the inference APIs of a model (bottom figure). Thus, PE is easily compatible with foundation models that are difficult to DP-fine-tune (e.g., because they are too large) or infeasible to fine-tune (e.g., they are only accessible through inference APIs).

Synthetic image generation using foundation model APIs: In Differentially Private Synthetic Data via Foundation Model APIs 1: Images, the authors introduced Private Evolution (PE), an approach that enables DP image synthesis merely through inference APIs of a generative model. PE operates by sampling from a pre-trained diffusion model such as Stable Diffusion, which has no knowledge of the private corpus. PE then iteratively compares these samples to the private corpus, keeps the ones that are most similar to the private corpus, and uses the pre-trained model to generate more such samples. Crucially, the comparison to the private corpus is done with a DP guarantee, so that any information revealed about the private corpus is strictly bounded. Also, all the queries to the foundation model APIs satisfy the same DP guarantee, so that we can safely use APIs provided by (untrusted) third parties. 

Figure 3: Overview of PE. We use two private and synthetic images for illustration. Step 1 (RANDOM_API): we use the model API to generate random images. Step 2: We iteratively go through steps 2.1-2.3 to refine the synthetic images towards the private images. Step 2.1: Each private image votes for its closest synthetic image in the embedding space. In this example, we assume that the bird image gets two votes, and the car image gets zero votes. We then add Gaussian noise to the votes to ensure DP. This gives us the DP Nearest Neighbor Histogram (DP_NN_HISTOGRAM). Step 2.2: We resample the generated images proportional to the histogram. We assume that only the bird image remains. Step 2.3 (VARIATION_API): We use the model API to generate new similar images to the bird image, which are the initial synthetic images in the next iteration.
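In code, one round of the PE loop can be sketched as follows. Here random_api, variation_api, and embed are hypothetical stand-ins for the paper’s RANDOM_API, VARIATION_API, and an embedding model, and sigma would be calibrated to the desired (ε, δ) budget (the calibration itself is not shown):

    import numpy as np

    def private_evolution(private_emb, random_api, variation_api, embed,
                          n_synth=100, iterations=10, sigma=5.0):
        """Sketch of PE: evolve API-generated samples toward a private corpus under DP.

        private_emb: (P, d) array of embeddings of the private examples.
        """
        rng = np.random.default_rng()
        synth = random_api(n_synth)  # Step 1: unconditional samples from the API
        for _ in range(iterations):
            synth_emb = embed(synth)  # (n_synth, d) embeddings of candidates
            # Step 2.1: each private point votes for its nearest synthetic sample;
            # Gaussian noise makes the resulting histogram differentially private.
            dists = np.linalg.norm(private_emb[:, None, :] - synth_emb[None, :, :], axis=-1)
            votes = np.bincount(dists.argmin(axis=1), minlength=n_synth).astype(float)
            votes += rng.normal(0.0, sigma, size=n_synth)
            probs = np.clip(votes, 0.0, None)
            probs = probs / probs.sum() if probs.sum() > 0 else np.full(n_synth, 1.0 / n_synth)
            # Step 2.2: resample the population in proportion to the noisy histogram.
            survivors = rng.choice(n_synth, size=n_synth, p=probs)
            # Step 2.3: ask the API for fresh variations of the surviving samples.
            synth = variation_api([synth[i] for i in survivors])
        return synth

Because each private example casts exactly one vote per round, the Gaussian mechanism bounds the privacy cost of every iteration, and those costs compose into an overall (ε, δ) guarantee.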

Even without doing any model training, PE significantly advances state-of-the-art results on some datasets. For example, on the CIFAR-10 dataset, we achieve an FID score (an image quality measure; smaller is better) ≤ 7.9 at a DP privacy cost of ϵ = 0.67, significantly improving on the previous state of the art of ϵ = 32. In the paper, we also show that PE requires fewer computational resources (GPU hours) than DP fine-tuning to achieve these results.

Figure 4: FID (image quality measure, lower is better) vs. DP privacy cost ϵ on CIFAR10 (δ = 10−5 ). (Un)cond means (un)conditional generation. Ours achieves the best privacy-quality trade-off compared to prior training-based approaches.
Figure 5: Private Evolution-generated samples using CIFAR-10 as the private corpus (ε = 0.67, δ = 10−5). Each row corresponds to one object class.

Synthetic Text Generation using foundation model APIs: the PE approach described above works well for images since it is easy to produce nearby perturbations of promising images. In Differentially Private Synthetic Data via Foundation Model APIs 2: Text, Microsoft researchers explored whether a similar approach could be applied to text. Their method, called Augmented Private Evolution (Aug-PE), operates similarly to the basic PE approach, but leverages the power of a pre-trained LLM to produce variations and re-wordings of input text. Aug-PE also proposes some fundamental algorithmic improvements that may benefit future development of PE. 

Figure 6: Augmented Private Evolution (Aug-PE) leverages a foundational LLM to synthesize text and compare in a privacy-preserving way with a private corpus. Similar to PE for images, in Aug-PE, samples that more closely resemble the private data are retained and refined to produce new synthetic text with a strong privacy guarantee. The illustration shows how we generate DP synthetic reviews for restaurants given two private samples.
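The main change from the image setting is the variation step: rather than perturbing pixels, Aug-PE prompts the LLM for rephrasings. A hypothetical text variation_api (the prompt wording and the llm.complete interface are assumptions, not the paper’s exact prompts) might look like:

    def variation_api(texts, llm):
        """Produce nearby variants of candidate texts by prompting an LLM."""
        prompt = "Rewrite the following restaurant review, keeping its meaning and tone:\n\n{}"
        return [llm.complete(prompt.format(t)) for t in texts]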

Results show that Aug-PE is a promising alternative to DP fine-tuning for DP text synthesis. With the same foundation model, Aug-PE can match or even beat DP fine-tuning in terms of the trade-off between text quality and privacy. Moreover, because Aug-PE only requires inference APIs, it can easily work with the most advanced LLMs, such as GPT-3.5, LLaMA, and Mixtral, to further improve text quality. In terms of computational cost (GPU hours), Aug-PE can achieve up to a 65.7x speedup compared to the DP fine-tuning approach.

Table image: area and rating classification accuracy for PE versus DP fine-tuning across three GPT-2 sizes, several major open-source models, and GPT-3.5. PE on Mixtral shows the strongest area classification accuracy (43.6), and PE on GPT-3.5 the strongest rating classification accuracy (43.1).
Table 2: Results on ICLR 2023 paper reviews (ϵ = 1). We use each method to generate DP synthetic paper reviews and test the utility of the data by training downstream paper-area and rating classifiers, then evaluating their accuracy on real hold-out data (higher is better). Under the same base model (the GPT-2 family), PE achieves results competitive with DP fine-tuning. PE also supports advanced LLMs that may be difficult to use with DP fine-tuning due to large model size or black-box access.

Privacy-Preserving In-Context Learning with Differentially Private Few-Shot Generation

In-context learning is a technique for performing tasks with an LLM by providing a sample of demonstration examples in the prompt of the LLM before presenting it with a specific task. For example, we might show a few movie plots and their genre and ask the LLM to suggest the genre for a particular plot of interest. In-context learning harnesses the strong generalization capabilities of LLMs, but it requires a sample of labeled demonstration examples at inference time. How can we perform in-context learning when the only available labeled examples are private? A naïve solution might be to use the private examples but hide the demonstration prompt from the user. However, the threat posed by jailbreak attacks puts these examples at risk of exposure to a malicious user.

In Privacy-Preserving In-Context Learning with Differentially Private Few-Shot Generation, Microsoft researchers explored how demonstration examples can be synthesized from a private corpus with a privacy guarantee. The method operates by incrementally drawing samples from a token distribution defined by the private examples, with noise added to the distribution. The noise is calibrated to ensure a bound on the privacy lost with each sample. The research demonstrated that in-context learning with these synthetic demonstrations can outperform zero-shot learning (querying a model without any demonstration examples) and comes close to matching the accuracy achieved with no privacy mitigations, as shown in Table 3.

Figure 7: Illustration of DP few-shot generation. The example shows a synthetic demonstration generated token by token for the topic school with a differentially private guarantee. As new tokens are sampled, the private examples inform the sampling probability of each subsequent token, with noise injected to preserve privacy. 
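Schematically, each token is chosen by splitting the private examples into M batches, building one prompt per batch, and aggregating the M next-token distributions with calibrated noise. In the sketch below, get_next_token_probs is a hypothetical helper that returns an LLM’s next-token distribution for a prompt, and sigma would be set by the per-token privacy accounting described in the paper:

    import numpy as np

    def dp_next_token(prompts, get_next_token_probs, sigma, rng=None):
        """One step of DP few-shot generation: noisy aggregation over M prompts."""
        if rng is None:
            rng = np.random.default_rng()
        # Shape (M, vocab_size): one next-token distribution per batch of private examples.
        probs = np.stack([get_next_token_probs(p) for p in prompts])
        noisy = probs.mean(axis=0) + rng.normal(0.0, sigma, size=probs.shape[1])
        return int(noisy.argmax())  # greedy pick from the noisy aggregate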
Table image: accuracy on text classification (AGNews, DBPedia, TREC) and information extraction (MIT-G, MIT-D) for ε = 0 (zero-shot and four-shot) and ε ∈ {1, 2, 4, 8, ∞}. Accuracy generally improves as ε increases, and ε = 8 often outperforms ε = ∞.
Table 3: For classification and information extraction tasks, DP in-context learning achieves accuracy similar to non-private ICL (ϵ =∞) 

Conclusion

Synthetic data generation presents enormous opportunities to develop AI systems without compromising end-user privacy. In this blog post, we have explored recent innovations in synthetic data generation with strong privacy guarantees. These approaches can enable practitioners to produce synthetic data from private data sources while mitigating the risk that private information might be revealed. While these approaches are highly promising, they do have limitations; for example, they are currently limited to producing relatively short text passages. Future work will continue to explore the opportunities presented by these approaches, with the aim of producing increasingly realistic data with strong privacy guarantees.

Acknowledgments: The authors are grateful for the contributions of the co-authors of the papers reviewed in this blog post: Xiang Yue, Xuechen Li, Girish Kumar, Julia McAnallen, Hoda Shajari, Huan Sun, David Levitan, Chulin Xie, Arturs Backurs, Sivakanth Gopi, Da Yu, Harsha Nori, Haotian Jiang, Huishuai Zhang, Yin Tat Lee, Bo Li, Janardhan Kulkarni, Xinyu Tang, Richard Shin, Andre Manoel, and Niloofar Mireshghallah.



Research Focus: Week of May 27, 2024


Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.


Register now for Research Forum on June 4

Join us for Research Forum, an event series that explores recent research advances, bold new ideas, and important discussions with the global research community in the era of general AI.

In Episode 3, researchers at Microsoft emphasize the importance of globally equitable AI, share novel use cases and transformative applications from industry to material design, and provide updates on AutoGen and MatterGen.

Your registration includes access to our live chat with researchers on the event day. 

Episode 3 will air Tuesday, June 4 at 9:00 AM PT.

Generative AI and the Politics of Visibility

Generative AI tools have a remarkable capacity to produce complicated and lengthy texts, with just simple direction from users. AI proponents assert they can help writers, providing creative suggestions, completing half-written sentences or story fragments, and inventing character backstories. But this raises questions about the politics of visibility: what kinds of stories do these tools tend to generate, and what do they generally leave out? Do these tools fully represent diverse or marginalized populations and non-normative communities?

In a recent paper: Generative AI and the Politics of Visibility, a researcher from Microsoft tested three widely available generative AI tools (Bing Chat, ChatGPT, and Google’s Bard, now Gemini) with prompts designed to reveal their normative assumptions, prompting each tool multiple times to track the diversity of outputs to the same query. His research demonstrates that, at least as currently designed and trained, generative AI tools tend to reproduce normative identities and narratives, rarely representing less common arrangements and perspectives unless specifically prompted. When they do generate variety, it is often narrow, maintaining deeper normative assumptions in what remains absent.




ACM MMSys 2024 Bandwidth Estimation in Real Time Communications Challenge

Videoconferencing has become indispensable for everything from global business operations to accessible education, transforming the way people communicate across physical barriers and geographical divides. The quality of experience (QoE) delivered by video conferencing systems depends in part on correctly estimating the capacity of the bottleneck link between the sender and the receiver over time. Bandwidth estimation for real-time communications (RTC) remains a significant challenge, primarily due to the continuously evolving heterogeneous network architectures and technologies. From the first bandwidth estimation challenge hosted by Microsoft at ACM MMSys 2021, researchers learned that bandwidth estimation models trained with reinforcement learning (RL) in simulations to maximize network-based reward functions may not be optimal, due to the sim-to-real gap and the difficulty of aligning network-based rewards with user-perceived QoE. In this year’s ACM MMSys 2024 Bandwidth Estimation in Real Time Communications Challenge, researchers from Microsoft aim to align reward maximization with user-perceived QoE optimization using offline RL and a real-world dataset released by Microsoft Teams. The challenge received enthusiastic participation from both academia and industry. All models submitted to the grand challenge underwent initial evaluation, and top models were further evaluated on a geographically distributed testbed. Challenge results show that by leveraging real-world data and integrating objective audio/video quality scores as rewards, offline RL can facilitate the development of competitive bandwidth estimators for RTC.


Player-Driven Emergence in LLM-Driven Game Narrative

Game creation is a labor-intensive process, with limited automation of non-graphic game elements related to dialogue and narrative structure. These elements are typically hand-coded and rigidly deterministic, with few options presented to the player. Large language models (LLMs) are beginning to show potential in the creation of richer and more expansive narrative spaces. 

In a recent paper: Player-Driven Emergence in LLM-Driven Game Narrative, accepted for presentation at the IEEE Conference on Games 2024, researchers from Microsoft in collaboration with members of the Xbox organization explore how interaction with LLMs can empower players to participate in the evolution of game narratives. As a testbed, they created a text-adventure game in which players attempt to solve a mystery under a fixed narrative premise but can freely interact with non-player characters generated by GPT-4, a state-of-the-art LLM. They recruited 28 gamers to play the game and used GPT-4 to automatically convert the game logs into a node-graph representing the narrative in the player’s gameplay. Through their interactions with the non-deterministic behavior of the LLM, players were able to discover interesting new emergent nodes that were not a part of the original narrative but have potential for being fun and engaging. Players who created the most emergent nodes tended to be those who enjoy games that facilitate discovery, exploration, and experimentation.


Segmentation using large language models: A new typology of American neighborhoods

The U.S. Census Bureau’s American Community Survey (ACS) is the country’s primary source of social and economic data. But much of the data is of low quality, especially at the highest levels of geographic detail (Block Groups). As one zooms in geographically on a map, the resolution of social and economic data decreases, which is counterintuitive; typically, zooming in reveals more detail, not less. Recent changes in the U.S. statistical system have amplified this geographic-demographic resolution trade-off.

In a recent paper: Segmentation using large language models: A new typology of American neighborhoods, researchers from Microsoft present a solution to this problem in the form of an AI-based, open, and reproducible geodemographic classification system using small area estimates from the ACS. They apply a partitioning clustering algorithm to a range of socio-economic, demographic, and built-environment variables. Using an open-source software pipeline ensures adaptability to future data updates. One key innovation is the integration of GPT-4 to generate intuitive cluster descriptions and names. This represents a novel application of natural language processing in geodemographic research and showcases the potential for human-AI collaboration within the geospatial domain.
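A compressed sketch of such a pipeline appears below, assuming a numeric table of ACS variables per Block Group; k-means stands in for the paper’s partitioning algorithm, and the naming prompt and llm interface are illustrative, not the paper’s actual code:

    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    def geodemographic_clusters(acs_features, k=10):
        """Cluster small areas on standardized socio-economic variables."""
        X = StandardScaler().fit_transform(acs_features)
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        return km.labels_, km.cluster_centers_

    def name_clusters(centers, variable_names, llm):
        """Ask an LLM (hypothetical interface) for an intuitive name per cluster."""
        names = []
        for center in centers:
            profile = ", ".join(f"{v}={x:.2f}" for v, x in zip(variable_names, center))
            names.append(llm.complete(f"Suggest a short neighborhood-type name for: {profile}"))
        return names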


From Local to Global: A Graph RAG Approach to Query-Focused Summarization 

The use of retrieval-augmented generation (RAG) to retrieve relevant information from an external knowledge source enables LLMs to answer questions over private and/or previously unseen document collections. However, RAG fails on global questions directed at an entire text corpus, such as: “What are the main themes in the dataset?”, since this is inherently a query-focused summarization (QFS) task, rather than an explicit retrieval task. Prior QFS methods fail to scale to the quantities of text indexed by typical RAG systems.

In a recent preprint: From Local to Global: A Graph RAG Approach to Query-Focused Summarization, researchers from Microsoft propose combining the strengths of these contrasting methods through a Graph RAG approach to question answering over private text corpora that scales with both the generality of user questions and the quantity of source text to be indexed. This approach uses an LLM to build a graph-based text index in two stages: first to derive an entity knowledge graph from the source documents, then to pre-generate community summaries for all groups of closely-related entities. Given a question, each community summary is used to generate a partial response, before all partial responses are again summarized in a final response to the user. For a class of global sensemaking questions over datasets in the 1 million token range, Graph RAG leads to substantial improvements over a naïve RAG baseline for both the comprehensiveness and diversity of generated answers.
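In outline, the two stages might look like the sketch below; the llm helper methods stand in for prompted LLM calls, and Louvain community detection substitutes for the hierarchical grouping used in practice (both are assumptions for illustration):

    import networkx as nx

    def build_graph_index(documents, llm):
        """Stage 1: derive an entity graph; Stage 2: pre-summarize each community."""
        graph = nx.Graph()
        for doc in documents:
            # Hypothetical call: prompt the LLM for (entity, relation, entity) triples.
            for head, relation, tail in llm.extract_triples(doc):
                graph.add_edge(head, tail, relation=relation)
        communities = nx.community.louvain_communities(graph, seed=0)
        return [llm.summarize(graph.subgraph(nodes)) for nodes in communities]

    def global_answer(question, community_summaries, llm):
        """Query time: map each community summary to a partial answer, then reduce."""
        partials = [llm.partial_answer(question, s) for s in community_summaries]
        return llm.combine(question, partials)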

Microsoft Research in the news


Microsoft Announces New Foundation Model For Digital Pathology, Diving Deeper Into Clinical Medicine  

Forbes | May 22, 2024

In partnership with Providence health system and the University of Washington, Microsoft has leveraged its significant work with generative AI to launch GigaPath, the first whole-slide foundation model for digital pathology that has been pre-trained with real-world data.


Spanish mini-satellites bring the internet to isolated areas (en español)  

La Razon | May 17, 2024

The Spanish company Fossa, with help from Microsoft Research, has successfully tested a small satellite weighing less than a kilogram that improves connectivity in places with little or no coverage, a potential boost for the internet of things (IoT).



Ideas: Designing AI for people with Abigail Sellen

Ideas: Designing AI for people with Abigail Sellen

Microsoft Research Podcast | Ideas | Abigail Sellen

Behind every emerging technology is a great idea propelling it forward. In the new Microsoft Research Podcast series, Ideas, members of the research community at Microsoft discuss the beliefs that animate their research, the experiences and thinkers that inform it, and the positive human impact it targets.  

In this episode, host Gretchen Huizinga talks with Distinguished Scientist and Lab Director Abigail Sellen. The idea that computers could be designed for people is commonplace today, but when Sellen was pursuing an advanced degree in psychology, it was a novel one that set her on course for a career in human-centric computing. Today, Sellen and the teams she oversees are studying how AI could—and should—be designed for people, focusing on helping to ensure new developments support people in growing the skills and qualities they value. Sellen explores those efforts through the AI, Cognition, and the Economy initiative—or AICE, for short—a collective of interdisciplinary scientists examining the short- and long-term effects of generative AI on human cognition, organizational structures, and the economy.


Learn more:

AI, Cognition, and the Economy (AICE) 
Initiative page 

Responsible AI Principles and Approach | Microsoft AI  

The Rise of the AI Co-Pilot: Lessons for Design from Aviation and Beyond 
Publication, 2023 

The Myth of the Paperless Office 
Book, 2003

Transcript

[SPOT] 

GRETCHEN HUIZINGA: Hey, listeners. It’s host Gretchen Huizinga. Microsoft Research podcasts are known for bringing you stories about the latest in technology research and the scientists behind it. But if you want to dive even deeper, I encourage you to attend Microsoft Research Forum. Each episode is a series of talks and panels exploring recent advances in research, bold new ideas, and important discussions with the global research community in the era of general AI. The next episode is coming up on June 4, and you can register now at aka.ms/MyResearchForum (opens in new tab). Now, here’s today’s show. 

[END OF SPOT] 

[TEASER]  

[MUSIC PLAYS UNDER DIALOGUE] 

ABIGAIL SELLEN: I’m not saying that we shouldn’t take concerns seriously about AI or be hugely optimistic about the opportunities, but rather, my view on this is that we can do research to get, kind of, line of sight into the future and what is going to happen with AI. And more than this, we should be using research to not just get line of sight but to steer the future, right. We can actually help to shape it. And especially being at Microsoft, we have a chance to do that. 

[TEASER ENDS] 

GRETCHEN HUIZINGA: You’re listening to Ideas, a Microsoft Research Podcast that dives deep into the world of technology research and the profound questions behind the code. I’m Dr. Gretchen Huizinga. In this series, we’ll explore the technologies that are shaping our future and the big ideas that propel them forward.


[MUSIC FADES] 

My guest on this episode is Abigail Sellen, known by her friends and colleagues as Abi. A social scientist by training and an expert in human-computer interaction, Abi has a long list of accomplishments and honors, and she’s a fellow of many technical academies and societies. But today I’m talking to her in her role as distinguished scientist and lab director of Microsoft Research Cambridge, UK, where she oversees a diverse portfolio of research, some of which supports a new initiative centered around the big idea of AI, Cognition, and the Economy, also known as AICE. Abi Sellen, I’m so excited to talk to you today. Welcome to Ideas.

ABIGAIL SELLEN: Thanks! Me, too. 

HUIZINGA: So before we get into an overview of the ideas behind AICE research, let’s talk about the big ideas behind you. Tell us your own research origin story, as it were, and if there was one, what big idea or animating “what if?” captured your imagination and inspired you to do what you’re doing today? 

SELLEN: OK, well, you’re asking me to go back in the mists of time a little bit, but let me try. [LAUGHTER] So I would say, going … this goes back to my time when I started doing my PhD at UC San Diego. So I had just graduated as a psychologist from the University of Toronto, and I was going to go off and do a PhD in psychology with a guy called Don Norman. So back then, I really had very little interest in computers. And in fact, computers weren’t really a thing that normal people used. [LAUGHTER] They were things that you might, like, put punch cards into. Or, in fact, in my undergrad days, I actually programmed in hexadecimal, and it was horrible. But at UCSD, they were using computers everywhere, and it was, kind of, central to how everyone worked. And we even had email back then. So computers weren’t really for personal use, and it was clear that they were designed for engineers by engineers. And so they were horrible to use, people grappling with them, people were making mistakes. You could easily remove all your files just by doing rm*. So the big idea that was going around the lab at the time—and this was by a bunch of psychologists, not just Don, but other ones—was that we could design computers for people, for people to use, and take into account, you know, how people act and interact with things and what they want. And that was a radical idea at the time. And that was the start of this field called human-computer interaction, which is … you know, now we talk about designing computers for people and “user-friendly” and that’s a, kind of, like, normal thing, but back then … 

HUIZINGA: Yeah … 

SELLEN: … it was a radical idea. And so, to me, that changed everything for me to think about how we could design technology for people. And then, if I can, I’ll talk about one other thing that happened … 

HUIZINGA: Yeah, please.

SELLEN: … during that time. So at that time, there was another gang of psychologists, people like Dave Rumelhart, Geoff Hinton, Jay McClelland, people like that, who were thinking about, how do we model human intelligence—learning, memory, cognition—using computers? And so these were psychologists thinking about, how do people represent ideas and knowledge, and how can we do that with computers?  

HUIZINGA: Yeah … 

SELLEN: And this was radical at the time because cognitive psychologists back then were thinking about … they did lots of, kind of, flow chart models of human cognition. And people like Dave Rumelhart did networks, neural networks, … 

HUIZINGA: Ooh … 

SELLEN: and they were using what were then called spreading activation models of memory and things, which came from psychology. And that’s interesting because not only were they modeling human cognition in this, kind of, what they called parallel distributed processing, but they operationalized it. And that’s where Hinton and others came up with the back-propagation algorithm, and that was a huge leap forward in AI. So psychologists were actually directly responsible for the wave of AI we see today. A lot of computer scientists don’t know that. A lot of machine learning people don’t know that. But so, for me, long story short, that time in my life and doing my PhD at UC San Diego led to me understanding that social science, psychology in particular, and computing should be seen as things which mutually support one another and that can lead to huge breakthroughs in how we design computers and computer algorithms and how we do computing. So that, kind of, set the path for the rest of my career. And that was 40 years ago! 

HUIZINGA: Did you have what we’ll call metacognition of that being an aha moment for you, and like, I’m going to embrace this, and this is my path forward? Or was it just, sort of, more iterative: these things interest you, you take the next step, these things interest you more, you take that step? 

SELLEN: I think it was an aha moment at certain points. Like, for example, the day that Francis Crick walked into our seminar and started talking about biologically inspired models of computing, I thought, “Ooh, there’s something big going on here!” 

HUIZINGA: Wow, yeah. 

SELLEN: Because even then I knew that he was a big deal. So I knew there was something happening that was really, really interesting. I didn’t think so much about it from the point of view of, you know, I would have a career of helping to design human-centric computing, but more, wow, there’s a breakthrough in psychology and how we understand the human mind. And I didn’t realize at that time that that was going to lead to what’s happening in AI today. 

HUIZINGA: Well, let’s talk about some of these people that were influential for you as a follow-up to the animating “big idea.” If I’m honest, Abi, my jaw dropped a little when I read your bio because it’s like a who’s who of human-centered computing and UX design. And now these people are famous. Maybe they weren’t so much at the time. But tell us about the influential people in your life, and how their ideas inspired you?

SELLEN: Yeah, sure, happy to. In fact, I’ll start with one person who is not a, sort of, HCI person, but my stepfather, John Senders, was this remarkable human being. He died three years ago at the age of 98. He worked almost to his dying day. Just an amazing man. He entered my life when I was about 13. He joined the family. And he went to Harvard. He trained with people like Skinner. He was taught by these, kind of, famous psychologists of the 20th century, and they were his friends and his colleagues, and he introduced me to a lot of them. You know, people like Danny Kahneman and, you know, Amos Tversky and Alan Baddeley, and all these people that, you know, I had learned about as an undergrad. But the main thing that John did for me was to open my eyes to how you could think about modeling humans as machines. And he really believed that. He was not only a psychologist, but he was an engineer. And he, sort of, kicked off or he was one of the founders of the field of human factors engineering. And that’s what human factors engineers do. They look at people, and they think, how can we mathematically model them? So, you know, we’d be sitting by a pool, and he’d say, “You can use information sampling to model the frequency with which somebody has to watch a baby as they go towards the pool. And it depends on their velocity and then their trajectory… !” [LAUGHTER] Or we go into a bank, and he’d say, “Abi, how would you use queuing theory to, you know, estimate the mean wait time?” Like, you know, so he got me thinking like that, and he recognized in me that I had this curiosity about the world and about people, but also, that I loved mathematics. So he was the first guy. Don Norman, I’ve already mentioned as my PhD supervisor, and I’ve said something about already how he, sort of, had this radical idea about designing computers for people. And I was fortunate to be there when the field of human-computer interaction was being born, and that was mainly down to him. And he’s just [an] incredible guy. He’s still going. He’s still working, consulting, and he wrote this famous book called The Psychology of Everyday Things, which now is, I think it’s been renamed The Design of Everyday Things, and he was really influential and been a huge supporter of mine. And then the third person I’ll mention is Bill Buxton. And … 

HUIZINGA: Yeah …  

SELLEN: Bill, Bill … 

HUIZINGA: Bill, Bill, Bill! [LAUGHTER] 

SELLEN: Yeah. I met Bill at, first, well, actually first at University of Toronto; when I was a grad student, I went up and told him his … the experiment he was describing was badly designed. And instead of, you know, brushing me off, he said, “Oh really, OK, I want to talk to you about that.” And then I met him at Apple later when I was an intern, and we just started working together. And he is, he’s just … amazing designer. Everything he does is based on, kind of, theory and deep thought. And he’s just so much fun. So I would say those three people have been big influences on me. 

HUIZINGA: Yeah. What about Marilyn Tremaine? Was she a factor in what you did? 

SELLEN: Yes, yeah, she was great. And Ron Baecker. So… 

HUIZINGA: Yeah … 

SELLEN: … after I did my PhD, I did a postdoc at Toronto in the Dynamic Graphics Project Lab. And they were building a media space, and they asked me to join them. And Marilyn and Ron and Bill were building this video telepresence media space, which was way ahead of its time.

HUIZINGA: Yeah. 

SELLEN: So I worked with all three of them, and they were great fun. 

HUIZINGA: Well, let’s talk about the research initiative AI, Cognition, and the Economy. For context, this is a global, interdisciplinary effort to explore the impact of generative AI on human cognition and thinking, work dynamics and practices, and labor markets and the economy. Now, we’ve already lined up some AICE researchers to come on the podcast and talk about specific projects, including pilot studies, workshops, and extended collaborations, but I’d like you to act as a, sort of, docent or tour guide for the initiative, writ large, and tell us why, particularly now, you think it’s important to bring this group of scientists together and what you hope to accomplish. 

SELLEN: I think it’s important now because I think there are so many extreme views out there about how AI is going to impact people. A lot of hyperbole, right. So there’s a lot of fear about, you know, jobs going away, people being replaced, robots taking over the world. And there’s a lot of enthusiasm about how, you know, we’re all going to be more productive, have more free time, how it’s going to be the answer to all our problems. And so I think there are people at either end of that conversation. And I always … I love the Helen Fielding quote … I don’t know if you know Helen Fielding. She wrote… 

HUIZINGA: Yeah, Bridget Jones’s Diary … 

SELLEN: Bridget Jones’s Diary. Yeah. [LAUGHTER] She says, “Nothing is either as bad or as good as it seems,” right. And I live by that because I think things are usually somewhere in the middle. So I’m not saying that we shouldn’t take concerns seriously about AI or be hugely optimistic about the opportunities, but rather, my view on this is that we can do research to get, kind of, line of sight into the future and what is going to happen with AI. And more than this, we should be using research to not just get line of sight but to steer the future, right. We can actually help to shape it. And especially being at Microsoft, we have a chance to do that. So what I mean here is that let’s begin by understanding first the capabilities of AI and get a good understanding of where it’s heading and the pace that it’s heading at because it’s changing so fast, right.

HUIZINGA: Mm-hmm … 

SELLEN: And then let’s do some research looking at the impact, both in the short term and the long term, about its impact on tasks, on interaction, and, most importantly for me anyway, on people. Yeah, and then we can extrapolate out how this is going to impact jobs, skills, organizations, society at large, you know. So we get this, kind of, arc that we can trace, but we do it because we do research. We don’t just rely on the hyperbole and speculation, but we actually try and do it more systematically. And then I think the last piece here is that if we’re going to do this well and if we think about what AI’s impact can be, which we think is going to impact on a global scale, we need many different skills and disciplines. We need not just machine learning people and engineering and computer scientists at large, but we need designers, we need social scientists, we need even philosophers, and we need domain experts, right. So we need to bring all of these people together to do this properly.

HUIZINGA: Interesting. Well, let’s do break it down a little bit then. And I want to ask you a couple questions about each of the disciplines within the acronym A-I-C-E, or AICE. And I’ll start with AI and another author that we can refer to. Sci-fi author and futurist Arthur C. Clarke famously said that “any sufficiently advanced technology is indistinguishable from magic,” and for many people, AI systems seem to be magic. So in response to that, many in the industry have emphatically stated that AI is just a tool. But you’ve said things like AI is more a “collaborative copilot than a mere tool,” and recently, you said we might even think of it as a “very smart and intuitive butler.” So how do those ideas from the airline industry and Downton Abbey help us better understand and position AI and its role in our world? 

SELLEN: Well, I’m going to use Wodehouse here in a minute as well, but um … so I think AI is different from many other tech developments in a number of important ways. One is, it has agency, right. So it can take initiative and do things on your behalf. It’s highly complex, and, you know, it’s getting more complex by the day. It changes. It’s dynamic. It’s probabilistic rather than deterministic, so it will give you different answers depending on when, you know, when you ask it and what you ask it. And it’s based on human-generated data. So it’s a vastly different kind of tool than HCI, as a field, has studied in the past. There are lots of downsides to that, right. One is it means it’s very hard to understand how it works under the hood, right …  

HUIZINGA: Yeah …  

SELLEN: … and understanding the output. It’s fraught with uncertainty because the output changes every time you use it. But then let’s think about the upsides, especially, large language models give us a way of conversationally interacting with AI like never before, right. So it really is a new interaction paradigm which has finally come of age. So I do think it’s going to get more personal over time and more anticipatory of our needs. And if we design it right, it can be like the perfect butler. So if you know P.G. Wodehouse, Jeeves and Wooster, you know, Jeeves knows that Bertie has had a rough night and has a hangover, so he’s there at the bedside with a tonic and a warm bath already ready for him. But he also knows what Wooster enjoys and what decisions should be left to him, and he knows when to get out of the way. He also knows when to be very discreet, right. So when I use that butler metaphor, I think about how it’s going to take time to get this right, but eventually, we may live in a world where AI helps us with good attention to privacy of getting that kind of partnership right between Jeeves and Wooster. 

HUIZINGA: Right. Do you think that’s possible? 

SELLEN: I don’t think we’ll ever get it exactly right, but if we have a conversational system where we can mutually shape the interaction, then even if Jeeves doesn’t get things right, Wooster can train him to do a better job. 

HUIZINGA: Go back to the copilot analogy, which is a huge thing at Microsoft — in fact, they’ve got products named Copilot — and the idea of a copilot, which is, sort of, assuaging our fears that it would be the pilot … 

SELLEN: Yeah …  

HUIZINGA: … AI.

SELLEN: Yeah, yeah. 

HUIZINGA: So how do we envision that in a way that … you say it’s more than a mere tool, but it’s more like a copilot? 

SELLEN: Yeah, I actually like the copilot metaphor for what you’re alluding to, which is that the pilot is the one who has the final say, who has the, kind of, oversight of everything that’s happening and can step in. And also that the copilot is there in a supportive role, who kind of trains by dint of the fact that they work next to the pilot, and that they have, you know, specialist skills that can help.  

HUIZINGA: Right …   

SELLEN: So I really like that metaphor. I think there are other metaphors that we will explore in future and which will make sense for different contexts, but I think, as a metaphor for a lot of the things we’re developing today, it makes a lot of sense. 

HUIZINGA: You know, it also feels like, in the conversation, words really matter in how people perceive what the tool is. So having these other frameworks to describe it and to implement it, I think, could be really helpful. 

SELLEN: Yes, I agree. 

[MUSIC BREAK] 

HUIZINGA: Well, let’s talk about intelligence for a second. One of the most interesting things about AI is it’s caused us to pay attention to other kinds of intelligence. As author Meghan O’Gieblyn puts it, “God, human, animal, machine … ” So why do you think, Abi, it’s important to understand the characteristics of each kind of intelligence, and how does that impact how we conceptualize, make, and use what we’re calling artificial intelligence? 

SELLEN: Yeah, well, I actually prefer the term machine intelligence to artificial intelligence … 

HUIZINGA: Me too! Thank you! [LAUGHTER] 

SELLEN: Because the latter implies that there’s one kind of intelligence, and also, it does allude to the fact that that is human-like. You know, we’re trying to imitate the human. But if you think about animals, I think that’s really interesting. I mean, many of us have good relationships with our pets, right. And we know that they have a different kind of intelligence. And it’s different from ours, but that doesn’t mean we can’t understand it to some extent, right. And if you think about … animals are superhuman in many ways, right. They can do things we can’t. So whether it’s an ox pulling a plow or a dog who can sniff out drugs or ferrets who can, you know, thread electrical cables through pipes, they can do things. And bee colonies are fascinating to me, right. And they work as a, kind of, a crowd intelligence, or hive mind, right. [LAUGHTER] That’s where that comes from. And so in so many ways, animals are smarter than humans. But it doesn’t matter—like this “smarter than” thing also bugs me. It’s about being differently intelligent, right. And the reason I think that’s important when we think about machine intelligence is that machine intelligence is differently intelligent, as well. So the conversational interface allows us to explore the nature of that machine intelligence because we can speak to it in a kind of human-like way, but that doesn’t mean that it is intelligent in the same way a human is intelligent. And in fact, we don’t really want it to be, right. 

HUIZINGA: Right … 

SELLEN: Because we want it, we want it to be a partner with us, to do things that we can’t, you know, just like using the plow and the ox. That partnership works because the ox is stronger than we are. So I think machine intelligence is a much better word, and understanding it’s not human is a good thing. I do worry that, because it sounds like a human, it can seduce us into thinking it’s a human …

HUIZINGA: Yeah … 

SELLEN: and that can be problematic. You know, there are instances where people have been on, for example, dating sites and a bot is sounding like a human and people get fooled. So I think we don’t want to go down the path of fooling people. We want to be really careful about that. 

HUIZINGA: Yeah, this idea of conflating different kinds of intelligences to our own … I think we can have a separate vision of animal intelligence, but machines are, like you say, kind of seductively built to be like us.  

SELLEN: Yeah …  

HUIZINGA: And so back to your comment about shaping how this technology moves forward and the psychology of it, how might we envision how we could shape, either through language or the way these machines operate, that we build in a “I’m not going to fool you” mechanism? 

SELLEN: Well, I mean, there are things that we do at the, kind of, technical level in terms of guardrails and metaprompts, and we have guidelines around that. But there’s also the language that an AI character will use in terms of, you know, expressing thoughts and feelings and some suggestion of an inner life, which … these machines don’t have an inner life, right. 

HUIZINGA: Right! 

SELLEN: So … and one of the reasons we talk to people is we want to discover something about their inner life. 

HUIZINGA: Yessss … 

SELLEN: And so why would I talk to a machine to try and discover that? So I think there are things that we can do in terms of how we design these systems so that they’re not trying to deceive us. Unless we want them to deceive us. So if we want to be entertained or immersed, maybe that’s a good thing, right? That they deceive us. But we enter into that knowing that that’s what’s happening, and I think that’s the difference.

HUIZINGA: Well, let’s talk about the C in A-I-C-E, which is cognition. And we’ve just talked about other kinds of intelligence. Let’s broaden the conversation and talk about the impact of AI on humans themselves. Is there any evidence to indicate that machine intelligence actually has an impact on human intelligence, and if so, why is that an important data point? 

SELLEN: Yeah, OK, great topic. This is one of my favorite topics. [LAUGHTER] So, well, let me just backtrack a little bit for a minute. A lot of the work that’s coming out today looking at the impact of AI on people is in terms of their productivity, in terms of how fast they can do something, how efficiently they can do a job, or the quality of the output of the tasks. And I do think that’s important to understand because, you know, as we deploy these new tools in peoples’ hands, we want to know what’s happening in terms of, you know, peoples’ productivity, workflow, and so on. But there’s far less of it on looking at the impact of using AI on people themselves and on how people think, on their cognitive processes, and how are these changing over time? Are they growing? Are they atrophying as they use them? And, relatedly, what’s happening to our skills? You know, over time, what’s going to be valued, and what’s going to drop away? And I think that’s important for all kinds of reasons. So if you think about generative AI, right, these are these AI systems that will write something for us or make a slide deck or a picture or a video. What they’re doing is they are taking the cognitive work of generation of an artifact or the effort of self-expression that most of us, in the old-fashioned world, will do, right—we write something, we make something—they’re doing that for us on our behalf. And so our job then is to think about how do we specify our intention to the machine, how do we talk to it to get it to do the things we want, and then how do we evaluate the output at the end? So it’s really radically shifting what we do, the work that we do, the cognitive and mental work that we do, when we engage with these tools. Now why is that a problem? Or should it be a problem? One concern is that many of us think and structure our thoughts through the process of making things, right. Through the process of writing or making something. So a big question for me is, if we’re removed from that process, how deeply will we learn or understand what we’re writing about? A second one is, you know, if we’re not deeply engaged in the process of generating these things, does that actually undermine our ability to evaluate the output when we do get presented with it?  

HUIZINGA: Right … 

SELLEN: Like, if it writes something for us and it’s full of problems and errors, if we stop writing for ourselves, are we going to be worse at, kind of, judging the output? Another one is, as we hand things over to more and more of these automated processes, will we start to blindly accept or over-rely on our AI assistants, right. And the aviation industry has known that for years … 

HUIZINGA: Yeah … 

SELLEN: … which is why they stick pilots in simulators. Because they rely on autopilot so much that they forget those key skills. And then another one is, kind of, longer term, which is like these new generations of people who are going to grow up with this technology, what are the fundamental skills that they’re going to need to not just to use the AI but to be kind of citizens of the world and also be able to judge the output of these AI systems? So the calculator, right, is a great example. When it was first introduced, there was a huge outcry around, you know, kids won’t be able to do math anymore! Or we don’t need to teach it anymore. Well, we do still teach it because when you use a calculator, you need to be able to see whether or not the output the machine is giving you is in the right ballpark, right.

HUIZINGA: Right … 

SELLEN: You need to know the basics. And so what are the basics that kids are going to need to know? We just don’t have the answer to those questions. And then the last thing I’ll say on this, because I could go on for a long time, is we also know that there are changes in the brain when we use these new technologies. There are shifts in our cognitive skills, you know, things get better and things do deteriorate. So I think Susan Greenfield is famous for her work looking at what happens to the neural pathways in the age of the internet, for example. So she found that all the studies were pointing to the fact that reading online and on the internet meant that our visual-spatial skills were being boosted, but our capacity to do deep processing, mindful knowledge acquisition, critical thinking, reflection, were all decreasing over time. And I think any parent who has a teenager will know that focus of attention, flitting from one thing to another, multitasking, is, sort of, the order of the day. Well, not just for teenagers. I think all of us are suffering from this now. It’s much harder. I find it much harder to sit down and read something in a long, focused way … 

HUIZINGA: Yeah …  

SELLEN: … than I used to. So all of this long-winded answer is to say, we don’t understand what the impact of these new AI systems is going to be. We need to do research to understand it. And we need to do that research both looking at short-term impacts and long-term impacts. Not to say that this is all going to be bad, but we need to understand where it’s going so we can design around it. 

HUIZINGA: You know, even as you asked each of those questions, Abi, I found myself answering it preemptively, “Yes. That’s going to happen. That’s going to happen.” [LAUGHS] And so even as you say all of this and you say we need research, do you already have some thinking about, you know, if research tells us the answer that we thought might be true already, do we have a plan in place or a thought process in place to address it? 

SELLEN: Well, yes, and I think we’ve got some really exciting research going on in the company right now and in the AICE program, and I’m hoping your future guests will be able to talk more in-depth about these things. But we are looking at things like the impact of AI on writing, on comprehension, on mathematical abilities. But more than that. Not just understanding the impact on these skills and abilities, but how can we design systems better to help people think better, right?  

HUIZINGA: Yeah … 

SELLEN: To help them think more deeply, more creatively. I don’t think AI needs to necessarily de-skill us in the critical skills that we want and need. It can actually help us if we design them properly. And so that’s the other part of what we’re doing. It’s not just understanding the impact, but now saying, OK, now that we understand what’s going on, how do we design these systems better to help people deepen their skills, change the way that they think in ways that they want to change—in being more creative, thinking more deeply, you know, reading in different ways, understanding the world in different ways. 

HUIZINGA: Right. Well, that is a brilliant segue into my next question. Because we’re on the last letter, E, in AICE: the economy. And that I think instills a lot of fear in people. To cite another author, since we’re on a citing authors roll, Clay Shirky, in his book Here Comes Everybody, writes about technical revolutions in general and the impact they have on existing economic paradigms. And he says, “Real revolutions don’t involve an orderly transition from point A to point B. Rather, they go from A, through a long period of chaos, and only then reach B. And in that chaotic period the old systems get broken long before the new ones become stable.” Let’s take Shirky’s idea and apply it to generative AI. If B equals the future of work, what’s getting broken in the period of transition from how things were to how things are going to be, what do we have to look forward to, and how do we progress toward B in a way that minimizes chaos? 

SELLEN: Hmm … oh, those are big questions! [LAUGHS] 

HUIZINGA: Too many questions! [LAUGHS] 

SELLEN: Yeah, well, I mean, Shirky was right. Things take a long time to bed in, right. And much of what happens over time, I don’t think we can actually predict. You know, so who would have predicted echo chambers or the rise of deepfakes or, you know, the way social media could start revolutions in those early days of social media, right. So good and bad things happen, and a lot of it’s because it rolls out over time, it scales up, and then people get involved. And that’s the really unpredictable bit, is when people get involved en masse. I think we’re going to see the same thing with AI systems. They are going to take a long time to bed in, and its impact is going to be global, and it’s going to take a long time to unfold. So I think what we can do is, to some extent, we can see the glimmerings of what’s going to happen, right. So I think it’s the William Gibson quote is, you know, “The future’s already here; it’s just not evenly distributed,” or something like that, right. We can see some of the problems that are playing out, both in the hands of bad actors and things that will happen unintentionally. We can see those, and we can design for them, and we can do things about it because we are alert and we are looking to see what happens. And also, the good things, right. And all the good things that are playing out, … 

HUIZINGA: Yeah …  

SELLEN: we can make the most of those. Other things we can do is, you know, at Microsoft, we have a set of responsible AI principles that we make sure all our products go through to make sure that we look into the future as much as we can, consider what the consequences might be, and then deploy things in very careful steps, evaluating as we go. And then, coming back to what I said earlier, doing deep research to try and get a better line of sight. So in terms of what’s going to happen with the future of work, I think, again, we need to steer it. Some of the things I talked about earlier in terms of making sure we build skills rather than undermine them, making sure we don’t over automate, making sure that we put agency in the hands of people. And always making sure that we design our AI experiences with human hope, aspirations, and needs in mind. If we do that, I think we’re on a good track, but we should always be vigilant, you know, to what’s evolving, what’s happening here.  

HUIZINGA: Yeah …

SELLEN: I can’t really predict whether we’re headed for chaos or not. I don’t think we are, as long as we’re mindful. 

HUIZINGA: Yeah. And it sounds like there’s a lot more involved outside of computer science, in terms of support systems and education and communication, to acclimatize people to a new kind of economy, which like you say, you can’t … I’m shocked that you can’t predict it, Abi. I was expecting that you could, but … [LAUGHTER] 

SELLEN: Sorry. 

HUIZINGA: Sorry! But yeah, I mean, do you see the ancillary industries, we’ll call them, in on this? And how can, you know, sort of, a lab in Cambridge, and labs around the world that are doing AI, how can they spread out to incorporate these other things to help the people who know nothing about what’s going on in your lab move forward here? 

SELLEN: Well, I think, you know, there are lots of people that we need to talk to and to take account of. The word stakeholder … I hate that word stakeholder! I’m not sure why. [LAUGHTER] But anyway, stakeholders in this whole AI odyssey that we’re on … you know, public perceptions are one thing. I’m a member of a lot of societies where we do a lot of outreach and talks about AI and what’s going on, and I think that’s really, really important. And get people excited also about the possibilities of what could happen.  

HUIZINGA: Yeah …  

SELLEN: Because I think a lot of the media, a lot of the stories that get out there are very dystopian and scary, and it’s right that we are concerned and we are alert to possibilities, but I don’t think it does anybody any good to make people scared or anxious. And so I think there’s a lot we can do with the public. And there’s a lot we can do with, when I think about the future of work, different domains, you know, and talking to them about their needs and how they see AI fitting into their particular work processes. 

HUIZINGA: So, Abi, we’re kind of [LAUGHS] dancing around these dystopian narratives, and whether they’re right or wrong, they have gained traction. So it’s about now that I ask all of my guests what could go wrong if you got everything right? So maybe you could present, in this area, some more hopeful, we’ll call them “-topias,” or preferred futures, if you will, around AI and how you and/or your lab and other people in the industry are preparing for them. 

SELLEN: Well, again, I come back to the idea that the future is all around us to some extent, and we’re seeing really amazing breakthroughs, right, with AI. For example, scientific breakthroughs in terms of, you know, drug discovery, new materials to help tackle climate change, all kinds of things that are going to help us tackle some of the world’s biggest problems. Better understandings of the natural world, right, and how interventions can help us. New tools in the hands of low-literacy populations and support for, you know, different ways of working in different cultures. I think that’s another big area in which AI can help us. Personalization—personalized medicine, personalized tutoring systems, right. So we talked about education earlier. I think that AI could do a lot if we design it right to really help in education and help support people’s learning processes. So I think there’s a lot here, and there’s a lot of excitement—with good reason. Because we’re already seeing these things happening. And we should bear those things in mind when we start to get anxious about AI. And I personally am really, really excited about it. I’m excited about, you know, what the company I work for is doing in this area and other companies around the world. I think that it’s really going to help us in the long term, build new skills, see the world in new ways, you know, tackle some of these big problems. 

HUIZINGA: I recently saw an ad—I’m not making this up—it was the quote-unquote “productivity app,” and it was simply a small wooden box filled with pieces of paper. And there was a young man who had a how-to video on how to use it on YouTube. [LAUGHS] He was clearly born into the digital age and found writing lists on paper to be a revolutionary idea. But I myself have toggled back and forth between what we’ll call the affordances of the digital world and the familiarity and comfort of the physical world. And you actually studied this and wrote about it in a book called The Myth of the Paperless Office. That was 20 years ago. Why did you do the work then, what’s changed in the ensuing years, and why in the age of AI do I love paper so much?

SELLEN: Yeah, so, that was quite a while ago now. It was a book that I cowrote with my husband. He’s a sociologist, so we, sort of, came together on that book, me as a psychologist and he as a sociologist. What we were responding to at the time was a lot of hype about the paperless office and the paperless future. At the time, I was working at EuroPARC, you know, which is the European sister lab of Xerox PARC. And so, obviously, they had big investment in this. And there were many people in that lab who really believed in the paperless office, and lots of great inventions came out of the fact that people were pursuing that vision. So that was a good side of that, but we also saw where things could go horribly wrong when you just took a paper-based system away and you just replaced it with a digital system.  

HUIZINGA: Yeah … 

SELLEN: I remember some of the disasters in air traffic control, for example, when they took the paper flight strips away and just made them all digital. And those are places where you don’t want to mess around with something that works. 

HUIZINGA: Right. 

SELLEN: You have to be really careful about how you introduce digital systems. Likewise, many people remember things that went wrong when hospitals tried to go paperless with health records being paperless. Now, those things are digital now, but we were talking about chaos earlier. There was a lot of chaos on the path. So what we’ve tried to say in that book to some extent is, let’s understand the work that paper is doing in these different work contexts and the affordances of paper. You know, what is it doing for people? Anything from, you know, I hand a document over to someone else; a physical document gives me the excuse to talk to that person …  

HUIZINGA: Right… 

SELLEN: … through to, you know, when I place a document on somebody’s desk, other people in the workplace can see that I’ve passed it on to someone else. Those kind of nuanced observations are useful because you then need to think, how’s the digital system going to replace that? Not in the same way, but it’s got to do the same job, right. So you need to talk to people, you need to understand the context of their work, and then you need to carefully plan out how you’re going to make the transition. So if we just try to inject AI into workflows or totally replace parts of workflows with AI without a really deep understanding of how that work is currently done, what the workers get from it, what is the value that the workers bring to that process, we could go through that chaos. And so it’s really important to get social scientists involved in this and good designers, and that’s where the, kind of, multidisciplinary thing really comes into its own. That’s where it’s really, really valuable. 

HUIZINGA: Yeah … You know, it feels super important, that book, about a different thing, how it applies now and how you can take lessons from that arc to what you’re talking about with AI. I feel like people should go back and read that book. 

SELLEN: I wouldn’t object! [LAUGHTER] 

[MUSIC BREAK] 

HUIZINGA: Let’s talk about some research ideas that are on the horizon. Lots of research is basically just incremental building on what’s been done before, but there are always those moonshot ideas that seem outrageous at first. Now, you’re a scientist and an inventor yourself, and you’re also a lab director, so you’ve seen a lot of ideas over the years. [LAUGHS] You’ve probably had a lot of ideas. Have any of them been outrageous in your mind? And if so, what was the most outrageous, and how did it work out? 

SELLEN: OK, well, I’m a little reluctant to say this one, but I always believed that the dream of AI was outrageous. [LAUGHTER] So, you know, going back to those early days when, you know, I was a psychologist in the ’80s and seeing those early expert systems that were being built back then and trying to codify and articulate expert knowledge into machines to make them artificially intelligent, it just seemed like they were on a road to nowhere. I didn’t really believe in the whole vision of AI for many, many years. I think that when deep learning, that whole revolution’s kicked off, I never saw where it was heading. So I am, to this day, amazed by what these systems can do and never believed that these things would be possible. And so I was a skeptic, and I am no longer a skeptic, [LAUGHTER] with a proviso of everything else I’ve said before, but I thought it was an outrageous idea that these systems would be capable of what they’re now capable of. 

HUIZINGA: You know, that’s funny because, going back to what you said earlier about your stepdad walking you around and asking you how you’d codify a human into a machine … was that just outrageous to you, or is that just part of the exploratory mode that your stepdad, kind of, brought you into? 

SELLEN: Well, so, back then I was quite young, and I was willing to believe him, and I, sort of, signed up to that. But later, especially when I met my husband, a sociologist, I realized that I didn’t agree with any of that at all. [LAUGHTER] So we had great, I’ll say, “energetic” discussions with my stepdad after that, which was fun.  

HUIZINGA: I bet.  

SELLEN: But yeah, but so, it was how I used to think and then I went through this long period of really rejecting all of that. And part of that was, you know, seeing these AI systems really struggle and fail. And now here we are today. So yeah. 

HUIZINGA: Yeah, I just had Rafah Hosn on the podcast and when we were talking about this “outrageous ideas” question, she said, “Well, I don’t really see much that’s outrageous.” And I said, “Wait a minute! You’re living in outrageous! You are in AI Frontiers at Microsoft Research.” Maybe it’s just because it’s so outrageous that it’s become normal?

SELLEN: Yeah … 

HUIZINGA: And yeah, well … Well, finally, Abi, your mentor and adviser, Don Norman … you referred to a book that he wrote, and I know it as The Design of Everyday Things, and in it he wrote this: “Design is really an act of communication, which means having a deep understanding of the person with whom the designer is communicating.” So as we close, I’d love it if you’d speak to this statement in the context of AI, Cognition, and the Economy. How might we see the design of AI systems as an act of communication with people, and how do we get to a place where an understanding of deeply human qualities plays a larger role in informing these ideas, and ultimately the products, that emerge from a lab like yours? 

SELLEN: So this is absolutely critical to getting AI development and design right. It’s deeply understanding people and what they need, what their aspirations are, what human values are we designing for. You know, I would say that as a social scientist, but I also believe that most of the technologists and computer scientists and machine learning people that I interact with on a daily basis also believe that. And that’s one thing that I love about the lab that I’m a part of, is that it’s very interdisciplinary. We’re always putting the, kind of, human-centric spin on things. And, you know, Don was right. And that’s what he’s been all about through his career. We really need to understand, who are we designing this technology for? Ultimately, it’s for people; it’s for society; it’s for the, you know, it’s for the common good. And so that’s what we’re all about. Also, I’m really excited to say we are becoming, as an organization, much more globally distributed. Just recently taken on a lab in Nairobi. And the cultural differences and the differences in different countries casts a whole new light on how these technologies might be used. And so I think that it’s not just about understanding different people’s needs but different cultures and different parts of the world and how this is all going to play out on a global scale. 

HUIZINGA: Yeah … So just to, kind of, put a cap on it, when I said the term “deeply human qualities,” what I’m thinking about is the way we collaborate and work as a team with other people, having empathy and compassion, being innovative and creative, and seeking well-being and prosperity. Those are qualities that I have a hard time superimposing onto or into a machine. Do you think that AI can help us? 

SELLEN: Yeah, I think all of these things that you just named are things which, as you say, are deeply human, and they are the aspects of our relationship with technology that we want to not only protect and preserve but support and amplify. And I think there are many examples I’ve seen in development and coming out which have that in mind, which seek to augment those different aspects of human nature. And that’s exciting. And we always need to keep that in mind as we design these new technologies. 

HUIZINGA: Yeah. Well, Abi Sellen, I’d love to stay and chat with you for another couple hours, but how fun to have you on the show. Thanks for joining us today on Ideas.

SELLEN: It’s been great. I really enjoyed it. Thank you.

[MUSIC]


What’s Your Story: Jacki O’Neill

What’s Your Story: Jacki O’Neill

Circle photo of Jacki O'Neill, director of the Microsoft Africa Research Institute (MARI), with a microphone in the corner on a blue and green gradient background

In the Microsoft Research Podcast series What’s Your Story, Johannes Gehrke explores the who behind the technical and scientific advancements helping to reshape the world. A systems expert whose 10 years with Microsoft spans research and product, Gehrke talks to members of the company’s research community about what motivates their work and how they got where they are today.

In this episode, Gehrke is joined by Jacki O’Neill, director of Microsoft Research Africa, Nairobi (formerly the Microsoft Africa Research Institute, or MARI) in Kenya. O’Neill pitched the idea for the lab after seeing an opportunity to expand the Microsoft research portfolio. She shares how a desire to build tech that can have global societal impact and a familial connection to the continent factored into the decision; how a belief that life is meant to be exciting has allowed her to take big personal and professional swings; and how her team in Nairobi is applying their respective expertise in human-computer interaction, machine learning, and data science to pursue globally equitable AI.

To learn more about the global impact of AI, efforts to make AI more equitable, and related topics, register for Microsoft Research Forum (opens in new tab), a series of panel discussions and lightning talks around science and technology research in the era of general AI.

Photos of Jacki O'Neill, director of the Microsoft Africa Research Institute (MARI), throughout her life.


Research Focus: Week of May 13, 2024

Research Focus: Week of May 13, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Research Focus: May 13, 2024

Injecting New Knowledge into Large Language Models via Supervised Fine-Tuning 

Large language models (LLMs) have shown remarkable performance in generating text similar to that created by people, proving to be a valuable asset across various applications. However, adapting these models to incorporate new, out-of-domain knowledge remains a challenge, particularly for facts and events that occur after the model’s knowledge cutoff date.

In a recent paper: Injecting New Knowledge into Large Language Models via Supervised Fine-Tuning, researchers from Microsoft investigate the effectiveness of supervised fine-tuning (SFT) as a method for knowledge injection in LLMs, specifically focusing on recent sporting events. They compare different dataset generation strategies—token-based and fact-based scaling—to create training data that helps the model learn new information. Their experiments on GPT-4 demonstrate that while token-based scaling can lead to improvements in Q&A accuracy, it may not provide uniform coverage of new knowledge. Fact-based scaling, on the other hand, offers a more systematic approach to ensure even coverage across all facts. The researchers present a novel dataset generation process that leads to more effective knowledge ingestion through SFT, and results show considerable performance improvements in Q&A tasks related to out-of-domain knowledge. 
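As a rough illustration of the difference between the two strategies, consider the sketch below, in which a stubbed paraphrase function stands in for LLM-based example generation; the paper's actual prompts, data, and filtering steps are not reproduced here.

```python
# A minimal sketch contrasting token-based and fact-based dataset
# scaling for knowledge-injection SFT; paraphrase() stubs an LLM call.
import random

def paraphrase(fact: str, i: int) -> str:
    """Stand-in for an LLM paraphrase of a fact into a training example."""
    return f"(variant {i}) {fact}"

facts = [
    "Team X won the 2023 final.",
    "Player Y was named most valuable player.",
]

# Token-based scaling: keep adding text until a token budget is met,
# which can over-sample whichever facts happen to dominate the sampling.
token_budget, token_scaled = 40, []
while sum(len(s.split()) for s in token_scaled) < token_budget:
    token_scaled.append(paraphrase(random.choice(facts), len(token_scaled)))

# Fact-based scaling: a fixed number of variants per fact, so every
# fact gets even coverage in the fine-tuning set.
fact_scaled = [paraphrase(f, i) for f in facts for i in range(3)]

print(len(token_scaled), "token-scaled examples (coverage may be uneven)")
print(len(fact_scaled), "fact-scaled examples (even per fact)")
```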


A Reflection on Human-Notebook Experiences in the Era of AI

Computational notebooks provide an interactive way to work with data. They have been widely used by data professionals to write code, explore data, and generate visualizations, all in one document. Previous research has revealed unique pain points around the user experience in computational notebooks. However, as AI tools like ChatGPT or Copilot have emerged, it is unclear whether these pain points have been reduced or changed, or whether new pain points have arisen. Given the fast pace of advances in AI technology, the development of new AI tools has been driven primarily by the technology itself rather than by user experience.

In a recent paper: A Reflection on Human-Notebook Experiences in the Era of AI, researchers from Microsoft summarize literature on how new AI technology has impacted human-notebook interaction and human-computer interaction (HCI) paradigms, new challenges and user behavior around using AI assistants, and recent research on AI assistants in computational notebook scenarios. They outline gaps in existing literature and suggest a future focus on improving macro human-notebook experiences throughout a user’s workflow, measuring and quantifying the value of AI systems, and establishing a set of standards and best practices for AI tools.

Microsoft Research Podcast

Collaborators: Renewable energy storage with Bichlien Nguyen and David Kwabi

Dr. Bichlien Nguyen and Dr. David Kwabi explore their work in flow batteries and how machine learning can help more effectively search the vast organic chemistry space to identify compounds with properties just right for storing waterpower and other renewables.


Jacdac: Service-Based Prototyping of Embedded Systems

The traditional approach to programming embedded systems is monolithic: firmware on a microcontroller contains both application code and the drivers needed to communicate with sensors and actuators, using low-level protocols such as I2C, SPI, and RS232. In comparison, software development for the cloud has moved to a service-based development and operation paradigm: a service provides a discrete unit of functionality that can be accessed remotely by an application, or other service, but is independently managed and updated.

In a recent paper: Jacdac: Service-Based Prototyping of Embedded Systems (opens in new tab), researchers from Microsoft propose, design, implement, and evaluate a service-based approach to prototyping embedded systems called Jacdac (opens in new tab). Jacdac defines a service specification language, designed especially for embedded systems, along with a host of specifications for a variety of sensors and actuators. With Jacdac, each sensor/actuator in a system is paired with a low-cost microcontroller that advertises the services that represent the functionality of the underlying hardware over an efficient and low-cost single-wire bus protocol. A separate microcontroller executes the user’s application program, which is a client of the Jacdac services on the bus. 

Three Jacdac kits, comprising over twenty modules, have been produced by third-party manufacturers: KittenBot (opens in new tab) and Forward Education (opens in new tab).
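The service/client split can be illustrated with a small simulation. This is not the real Jacdac API: the bus, service names, and handlers below are hypothetical stand-ins for the single-wire bus protocol and service specifications described in the paper.

```python
# An illustrative sketch of service-based embedded prototyping; every
# name here is hypothetical, and a Python object simulates the bus.
class Bus:
    """Stand-in for the shared single-wire bus."""
    def __init__(self):
        self.services = {}

    def advertise(self, name, handler):
        self.services[name] = handler          # module announces a service

    def request(self, name, *args):
        return self.services[name](*args)      # client calls it by name

# Sensor module: owns the hardware driver and exposes it as a service.
def temperature_service():
    return 21.5  # a real module would read the physical sensor here

bus = Bus()
bus.advertise("temperature", temperature_service)

# Application microcontroller: a client of services on the bus; it
# never touches I2C/SPI registers or low-level drivers directly.
reading = bus.request("temperature")
print(f"temperature: {reading} C")
```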


PARIKSHA: A Scalable, Democratic, Transparent Evaluation Platform for Assessing Indic Large Language Models

Evaluation of multilingual LLMs is challenging for several reasons: the lack of benchmarks with sufficient linguistic diversity, contamination of popular benchmarks in LLM pre-training data, and the lack of local, cultural nuances in translated benchmarks. Hence, it is difficult to evaluate LLMs extensively in a multilingual setting, leading to a lack of fair comparisons between models and difficulties in replicating the evaluation setup used by some models. Recently, several Indic (Indian-language) LLMs have been created to help build more locally and culturally relevant LLMs.

In a recent paper: PARIKSHA: A Scalable, Democratic, Transparent Evaluation Platform for Assessing Indic Large Language Models, researchers from Microsoft present the first comprehensive evaluation of Indic LLMs, using a combination of human and LLM-based evaluation. The researchers conduct a total of 90,000 human evaluations and 50,000 LLM-based evaluations of 29 models to present leaderboards for 10 Indic languages. Pariksha provides inclusive evaluation by engaging a community of workers who represent India’s large and diverse workforce, and it also serves as a research platform for improving the evaluation process itself. For transparency, the evaluation artifacts will be released. By conducting Pariksha at regular intervals, the researchers aim to enable models to improve over time with insights and artifacts from these evaluations.
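While Pariksha's exact aggregation method isn't detailed here, one common way to turn pairwise judgments (human or LLM-based) into per-language leaderboards is a simple win-rate computation, sketched below with made-up judgment data; this is an assumption about the general technique, not the platform's implementation.

```python
# A minimal sketch of aggregating pairwise judgments into per-language
# leaderboards via win rates; the judgment data here is fabricated for
# illustration only.
from collections import defaultdict

# (language, model_a, model_b, winner) pairwise judgments.
judgments = [
    ("hi", "model_A", "model_B", "model_A"),
    ("hi", "model_A", "model_C", "model_C"),
    ("hi", "model_B", "model_C", "model_C"),
]

wins = defaultdict(int)
games = defaultdict(int)
for lang, a, b, winner in judgments:
    for m in (a, b):
        games[(lang, m)] += 1      # each model in the pair played a game
    wins[(lang, winner)] += 1      # credit the judged winner

leaderboard = sorted(
    ((lang, m, wins[(lang, m)] / n) for (lang, m), n in games.items()),
    key=lambda row: row[2],
    reverse=True,
)
for lang, model, rate in leaderboard:
    print(f"{lang}  {model}  win-rate={rate:.2f}")
```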


Tinker, Tailor, Configure, Customize: The Articulation Work of Customizing AI Fairness Checklists

Many responsible AI resources, such as toolkits, playbooks, and checklists, have been developed to support AI practitioners in identifying, measuring, and mitigating potential fairness-related harms. These resources are often designed to be general purpose so that they can address a variety of use cases, domains, and deployment contexts. However, this generality can lead to decontextualization: the resources end up lacking the relevance or specificity that practitioners need to put them to use.

To understand how AI practitioners might contextualize one such resource, an AI fairness checklist, for their particular use cases, domains, and deployment contexts, researchers from Microsoft conducted a retrospective contextual inquiry with 13 AI practitioners from seven organizations. In a recent paper: Tinker, Tailor, Configure, Customize: The Articulation Work of Customizing AI Fairness Checklists, they show how contextualizing this checklist introduces new forms of work for AI practitioners and other stakeholders, while opening up new sites for negotiation and contestation of values in AI. The researchers also find that the contextualization process may help AI practitioners develop a shared language around AI fairness, and they surface dynamics related to ownership over this process that suggest larger issues of accountability in responsible AI work.


MS MARCO Web Search: A Large-scale Information-rich Web Dataset with Millions of Real Click Labels

LLMs are becoming indispensable tools for many creative and information-related tasks, but they still come with limitations, including a tendency to fabricate content. State-of-the-art algorithms pair the LLM with an external, dynamically updated knowledge base to ground the LLM’s answers and provide up-to-date information. However, these techniques require large amounts of relevant, labeled training data that have not previously been publicly available.

In a recent paper: MS MARCO Web Search: A Large-scale Information-rich Web Dataset with Millions of Real Click Labels, presented at the 2024 ACM Web Conference, researchers from Microsoft introduce a novel dataset that closely mimics real-world web document and query distributions. MS MARCO Web Search contains 10 million unique queries across 93 languages with millions of relevant labeled query-document pairs. It uses ClueWeb22’s 10 billion high-quality web pages as the document corpus and provides rich information for various kinds of downstream tasks.

This dataset unlocks several new research directions that previous datasets could not adequately support, including generic end-to-end neural indexer models, generic embedding models, and next-generation information access systems with LLMs. MS MARCO Web Search offers a retrieval benchmark with three web-scale retrieval challenge tasks, each with automatic evaluation and a leaderboard. These tasks demand innovation in both machine learning and information retrieval systems. The researchers intend for MS MARCO Web Search to lay the groundwork for future advancements in AI and systems research.
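
To give a sense of what automatic evaluation looks like for a retrieval task, the sketch below computes mean reciprocal rank (MRR@10), the metric long associated with MS MARCO retrieval benchmarks. The data layout here is a simplified assumption for illustration, not the dataset’s actual schema.

```python
# Mean reciprocal rank at cutoff k, a standard retrieval-benchmark metric.
def mrr_at_k(rankings: dict, qrels: dict, k: int = 10) -> float:
    """rankings: query_id -> ranked list of doc_ids (best first).
    qrels: query_id -> set of doc_ids labeled relevant (e.g. clicked)."""
    total = 0.0
    for qid, ranked_docs in rankings.items():
        relevant = qrels.get(qid, set())
        for rank, doc_id in enumerate(ranked_docs[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / max(len(rankings), 1)


# Toy example: the system finds q1's relevant doc at rank 2 and q2's at
# rank 1, so MRR@10 = (1/2 + 1/1) / 2 = 0.75.
rankings = {"q1": ["d9", "d3", "d7"], "q2": ["d4", "d1"]}
qrels = {"q1": {"d3"}, "q2": {"d4"}}
print(mrr_at_k(rankings, qrels))  # 0.75
```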


AI Case Studies for Natural Science Research with Bonnie Kruft

Among the stunning changes and disruptions driven by AI, one of the most significant is the impact on scientific discovery. In her presentation at EmTech Digital 2024 (opens in new tab), Bonnie Kruft, partner deputy director at Microsoft Research AI for Science, outlined some examples of how generative AI enables groundbreaking research in the natural sciences. Recent breakthroughs aided by AI include small-molecule inhibitors for treating infectious disease, the discovery of new materials for energy storage, and new drug development.

Catch a replay of the presentation, including a follow-up Q&A with the audience, and hear how researchers are reducing discovery times from years to months. The discussion explores safe and responsible AI practices, how large language models can work with science-based models, and what lies ahead for AI in science. 

Microsoft Research in the news


The tiny glass blocks that can preserve your data for centuries 

The Times UK | April 27, 2024

Microsoft’s Project Silica is an innovative form of long-term storage that could revolutionize how important data is preserved for future generations.


These Recyclable Circuit Boards Could Stem E-Waste 

IEEE Spectrum | May 2, 2024

New research from the University of Washington and Microsoft shows that vitrimer-based PCBs can be broken down into a gel for repeated reuse. The research stems from the Microsoft Research Climate Initiative.


Today’s AI models are impressive. Teams of them will be formidable 

The Economist | May 13, 2024

Teams of LLMs are more capable and intelligent than solitary agents because a single job can be split into many smaller, more specialized tasks, says Chi Wang, a principal researcher at Microsoft Research in Redmond, Washington.


You Only Cache Once: Decoder-Decoder Architectures for Language Models 

Microsoft Research LinkedIn | May 11, 2024

YOCO is a novel decoder-decoder architecture for LLMs that enhances memory efficiency by caching key-value pairs only once. It slashes KV cache memory and prefilling time, making LLMs with 1M-token context lengths practical.


Peter Lee discusses new technologies that will drive the future of drug discovery 

AAPS | May 10, 2024

In the closing plenary of the AAPS National Biotechnology Conference (NBC) on Thursday, May 16, the president of Microsoft Research explores how new advances in technologies such as AI and machine learning are transforming biotechnology.


PKSHA develops advanced LLMs in collaboration with Microsoft Japan 

Business Wire | April 29, 2024

PKSHA Technology has developed one of the first Japanese-English LLMs in collaboration with Microsoft Japan. This development primarily focuses on boosting productivity within contact centers and corporate help desks.


BRAID fellowships include three collaborations with Microsoft Research 

Bridging Responsible AI Divides | May 2024

BRAID fellowships support individual researchers in partnership with public and private organizations to address challenges in the field of responsible AI. Among the latest fellowships are three supported by Microsoft Research.
