Automatic post-deployment management of cloud applications

SelfTune interaction with Client (Developer Machine) into Data Store (Azure ML Workspace)

Cloud Intelligence/AIOps blog series

In the first two blog posts in this series, we presented our vision for Cloud Intelligence/AIOps (AIOps) research, and scenarios where innovations in AI technologies can help build and operate complex cloud platforms and services effectively and efficiently at scale. In this blog post, we dive deeper into our efforts to automatically manage large-scale cloud services in deployment. In particular, we focus on an important post-deployment cloud management task that is pervasive across cloud services: tuning configuration parameters. We also discuss SelfTune, a horizontal reinforcement learning (RL) platform for managing the configuration of various cloud services in deployment.

Post-deployment management of cloud applications

Managing cloud applications includes mission-critical tasks such as resource allocation, scheduling, pre-provisioning, capacity planning and provisioning, and autoscaling. Currently, several of these tasks rely on hand-tuned and manually designed algorithms, heuristics, and domain knowledge. For a large cloud company like Microsoft, a hand-tuned, manually designed algorithm works well only to a certain extent, because deployments are extremely varied, large-scale, and involve complex interactions of various components. Moreover, user, customer, and application behavior can change over time, making yesterday’s hand-tuning not as relevant today and even less so in the future. The varied nature of today’s cloud technologies forces our engineers to spend an inordinate amount of time on special casing, introducing new configuration parameters, and writing or rewriting heuristics to set them appropriately. This also creates a lot of undocumented domain knowledge and dependence on a few individuals to solve significant problems. All of this, we believe, is unsustainable in the long term.

As we discussed in the earlier posts in this blog series, the right AI/ML formulations and techniques could help to alleviate this problem. Specifically, cloud management tasks are a natural fit for adopting the reinforcement learning paradigm. These tasks are repetitive in space and time; they run simultaneously on multiple machines, clusters, datacenters, and/or regions, and they run once every hour, day, week, or month. For instance, the VM pre-provisioning service for Azure Functions is a continuously running process, pre-provisioning for every application. Scheduling of background jobs on substrate runs separately on every machine. Reinforcement learning also needs a repetitive and iterative platform to converge on an optimized setup and, hence, can go together with the basic functioning of the cloud management task.

Our goal is to reduce manual effort in ensuring service efficiency, performance, and reliability by augmenting, complementing, or replacing existing heuristics for various management tasks with general RL-based solutions. In this blog post, we present our recent solution frameworks for cloud applications, to automatically tune their configuration parameters and to design policies for managing the parameters over time. Our solutions require minimal engineering effort and no AI expertise from the application developers or cloud operators.

Example Microsoft scenarios

O365 Workload Manager: Workload Manager (WLM) is a process that runs on each of the backend Exchange Online (EXO) servers to help schedule resources (CPU, disk, network) to background jobs that periodically execute. WLM has several configuration parameters that need to be carefully set so that the throughput of the scheduler is maximized while also ensuring that the resources are not too strained to execute low-latency user-facing jobs (e.g., Outlook search). Could we help EXO infrastructure manage the various knobs that dictate the control logic implemented in the scheduler for optimizing resource management and user latency?

Azure ML/Spark: Spark is a platform for performing distributed data analytics, and it comes with various configuration knobs that need to be appropriately set by developers based on their job context: Does the query involve JOIN clauses? How big are the data shards? The workload patterns change over time, and pre-trained models for choosing optimal configurations may not suffice. Can we help developers dynamically choose the deployment configuration based on workload signals?

Azure Functions VM management: Can we tune the VM management policy implemented in Azure Functions for VM pre-fetching/eviction to minimize cold starts and memory wastage over time? Our results in simulations are quite encouraging. We want to engage with the Azure and MSR Redmond teams to discuss the possibility of tuning the policy in the production setting.

Azure Kubernetes Service: AKS is chosen by first-party as well as third-party Azure customers for facilitating containerized development and deployment of cloud applications. The in-built workload autoscaling policies in AKS use several configuration parameters, which can be far from optimal in several scenarios. Can we help automatically adjust the parameters that govern resource allocation to containers running microservices based on applications’ workload patterns?

Horizontal solution design for configuration tuning

We see three main reasons why this is the right time to design and incorporate an RL-based solution framework across cloud management tasks:

  1. As the size and complexity of services in the cloud continue to increase, as our hardware footprint continues to include many SKUs, and as configuration and code get larger and more complex, heuristics and hand-tuning cannot provide optimal operations at all times, at least not without significant and proportionate investment in human experts and engineers.
  2. While we will have to rely on domain experts for key changes in systems and the services landscape on the cloud, using RL sub-systems can help reduce dependence on expert decisions and domain knowledge over time.
  3. It is important to have a horizontal framework with a simple yet expressive API, with appropriate algorithms for tuning configuration parameters in an online fashion to optimize a developer-specific metric of interest or reward.

SelfTune framework

We have designed and deployed the SelfTune framework to help cloud service developers automatically tune the configuration parameters in their codebase, which would otherwise be manually set or heuristically tweaked. SelfTune is an RL-based framework that helps developers automate complex post-deployment cloud management tasks such as parameter tuning and performance engineering.

SelfTune is hosted as a service on the public Azure cloud. First-party applications that are interested in post-deployment parameter tuning can use REST API calls to access SelfTune endpoints. The SelfTune framework has two components:

  1. The Client API provides the support needed to access the SelfTune endpoints via REST API calls, namely, Predict for getting the parameters from the framework and SetReward for providing reward/feedback to the framework.
  2. The RL Engine implements a suite of ML/RL algorithms for periodically updating the parameters and returning the latest values to the clients, as well as for periodically computing the reward metrics.

At the core of the SelfTune framework is the formulation of the post-deployment parameter tuning problem as that of “online learning from bandit feedback.” SelfTune assumes that the only interaction possible with the external system (i.e., the application being tuned) is a black-box access to some form of feedback (e.g., daily P95 latency of the service). The framework repeatedly deploys configuration parameters and observes the corresponding rewards after a developer-defined period. As the operational environment (e.g., production cluster running certain types of workloads) is constantly in flux, there is no single setting of parameters that will remain optimal throughout. Thus, SelfTune continuously runs the explore-exploit paradigm of RL techniques – explore new parameters in the vicinity of the currently deployed parameters, observe rewards, update its internal model based on the reward, and exploit parameters that tend to give high rewards over time.
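The explore-exploit loop described above can be illustrated with a small sketch. This is not Bluefin or any SelfTune algorithm; it is a generic two-point perturbation scheme, and the function name `tune`, its arguments, and the toy reward are all assumptions made for illustration. The only access to the "system" is a black-box reward function, mirroring the bandit-feedback setting.

```python
import random


def tune(reward_fn, center, step=0.1, radius=0.5, rounds=200, seed=0):
    """Tune real-valued parameters using only observed rewards (bandit feedback).

    Illustrative sketch only: a generic two-point perturbation update,
    not the Bluefin algorithm.
    """
    rng = random.Random(seed)
    center = list(center)
    for _ in range(rounds):
        # Explore: perturb every parameter by +/- radius around the
        # currently deployed values, in a random sign pattern.
        direction = [rng.choice((-1.0, 1.0)) for _ in center]
        plus = [c + radius * d for c, d in zip(center, direction)]
        minus = [c - radius * d for c, d in zip(center, direction)]

        # Observe: the rewards are the only feedback from the system.
        r_plus = reward_fn(plus)
        r_minus = reward_fn(minus)

        # Update: estimate the reward slope along the perturbation and
        # move the deployed parameters toward the better side.
        slope = (r_plus - r_minus) / (2 * radius)
        center = [c + step * slope * d for c, d in zip(center, direction)]
    return center


def toy_reward(params):
    # Hypothetical stand-in for a real metric; peaks at (3, -1).
    return -((params[0] - 3.0) ** 2 + (params[1] + 1.0) ** 2)


best = tune(toy_reward, [0.0, 0.0])
```

In a real deployment the reward would be noisy and would arrive only after the developer-defined period, so each round would span one or more observation windows rather than two immediate function calls.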

We have designed a bandit learning algorithm called Bluefin in SelfTune that crystallizes the aforementioned idea. Our algorithm has lower sample complexity, meaning that it takes fewer rounds to converge to the desired values when we want to tune multiple real-valued parameters simultaneously, compared to peer techniques like multi-armed bandits (the basis of Azure Personalizer), Bayesian Optimization (used by the MLOS framework), or genetic algorithms. This is provable under some assumptions on the reward function, but we observe, across multiple deployments, that the algorithm converges to good solutions in practice even when the theoretical assumptions are violated.

We have open-sourced Bluefin through Vowpal Wabbit, a popular RL library for practitioners, which houses the core algorithms of Azure Personalizer. We are continuing to work on designing vertical RL algorithms and horizontal feature learning for the systems domain. Besides Bluefin, SelfTune supports a suite of black-box optimization techniques (e.g., Bayesian Optimization) and RL techniques (e.g., Deep Deterministic Policy Gradients) that cloud applications can choose from, based on their needs.

A simple integration use case: Consider the scenario of setting PySpark cluster configuration parameters for Azure ML jobs that are spawned for ML workloads in the O365 MS-AI organization. The workloads are composed of various data processing jobs and run on various Azure ML clusters with different capacities and hardware. It is non-trivial to set parameters for the various jobs such that the workloads complete quickly and do not fail midway because of resourcing issues, thereby losing all computation.

Basic SelfTune workflow: The basic integration of SelfTune in the AzureML pipeline is illustrated in the figure below. Here, the developer wants to tune seven key Apache PySpark parameters per job, namely driver memory, driver cores, executor cores, number of executors, executor memory, spark.sql.shuffle.partitions, and spark.default.parallelism.
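As a rough illustration of how tuned values might be turned into a job configuration, the sketch below maps seven real-valued predictions onto the corresponding Spark configuration properties. The property names on the right are standard Spark settings; the left-hand parameter names, the gigabyte units, and the rounding rule are assumptions, since the post does not say how SelfTune encodes these values.

```python
# Hypothetical names for the seven tuned values, mapped to Spark properties.
SPARK_KEYS = {
    "driver_memory_gb": "spark.driver.memory",
    "driver_cores": "spark.driver.cores",
    "executor_cores": "spark.executor.cores",
    "num_executors": "spark.executor.instances",
    "executor_memory_gb": "spark.executor.memory",
    "shuffle_partitions": "spark.sql.shuffle.partitions",
    "default_parallelism": "spark.default.parallelism",
}

# Memory settings take a size suffix; the rest are plain integers.
MEMORY_PARAMS = {"driver_memory_gb", "executor_memory_gb"}


def to_spark_conf(predicted):
    """Convert real-valued predictions into Spark configuration strings."""
    conf = {}
    for name, key in SPARK_KEYS.items():
        # A tuner works with real numbers; Spark expects positive integers.
        value = max(1, round(predicted[name]))
        conf[key] = f"{value}g" if name in MEMORY_PARAMS else str(value)
    return conf


example_conf = to_spark_conf({
    "driver_memory_gb": 7.6,
    "driver_cores": 2.2,
    "executor_cores": 3.9,
    "num_executors": 11.7,
    "executor_memory_gb": 15.8,
    "shuffle_partitions": 199.6,
    "default_parallelism": 0.3,
})
```

The resulting dictionary could then be passed to whatever job submission mechanism the pipeline already uses.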

Basic SelfTune workflow
  1. The developer invokes Predict on the SelfTune instance, asking for the parameters for the next job.
  2. The SelfTune service responds with the predicted parameters for the next job.
  3. The developer submits a job using SelfTune’s predicted parameters (outside SelfTune’s purview).
  4. Once the job is complete, the cluster sends job metadata to the data store (outside SelfTune’s purview).
  5. The developer queries rewards for previously completed jobs, if any, from the data store (e.g., Azure ML workspace).
  6. The data store responds with the rewards (e.g., job completion times, which are part of the job metadata) from previously completed jobs.
  7. If the rewards exist in the store, the developer invokes SetReward for those jobs, which pushes the rewards to the hosted SelfTune service endpoint.
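The seven steps above can be sketched as a single round of client-side logic. Everything here is a stand-in: `InMemoryTuner` and `InMemoryDataStore` are hypothetical in-memory classes, not the SelfTune client API or Azure ML; the prediction id used to match rewards to jobs and the choice of negative completion time as the reward are also assumptions. The stand-in tuner returns fixed parameters and does no learning.

```python
class InMemoryTuner:
    """Hypothetical stand-in for the SelfTune service (normally reached over REST)."""

    def __init__(self, params):
        self._params = dict(params)
        self._next_id = 0
        self.rewards = {}  # prediction id -> reward received via set_reward

    def predict(self):
        # Steps 1-2: return an id plus the parameters for the next job.
        self._next_id += 1
        return self._next_id, dict(self._params)

    def set_reward(self, prediction_id, reward):
        # Step 7: record feedback for an earlier prediction.
        self.rewards[prediction_id] = reward


class InMemoryDataStore:
    """Hypothetical stand-in for the data store holding job metadata."""

    def __init__(self):
        self._completed = {}

    def record_completion(self, prediction_id, seconds):
        # Step 4: the cluster writes job metadata when a job finishes.
        self._completed[prediction_id] = seconds

    def pop_completed(self):
        # Steps 5-6: return completion times not yet reported as rewards.
        done, self._completed = self._completed, {}
        return done


def run_round(tuner, store, submit_job):
    """One pass through the workflow; returns the id of the new prediction."""
    prediction_id, params = tuner.predict()
    submit_job(prediction_id, params)  # Step 3, outside the tuner's purview
    for finished_id, seconds in store.pop_completed().items():
        # Assumption: shorter jobs are better, so reward = -completion time.
        tuner.set_reward(finished_id, -seconds)
    return prediction_id


tuner = InMemoryTuner({"executor_cores": 4})
store = InMemoryDataStore()
# Toy job runner: pretends every job completes at once, taking 120 seconds.
first_id = run_round(
    tuner, store, lambda pid, params: store.record_completion(pid, 120.0)
)
```

Because jobs normally finish long after they are submitted, a real loop would usually report rewards for jobs from earlier rounds, which is why the reward query is decoupled from the submission.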

Self-tuning substrate background jobs scheduler

User-level background job scheduling: All the substrate backend servers in EXO datacenters (that host user mailboxes) run hundreds of low-priority, latency-insensitive, periodic workloads locally (e.g., mailbox replication, encryption, event-driven assistants, etc.). Workload Management (WLM) is a core substrate service that runs on all such backend servers. It helps with the user-level scheduling of workloads on the servers: a) with the goal of completing the tasks when resources become available (at micro-granular timescales), and b) mindful of the fact that high-priority, latency-sensitive workloads will bypass this scheduler. Thus, ensuring high availability of resources especially during peak hours is critical, besides meeting workload SLAs.

Tuning real-valued configuration parameters: The scheduler is implemented today as part of a huge codebase in the substrate core. The scheduler trades off resource utilization and completion rates by dynamically ramping up and ramping down the number of concurrent background tasks that require access to the resources. This is achieved by carefully choosing several configuration settings (hundreds of real-valued parameters). At the server level, we can achieve better resource utilization and throughput by automatically tuning the key parameters based on the workloads the server receives and the ensuing resource health fluctuations.
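To make the idea of ramp-up and ramp-down knobs concrete, here is a deliberately simplified sketch. The post does not describe the actual WLM control logic, so the rule below (additive increase, multiplicative decrease driven by a utilization reading) and every parameter name in `RampParams` are hypothetical. The point is only that a few real-valued settings fully determine the scheduler's behavior, which is what makes them candidates for automatic tuning.

```python
from dataclasses import dataclass


@dataclass
class RampParams:
    """Hypothetical real-valued knobs of a concurrency controller."""
    healthy_below: float = 0.6     # utilization below this: ramp up
    strained_above: float = 0.8    # utilization above this: ramp down
    ramp_up_step: float = 2.0      # tasks added per healthy interval
    ramp_down_factor: float = 0.5  # multiplier applied when strained


def next_concurrency(current, utilization, params, lo=1, hi=64):
    """Return the number of concurrent background tasks for the next interval."""
    if utilization < params.healthy_below:
        target = current + params.ramp_up_step
    elif utilization > params.strained_above:
        target = current * params.ramp_down_factor
    else:
        target = current
    # Keep the result within the allowed range and make it a whole number.
    return int(min(hi, max(lo, target)))


defaults = RampParams()
ramped_up = next_concurrency(10, 0.30, defaults)
ramped_down = next_concurrency(10, 0.90, defaults)
```

A tuner would treat the four fields of `RampParams` as the parameters to adjust and a throughput or health metric as the reward.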

Impact of using SelfTune in WLM: We have integrated SelfTune with the substrate background scheduler codebase (the change required is simple, on the order of tens of lines of code, as shown in the figure below). We first deployed in the inner rings of substrate (more than 3,000 servers). The results gathered over 4-5 weeks of deployment clearly indicate that tuning helps on most of the deployed servers, increasing throughput by at least 20% across multiple forests in their heavily throttled servers, with a marginal increase in CPU health and insignificant-to-mild degradation of disk health. Based on this validation, we have now rolled out SelfTune integration to most EXO backend servers (nearly 200,000) across the worldwide production ring.

SelfTune Application library contains the SelfTune client API and the RL/ML algorithms

Ongoing work and future AI+systems research

SelfTune is a general platform and can be readily applied to many RL-for-cloud scenarios without any additional feature engineering or onboarding efforts (which are typically required in AIOps). We expect that developers can define a suitable spatial and temporal tuning scope in the service/system, tuning the parameters of the service running in the cluster, at the level of machines, every hour of every day. Thus, instead of hand-coding the optimal operating points for various machines or various clusters that the service operates in, we could integrate SelfTune in the service codebase to dynamically figure them out over time, based on the real-time feedback at a determined temporal granularity.

Our work poses a lot of interesting design and algorithmic questions in this space. For instance, can we automatically scope the tuning problem based on some observed context such as cluster type, hardware, workload volumes, etc., and find optimal parameters per scope? Given that typical cloud applications have hundreds, if not thousands, of knobs to tune, can we automatically identify the knobs that impact the performance metric of interest, and then tune those knobs more efficiently?

A combination of system insights, ML formulations, and cross-layer optimization is vital for effective post-deployment management of cloud applications and services. We will post an update to this blog post on our ongoing work in this space soon. Meanwhile, the final blog post in this series will explore how AIOps can be made more comprehensive by spanning the entire cloud stack.

The post Automatic post-deployment management of cloud applications appeared first on Microsoft Research.

Microsoft at NSDI 2023: A commitment to advancing networking and distributed systems

Microsoft has made significant contributions to the prestigious USENIX NSDI’23 conference, which brings together experts in computer networks and distributed systems. A silver sponsor for the conference, Microsoft is a leader in developing innovative technologies for networking, and we are proud to have contributed to 30 papers accepted this year. Our team members also served on the program committee, highlighting our commitment to advancing the field.

The accepted research papers span a wide range of topics, including networking for AI workloads, cloud networking, WAN, and wireless networks. These papers showcase some of the latest advancements in networking research.

The paper “DOTE: Rethinking (Predictive) WAN Traffic Engineering”, which revisits traffic engineering in the wide area network (WAN), was selected for one of the Best Paper Awards at the conference. This work was done jointly by researchers at Microsoft and academics at the Hebrew University of Jerusalem and the Technion.

Some other innovations on cloud networking infrastructure include:

Empowering Azure Storage with RDMA, which presents the findings from deploying intra-region Remote Direct Memory Access (RDMA) to support storage workloads in Azure. Today, around 70% of traffic in Azure is RDMA and intra-region RDMA is supported in all Azure public regions. RDMA helps us achieve significant disk I/O performance improvements and CPU core savings. This research is a testament to Microsoft’s ongoing commitment to providing customers with the best possible user experience.

Disaggregating Stateful Network Functions, which introduces a new approach for better reliability and performance at a lower per-server cost for cloud users. The core idea is to move the network function processing off individual servers and into shared resource pools. This technology is now shipping as part of Microsoft Azure Accelerated Connections.

Our colleagues from Microsoft Research Asia will present ARK: GPU-driven Code Execution for Distributed Deep Learning, which overcomes the overhead of GPU communication for large deep learning workloads by having GPUs run their own code and handle communication events autonomously, without CPU intervention.

Microsoft’s collective contributions to the USENIX NSDI’23 conference highlight our commitment to advancing the field of networking research and developing innovative solutions to real-world networking problems, leveraging strong academic collaborations. We look forward to continuing to push the boundaries of what is possible in networking research and delivering cutting-edge solutions to our customers.

A complete list of Microsoft papers accepted at USENIX NSDI’23:

  1. Understanding RDMA Microarchitecture Resources for Performance Isolation, Xinhao Kong and Jingrong Chen, Duke University; Wei Bai, Microsoft; Yechen Xu, Shanghai Jiao Tong University; Mahmoud Elhaddad, Shachar Raindel, and Jitendra Padhye, Microsoft; Alvin R. Lebeck and Danyang Zhuo, Duke University
  2. Empowering Azure Storage with RDMA, Wei Bai, Shanim Sainul Abdeen, Ankit Agrawal, Krishan Kumar Attre, Paramvir Bahl, Ameya Bhagat, Gowri Bhaskara, Tanya Brokhman, Lei Cao, Ahmad Cheema, Rebecca Chow, Jeff Cohen, Mahmoud Elhaddad, Vivek Ette, Igal Figlin, Daniel Firestone, Mathew George, Ilya German, Lakhmeet Ghai, Eric Green, Albert Greenberg, Manish Gupta, Randy Haagens, Matthew Hendel, Ridwan Howlader, Neetha John, Julia Johnstone, Tom Jolly, Greg Kramer, David Kruse, Ankit Kumar, Erica Lan, Ivan Lee, Avi Levy, Marina Lipshteyn, Xin Liu, Chen Liu, Guohan Lu, Yuemin Lu, Xiakun Lu, Vadim Makhervaks, Ulad Malashanka, David A. Maltz, Ilias Marinos, Rohan Mehta, Sharda Murthi, Anup Namdhari, Aaron Ogus, Jitendra Padhye, Madhav Pandya, Douglas Phillips, Adrian Power, Suraj Puri, Shachar Raindel, Jordan Rhee, Anthony Russo, Maneesh Sah, Ali Sheriff, Chris Sparacino, Ashutosh Srivastava, Weixiang Sun, Nick Swanson, Fuhou Tian, Lukasz Tomczyk, Vamsi Vadlamuri, Alec Wolman, Ying Xie, Joyce Yom, Lihua Yuan, Yanzhao Zhang, and Brian Zill, Microsoft
  3. ARK: GPU-driven Code Execution for Distributed Deep Learning, Changho Hwang, KAIST, Microsoft Research; KyoungSoo Park, KAIST; Ran Shu, Xinyuan Qu, Peng Cheng, and Yongqiang Xiong, Microsoft Research
  4. Hydra: Serialization-Free Network Ordering for Strongly Consistent Distributed Applications, Inho Choi, National University of Singapore; Ellis Michael, University of Washington; Yunfan Li, National University of Singapore; Dan R. K. Ports, Microsoft Research; Jialin Li, National University of Singapore
  5. Waverunner: An Elegant Approach to Hardware Acceleration of State Machine Replication, Mohammadreza Alimadadi and Hieu Mai, Stony Brook University; Shenghsun Cho, Microsoft; Michael Ferdman, Peter Milder, and Shuai Mu, Stony Brook University
  6. Scalable Distributed Massive MIMO Baseband Processing, Junzhi Gong, Harvard University; Anuj Kalia, Microsoft; Minlan Yu, Harvard University
  7. Unlocking unallocated cloud capacity for long, uninterruptible workloads, Anup Agarwal, Carnegie Mellon University; Shadi Noghabi, Microsoft Research; Íñigo Goiri, Azure Systems Research; Srinivasan Seshan, Carnegie Mellon University; Anirudh Badam, Microsoft Research
  8. Invisinets: Removing Networking from Cloud Networks, Sarah McClure and Zeke Medley, UC Berkeley; Deepak Bansal and Karthick Jayaraman, Microsoft; Ashok Narayanan, Google; Jitendra Padhye, Microsoft; Sylvia Ratnasamy, UC Berkeley and Google; Anees Shaikh, Google; Rishabh Tewari, Microsoft
  9. Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs, John Thorpe, Pengzhan Zhao, Jonathan Eyolfson, and Yifan Qiao, UCLA; Zhihao Jia, CMU; Minjia Zhang, Microsoft Research; Ravi Netravali, Princeton University; Guoqing Harry Xu, UCLA
  10. OneWAN is better than two: Unifying a split WAN architecture, Umesh Krishnaswamy, Microsoft; Rachee Singh, Microsoft and Cornell University; Paul Mattes, Paul-Andre C Bissonnette, Nikolaj Bjørner, Zahira Nasrin, Sonal Kothari, Prabhakar Reddy, John Abeln, Srikanth Kandula, Himanshu Raj, Luis Irun-Briz, Jamie Gaudette, and Erica Lan, Microsoft
  11. TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches, Aashaka Shah, University of Texas at Austin; Vijay Chidambaram, University of Texas at Austin and VMware Research; Meghan Cowan, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, and Olli Saarikivi, Microsoft Research; Rachee Singh, Microsoft and Cornell University
  12. Synthesizing Runtime Programmable Switch Updates, Yiming Qiu, Rice University; Ryan Beckett, Microsoft; Ang Chen, Rice University
  13. Formal Methods for Network Performance Analysis, Mina Tahmasbi Arashloo, University of Waterloo; Ryan Beckett, Microsoft Research; Rachit Agarwal, Cornell University
  14. Scalable Tail Latency Estimation for Data Center Networks, Kevin Zhao, University of Washington; Prateesh Goyal, Microsoft Research; Mohammad Alizadeh, MIT CSAIL; Thomas E. Anderson, University of Washington
  15. Addax: A fast, private, and accountable ad exchange infrastructure, Ke Zhong, Yiping Ma, and Yifeng Mao, University of Pennsylvania; Sebastian Angel, University of Pennsylvania & Microsoft Research
  16. RECL: Responsive Resource-Efficient Continuous Learning for Video Analytics, Mehrdad Khani, MIT CSAIL and Microsoft; Ganesh Ananthanarayanan and Kevin Hsieh, Microsoft; Junchen Jiang, University of Chicago; Ravi Netravali, Princeton University; Yuanchao Shu, Zhejiang University; Mohammad Alizadeh, MIT CSAIL; Victor Bahl, Microsoft
  17. Tambur: Efficient loss recovery for videoconferencing via streaming codes, Michael Rudow, Carnegie Mellon University; Francis Y. Yan, Microsoft Research; Abhishek Kumar, Carnegie Mellon University; Ganesh Ananthanarayanan and Martin Ellis, Microsoft; K.V. Rashmi, Carnegie Mellon University
  18. Gemel: Model Merging for Memory-Efficient, Real-Time Video Analytics at the Edge, Arthi Padmanabhan, UCLA; Neil Agarwal, Princeton University; Anand Iyer and Ganesh Ananthanarayanan, Microsoft Research; Yuanchao Shu, Zhejiang University; Nikolaos Karianakis, Microsoft Research; Guoqing Harry Xu, UCLA; Ravi Netravali, Princeton University
  19. On Modular Learning of Distributed Systems for Predicting End-to-End Latency, Chieh-Jan Mike Liang, Microsoft Research; Zilin Fang, Carnegie Mellon University; Yuqing Xie, Tsinghua University; Fan Yang, Microsoft Research; Zhao Lucis Li, University of Science and Technology of China; Li Lyna Zhang, Mao Yang, and Lidong Zhou, Microsoft Research
  20. SelfTune: Tuning Cluster Managers, Ajaykrishna Karthikeyan and Nagarajan Natarajan, Microsoft Research; Gagan Somashekar, Stony Brook University; Lei Zhao, Microsoft; Ranjita Bhagwan, Microsoft Research; Rodrigo Fonseca, Tatiana Racheva, and Yogesh Bansal, Microsoft
  21. OpenLoRa: Validating LoRa Implementations through an Extensible and Open-sourced Framework, Manan Mishra, Daniel Koch, Muhammad Osama Shahid, and Bhuvana Krishnaswamy, University of Wisconsin-Madison; Krishna Chintalapudi, Microsoft Research; Suman Banerjee, University of Wisconsin-Madison
  22. ExoPlane: An Operating System for On-Rack Switch Resource Augmentation, Daehyeok Kim, Microsoft and University of Texas at Austin; Vyas Sekar and Srinivasan Seshan, Carnegie Mellon University
  23. Sketchovsky: Enabling Ensembles of Sketches on Programmable Switches, Hun Namkung, Carnegie Mellon University; Zaoxing Liu, Boston University; Daehyeok Kim, Microsoft Research; Vyas Sekar and Peter Steenkiste, Carnegie Mellon University
  24. Acoustic Sensing and Communication Using Metasurface, Yongzhao Zhang, Yezhou Wang, and Lanqing Yang, Shanghai Jiao Tong University; Mei Wang, UT Austin; Yi-Chao Chen, Shanghai Jiao Tong University and Microsoft Research Asia; Lili Qiu, UT Austin and Microsoft Research Asia; Yihong Liu, University of Glasgow; Guangtao Xue and Jiadi Yu, Shanghai Jiao Tong University
  25. Disaggregating Stateful Network Functions, Deepak Bansal, Gerald DeGrace, Rishabh Tewari, Michal Zygmunt, and James Grantham, Microsoft; Silvano Gai, Mario Baldi, Krishna Doddapaneni, Arun Selvarajan, Arunkumar Arumugam, and Balakrishnan Raman, AMD Pensando; Avijit Gupta, Sachin Jain, Deven Jagasia, Evan Langlais, Pranjal Srivastava, Rishiraj Hazarika, Neeraj Motwani, Soumya Tiwari, Stewart Grant, Ranveer Chandra, and Srikanth Kandula, Microsoft
  26. Doing More with Less: Orchestrating Serverless Applications without an Orchestrator, David H. Liu and Amit Levy, Princeton University; Shadi Noghabi and Sebastian Burckhardt, Microsoft Research
  27. NetPanel: Traffic Measurement of Exchange Online Service, Yu Chen, Microsoft 365, China; Liqun Li and Yu Kang, Microsoft Research, China; Boyang Zheng, Yehan Wang, More Zhou, Yuchao Dai, and Zhenguo Yang, Microsoft 365, China; Brad Rutkowski and Jeff Mealiffe, Microsoft 365, USA; Qingwei Lin, Microsoft Research, China
  28. DOTE: Rethinking (Predictive) WAN Traffic Engineering, Yarin Perry, Hebrew University of Jerusalem; Felipe Vieira Frujeri, Microsoft Research; Chaim Hoch, Hebrew University of Jerusalem; Srikanth Kandula and Ishai Menache, Microsoft Research; Michael Schapira, Hebrew University of Jerusalem; Aviv Tamar, Technion
  29. Push-Button Reliability Testing for Cloud-Backed Applications with Rainmaker, Yinfang Chen and Xudong Sun, University of Illinois at Urbana-Champaign; Suman Nath, Microsoft Research; Ze Yang and Tianyin Xu, University of Illinois at Urbana-Champaign
  30. Test Coverage for Network Configurations, Xieyang Xu and Weixin Deng, University of Washington; Ryan Beckett, Microsoft; Ratul Mahajan, University of Washington; David Walker, Princeton University

The post Microsoft at NSDI 2023: A commitment to advancing networking and distributed systems appeared first on Microsoft Research.

Hunting speculative information leaks with Revizor

Spectre and Meltdown are two security vulnerabilities that affect the vast majority of CPUs in use today. CPUs, or central processing units, act as the brains of a computer, directing the functions of its other components. By targeting a feature of the CPU implementation that optimizes performance, attackers could access sensitive data previously considered inaccessible. 

For example, Spectre exploits speculative execution, an aggressive strategy for increasing processing speed by postponing certain security checks. But it turns out that before the CPU performs the security check, attackers might have already extracted secrets via so-called side channels. This vulnerability went undetected for years before it was discovered and mitigated in 2018. Security researchers warned that thieves could use it to target countless computers, phones and mobile devices. Researchers began hunting for more vulnerabilities, and they continue to find them. But this process was manual, and progress came slowly. With no tools available to help them search, researchers had to analyze documentation, read through patents, and experiment with different CPU generations.

A group of researchers from Microsoft and academic partners began exploring a method for systematically finding and analyzing CPU vulnerabilities. This effort would produce a tool called Revizor (REV-izz-or), which automatically detects microarchitectural leakage in CPUs—with no prior knowledge about the internal CPU components. Revizor achieves this by differentiating between expected and unexpected information leaks on the CPU. 

The Revizor process begins by describing what is expected from the CPU in a so-called “leakage contract.” Revizor then searches the CPU to find any violations of this contract. It creates random programs, runs them on the CPU, records the information they expose, and compares the information with the contract. When it finds a mismatch that violates the contract, it reports it as a potential vulnerability. 
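The comparison step can be illustrated with a small sketch. This is not Revizor's code: the function names, the integer "traces", and the two toy hardware models are assumptions made for illustration, and random program generation is left out. The sketch flags a violation when two inputs expose the same information under the contract but different information on the (simulated) hardware.

```python
from collections import defaultdict


def find_violations(inputs, contract_trace, hardware_trace):
    """Return input pairs the contract cannot tell apart but the hardware can."""
    # Group inputs by the information the contract allows them to expose.
    classes = defaultdict(list)
    for value in inputs:
        classes[contract_trace(value)].append(value)

    violations = []
    for members in classes.values():
        reference = members[0]
        for other in members[1:]:
            # Same contract trace, different hardware trace: unexpected leak.
            if hardware_trace(other) != hardware_trace(reference):
                violations.append((reference, other))
    return violations


def contract_trace(value):
    # Hypothetical contract: only the two low bits of the input may leak.
    return value % 4


def leaky_hardware(value):
    # Toy model that exposes three low bits, one more than the contract allows.
    return value % 8


def compliant_hardware(value):
    # Toy model that exposes exactly what the contract allows.
    return value % 4


inputs = range(16)
leaks = find_violations(inputs, contract_trace, leaky_hardware)
clean = find_violations(inputs, contract_trace, compliant_hardware)
print(len(leaks), len(clean))  # prints "8 0"
```

Grouping inputs by contract trace is what separates expected leaks from unexpected ones: differences between groups are permitted by the contract, so only differences inside a group are reported.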

Details were published in 2022 in the paper: Revizor: Testing Black-box CPUs against Speculation Contracts

To demonstrate Revizor’s effectiveness, the researchers tested a handful of commercial CPUs and found several known vulnerabilities, including Spectre, MDS, and LVI, as well as several previously unknown variants. 

However, the search was still slow, which hindered the discovery of entirely new classes of leaks. The team identified the root causes of the performance limitations, and proposed techniques to overcome them, improving the testing speed by up to two orders of magnitude. The improvements are described in a newly published paper: Hide and Seek with Spectres: Efficient discovery of speculative information leaks with random testing

These improvements supported a testing campaign of unprecedented depth on Intel and AMD CPUs. In the process, the researchers found two types of previously unknown speculative leaks (affecting string comparison and division) that had escaped previous analyses—both manual and automated. These results show that work which previously required persistent hacking and painstaking manual labor can now be automated and rapidly accelerated. 

The team began working with the Microsoft Security Response Center and hardware vendors, and together they continue to find vulnerabilities so they can be closed before they are discovered by hackers—thereby protecting customers from risk. 

Revizor is part of Project Venice, which investigates novel mechanisms for the secure sharing and partitioning of computing resources, together with techniques for specifying and rigorously validating their resilience to side-channel attacks. 

The post Hunting speculative information leaks with Revizor appeared first on Microsoft Research.

Read More

Hunting speculative information leaks with Revizor

Hunting speculative information leaks with Revizor

Revizor chart

Spectre and Meltdown are two security vulnerabilities that affect the vast majority of CPUs in use today. CPUs, or central processing units, act as the brains of a computer, directing the functions of its other components. By targeting a feature of the CPU implementation that optimizes performance, attackers could access sensitive data previously considered inaccessible. 

For example, Spectre exploits speculative execution—an aggressive strategy for increasing processing speed by postponing certain security checks. But it turns out that before the CPU performs the security check, attackers might have already extracted secrets via so-called side-channels. This vulnerability went undetected for years before it was discovered and mitigated in 2018. Security researchers warned that thieves could use it to target countless computers, phones and mobile devices. Researchers began hunting for more vulnerabilities, and they continue to find them. But this process is manual and progress came slowly. With no tools available to help them search, researchers had to analyze documentation, read through patents, and experiment with different CPU generations. 

A group of researchers from Microsoft and academic partners began exploring a method for systematically finding and analyzing CPU vulnerabilities. This effort would produce a tool called Revizor (REV-izz-or), which automatically detects microarchitectural leakage in CPUs—with no prior knowledge about the internal CPU components. Revizor achieves this by differentiating between expected and unexpected information leaks on the CPU. 

Spotlight: Microsoft Research Podcast

AI Frontiers: The Physics of AI with Sébastien Bubeck

What is intelligence? How does it emerge and how do we measure it? Ashley Llorens and machine learning theorist Sébastian Bubeck discuss accelerating progress in large-scale AI and early experiments with GPT-4.

The Revizor process begins by describing what is expected from the CPU in a so-called “leakage contract.” Revizor then searches the CPU to find any violations of this contract. It creates random programs, runs them on the CPU, records the information they expose, and compares the information with the contract. When it finds a mismatch that violates the contract, it reports it as a potential vulnerability. 

Details were published in 2022 in the paper: Revizor: Testing Black-box CPUs against Speculation Contracts

To demonstrate Revizor’s effectiveness, the researchers tested a handful of commercial CPUs and found several known vulnerabilities, including Spectre, MDS, and LVI, as well as several previously unknown variants. 

However, the search was still slow, which hindered the discovery of entirely new classes of leaks. The team identified the root causes of the performance limitations, and proposed techniques to overcome them, improving the testing speed by up to two orders of magnitude. The improvements are described in a newly published paper: Hide and Seek with Spectres: Efficient discovery of speculative information leaks with random testing

These improvements supported a testing campaign of unprecedented depth on Intel and AMD CPUs. In the process, the researchers found two types of previously unknown speculative leaks (affecting string comparison and division) that had escaped previous analyses—both manual and automated. These results show that work which previously required persistent hacking and painstaking manual labor can now be automated and rapidly accelerated. 

The team began working with the Microsoft Security Response Center and hardware vendors, and together they continue to find vulnerabilities so they can be closed before they are discovered by hackers—thereby protecting customers from risk. 

Revizor is part of Project Venice, which investigates novel mechanisms for the secure sharing and partitioning of computing resources, together with techniques for specifying and rigorously validating their resilience to side-channel attacks. 

The post Hunting speculative information leaks with Revizor appeared first on Microsoft Research.

AI Frontiers: Models and Systems with Ece Kamar

black and white photo of Ece Kamar, Partner Research Manager at Microsoft Research, next to the Microsoft Research Podcast

Episode 138 | April 13, 2023

Powerful new large-scale AI models like GPT-4 are showing dramatic improvements in reasoning, problem-solving, and language capabilities. This marks a phase change for artificial intelligence—and a signal of accelerating progress to come.

In this Microsoft Research Podcast series, AI scientist and engineer Ashley Llorens hosts conversations with his collaborators and colleagues about what these new models—and the models that will come next—mean for our approach to creating, understanding, and deploying AI, its applications in areas such as health care and education, and its potential to benefit humanity.

The third episode features Ece Kamar, deputy lab director at Microsoft Research Redmond. Kamar draws on decades of experience in AI research and an opportunity she and Microsoft colleagues had to evaluate and experiment with GPT-4 prior to its release in discussing the capabilities and limitations of today’s large-scale models. She explores the short-term mitigation techniques she and her team are using to make these models viable components of the AI systems that give them purpose and shares the long-term research questions that will help maximize their value. 

Transcript

[MUSIC PLAYS]

Ashley Llorens: I’m Ashley Llorens with Microsoft Research. I’ve spent the last 20 years working in AI and machine learning, but I’ve never felt more fortunate to work in the field than at this moment. The development of increasingly powerful large-scale models is accelerating the advancement of AI. Most recently, GPT-4 is exhibiting surprising new abilities like problem-solving and translation across languages and domains.

In this podcast series, I’ll share conversations with fellow researchers about our impressions of GPT-4, the nature of intelligence, and ultimately how innovations like these can have the greatest benefit for humanity.

Today we’re sitting down with Ece Kamar, deputy lab director at Microsoft Research in Redmond. In the months leading up to the release of GPT-4, Ece and her team leveraged their many years of experience in AI research to evaluate the model and to help understand and mitigate its limitations so the experiences it powers can bring the greatest benefit to the people that use them.


Welcome to AI Frontiers.

All right, why don’t we just jump right in.

Ece Kamar: Okay.

Llorens: Okay.

Kamar: Take it over.

[MUSIC FADES]

Llorens: All right, so I want to start at a place that I think will be close to your heart, and that is with the difference between a model and a system. But let me, let me paint the picture a little bit, right. So machine learning is a process through which we create something called a model, which is learned from data. The model is a kind of program that maps inputs to outputs, be it language, images, etc. In deep learning, the models are some variant of an artificial neural network. And finally, in the current era of large-scale AI, these models can have hundreds of billions of parameters or more. But there’s a model, and then there’s a system. The system is the thing that gets deployed when we put out a product or something. So, Ece, from your perspective, what’s the difference between a model as described here and a system?

Ece Kamar: Yeah, that’s, that’s something that I’m thinking so much about these days because we are all getting very excited about the emerging capabilities we see in the latest models—what they can do, what kind of questions we can ask them, the generalizability, the interactive power, even some of the reasoning capabilities that are surprising just to be able to get them with that input-output mapping that, Ashley, you’ve been talking about. However, when you think about it, these models on their own, they don’t really have a purpose. They are just trying to replicate what they have seen in these massive data sources. And the thing that has been driving me as a researcher, even from my earlier days, has been the purpose: why are we building technology, and what is the purpose behind it? And the main difference between a system and a model is a system has a purpose. We build these systems for a particular reason that—in particular, the reason I care very much about is providing value to people who use these systems. So in terms of that distinction, I am spending a lot of time these days thinking about system design with the purpose of enabling, augmenting people, and these systems will have these latest models as building blocks. No question about it. They are so powerful in terms of synthesizing information, having a cohesive, interesting conversation. But at the same time, they are not enough. To be helpful to people, we have additional capabilities like knowing about that individual, learning from that individual, having an understanding of the goals that the individual would like to have. So we are trying to get to that system architecture, the system design that can actually make that input-output model a very crucial part of a much bigger, uh, purpose.

Llorens: Maybe next we can go into the system lifecycle. So there’s a way that a system component like a model becomes part, uh, of a larger system that eventually gets deployed. So tell me about that lifecycle. What’s that like from your experience?

Kamar: From my experience, actually, the larger system you really care about is the hybrid human-AI system because at the end of the day, what we really care about is not how great a system is alone, like an AI system is alone, but we care very much about how well that partnership is working between the human and the AI system. And right now, we have some systems out in the world that are actually already providing a lot of value. Copilot is a great example of this—the GitHub Copilot—where as you’re writing code, it can make suggestions for you and you can accept or reject them. At the same time, this is really missing some very crucial abilities in it because we are still in the very early days of this copilot-AI revolution. So what are some of the capabilities we are missing? Copilot still doesn’t really have a very good understanding of me as a developer. What are the particular habits I have? What kind of code I love to write? Maybe I care very much about the interpretability of my code by others when I’m not in that project anymore. It is not necessarily a preference that Copilot has about me. I think soon enough it will because I think we are going to get to a world where these AI systems will know a lot about us, our goals, our preferences, our intentions, our habits. And then they are going to become a lot more helpful to us. The other thing that’s not happening with the current systems is that they are not learning from feedback. As individuals, when we are part of teams—let’s say I’m working with you, which we do, all the time—I learn about you; you give me your feedback. You say, “Next time, don’t do that. Maybe don’t consider doing it that way.” I take that into account. I get better at what I do because I learn from you. So the more we build these self-feeding feedback loops into our AI systems, they are going to have a better understanding of us as users, but also they are going to be able to provide more value for us.

Llorens: The first time I used GPT-4, I asked it a question that was inspired by my own work in underwater robotics. I asked it how far away I could hear a sound generated underwater in the ocean. The response took me completely by surprise. The model pointed out that more information was needed, like how temperature would affect the speed of sound through the water. It suggested I consider using a sonar array. It went ahead and made its own assumptions about those things and then gave me an answer. The reasoning was breathtaking to me. I knew for a fact it hadn’t been explicitly trained to do any of that. It challenged my notion of the value of being able to do this kind of reasoning as a researcher.

So maybe we can actually start with the model and your experience of it. The capabilities and limitations. But why don’t we just start with your first impressions of it?

Kamar: It was surprising, mainly because I have been working in the AI space for almost like, I don’t want to say it, but two decades. So we have been all thinking about new models, new architectures, what is coming in AI; we always had in mind these kind of ambitious goals for AI. For me, it has always been these AI assistants that come and help us with whatever we are doing, even from the early days it has been that. But always that aspiration never really landed because when we tried to build these systems, they became so narrow that they did not really match what, as users, we needed from them. And when I saw GPT-4 and started interacting with it, I saw some mind-blowing capabilities that I thought I wouldn’t see for many years to come. And one of the surprises was how quickly we got here. So that’s kind of No. 1. And we can talk a lot more about like what are those surprising abilities, but second, immediately, my mind goes to, what can we do with this? Because first of all, there’s so much potential now we have in terms of bringing that vision of helping people into reality.

But second of all, because I also care a lot about responsibility, “Oh, my god, this powerful model will come with so much responsibility.” What, as Microsoft, we build with this plus what others will be able to build with this model or maybe models [that] will come next, that’s going to matter a lot not only for us as researchers, not only for users, but for our society overall.

So the other reaction I had was like, what can go wrong and what can we do to prevent those negative consequences from happening? And that’s going to be a long journey that we are going to be on.

Llorens: Sure. Let’s get further into those surprising capabilities.

Kamar: Yeah, sure. So one of the very surprising capabilities is how general purpose these models are at the moment. I can prompt it to write code, write poems. I can ask—I’m Turkish. I can ask questions in Turkish and I can get very fluid responses in Turkish. It can actually write me beautiful poems about sunset in Cappadocia in Turkish, which is like, oh my god, this is already creating an emotional reaction, right, when I’m interacting with it. And then, though, you get into much more practical tasks. For example, how do I turn some of my thoughts into, into writing? Um, how can I customize my voice for different audiences? And the model seems to know about these things and can help me, but not producing a final result, but bringing me to a point where I can be a lot more productive.

So that general-purpose nature of it, like I can go from writing a poem—which I’m terrible at it—to writing academic papers—I think I’m better at that—and helping me throughout the spectrum when I’m not good at something, when I’m kind of good at something. That is just showing me so much potential, such a big spectrum.

But the other thing is the interactivity. It is not this static tool where I basically ask one thing, it gives me one answer, and I’m kind of done, like whatever I can do with that one turn is all I get. It is actually the opposite. It gives me a response and I can actually instruct it further. I can talk about my preferences, how I would like that to be changed for something that’s a much better fit for my needs.

And as a person, I may not be able to articulate my needs at the beginning clearly, so that interaction of being able to see what it can do and asking further is just making it a much, much more capable tool. And the other thing is the reasoning capabilities. What I mean by that is that, you know, for the last few years, as these larger models came out and came out, we all said, OK, this is pretty powerful, but it is still just like repeating patterns it has seen in the, in the internet. And one of the—you know, I think some of my colleagues used the term—was “stochastic parrots.” It’s just repeating things back to you. And what we are seeing with GPT-4—and I think it’s just the phase transition; we’re at this point in this phase transition and these capabilities are going to get stronger and stronger—is that the capabilities for synthesis, compiling information together to get into new insights that may not exist. I’m not claiming all of those insights are correct, but they are giving people sparks that they can further think about and build on. Also, it can reason about multiple steps. It’s not a planner yet, but it has the basics of top-level reasoning where we can start from a point towards the goal and we can collaborate to work towards a plan to get there.

And those are all very powerful things, especially when we think about building an AI system that can take somebody’s goals and turn them into actions.

Llorens: So you mentioned planning as a limitation of the model, but let’s just talk, you know, maybe more fully about the limitations that, that you see in the, in the current, current model, current state of the art.

Kamar: You know, a lot of people, when they think about these limitations, they see these as reasons not to invest in these technologies at all. I look at it from a different perspective. I see these as pieces of the puzzle that we need to invent and put in place. So we started this conversation with the distinction between the model and the system. The model is a very powerful piece of this puzzle, but we are also, as we are building these systems—like Bing is a great example, the GitHub Copilot is another example—we are seeing what they can do, but we are seeing a lot about what they cannot do, and that is giving us, as researchers, ideas about new puzzle pieces we need to invent so that we can come to this architecture.

So a huge limitation, hallucinations. I think that is top of mind for a lot of us. These models are learning from large datasets on the internet, they don’t have fresh information. They are not able to separate reliable information from unreliable information. And also because these models are general-purpose tools, sometimes we want to use them for creating something new that doesn’t exist on the internet, for example, writing a brand-new poem that nobody else wrote before. But sometimes you want them as information retrieval engines, where the biggest requirement is being correct in terms of that information coming back. So we are all learning, like, how can we understand the purpose, turn it into prompts, and then figure out the best way to instruct these models so that, so that we are getting our desired behavior in return, but also how can we actually, in the future, specialize these models in a way that we can have versions that are much less prone to hallucinations?

How can we ground them with the right context and know how to communicate that intent well, so that I can be assured that whenever they are giving me information, giving me facts when I need the facts, they are giving me the right facts? We are at the very beginning of solving this puzzle. But in my mind, this is not a limitation.

This is actually showing me a lot of problems, research problems, to invest in.

Llorens: So, Ece, you’re a leader here at Microsoft Research. You’ve got a team, and your team, uh, is instrumental in this process of turning the model into a system, uh, for some of these applications. And I guess you’ve talked about understanding the purpose—systems have a purpose—and maybe there’s aspects of the system design that mitigate or deal with some of the limitations in order to make it fit for that purpose.

You mentioned grounding, for example, as one of those methods, but can you just get deeper maybe into grounding and some of the other techniques that you use to, again, turn the model into a system?

Kamar: Yeah, definitely. We have been working with different teams across Microsoft as some of these technologies find their way into products, both understanding the limitations but also helping to overcome those limitations, um, with existing techniques. Of course, there’s a lot to be invented, but right now we still have some things in our capabilities list that we can apply to make these problems mitigated, up to some extent.

Just to give a few examples, right, when we are giving search results, instead of just using GPT-4 to produce that result, we are actually getting better, more accurate results when the top search results are provided as context to the models for them to create their generations. So that is one technique that is currently implemented. That is an example of grounding, grounding with that context. You can imagine that for another application, let’s say for writing an email for you, here we can ground with your past emails written to the same person; we can ground based on your personal documents. For example, if I’m writing you an email about this podcast, you probably have an outline or a document where we have previously discussed some of these ideas. That becomes important grounding so that that email represents my voice, represents my thoughts, but it actually becomes a way for me to just do things faster and more efficiently. So those are some examples of the grounding. The other thing we have in our toolbox these days is how we talk to the model. This is called prompting. A lot of people are talking about prompting because we are discovering new ways to communicate with these models as developers.

If you remember back in the day, um, the way a developer would talk to a machine learning model was giving labeled data. Here’s an example: True, false. Here’s an example: True, false. Now our communication channel with the model in terms of developing systems is increasing. Our bandwidth is so much higher. Now we can talk to the model in natural language.

The problem with this is this is, uh, not a perfect specification. However, still, the way I can instruct the model carries a lot of power. So when we are building systems with prompting, we can tell the model, instruct the model, that whenever the model is talking about a fact, it should cite the source of that material. This has two particular benefits. One benefit is that this is instructing the model that everything the model says should be coming from a source and the links should be there. Of course, I’m not claiming that we are doing this perfectly, but it’s a step in that direction. But second, and even the more important reason is, we are giving people accountability to check. As I said, none of the systems we are trying to build are there to automate the role of the human being.

It is all about complementarity and augmentation and enablement. So when we are building a system, giving results to the human, the goal is always having the human in the driver’s seat, having the human control what is being generated, and by providing sources in the results, that is one way we can enable the user, because then the user can go to these links and check.

These are just some of the things that we are currently inventing as, you know, short-term ideas to mitigate these problems as much as possible. But also we have to think about long-term solutions that can really make these problems go away. We are not there yet, but as a researcher, I’m very excited about the potential.
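The two techniques Kamar describes in this answer, grounding the model with retrieved context and instructing it to cite sources the user can check, can be sketched together. This is a minimal illustration under stated assumptions: the prompt wording, the function names, and the `[n]` citation format are hypothetical and are not the prompts used in any Microsoft product. No model is called; the sketch only builds the prompt and inspects an answer.

```python
import re


def build_grounded_prompt(question, sources):
    """Build a prompt that supplies retrieved sources as context.

    `sources` is a list of (url, snippet) pairs, e.g. top search results.
    The instruction wording here is illustrative only.
    """
    lines = [
        "Answer using only the numbered sources below.",
        "Cite every factual claim with its source number, like [1].",
        "",
    ]
    for i, (url, snippet) in enumerate(sources, start=1):
        lines.append(f"[{i}] {url}")
        lines.append(snippet)
    lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


def cited_sources(answer, sources):
    """Map citation markers in an answer back to URLs the user can check.

    Returns (valid, invalid): a dict of citation number -> URL, and a list of
    citation numbers that do not correspond to any supplied source.
    """
    numbers = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    valid = {n: sources[n - 1][0] for n in numbers if 1 <= n <= len(sources)}
    invalid = [n for n in numbers if n not in valid]
    return valid, invalid
```

Resolving citations back to links supports the accountability point above: the user stays in control and can open each source. A citation number with no matching source is a cheap signal that the answer deserves a closer look.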

Llorens: I’d love to just drill into this notion of specification for a moment. You mentioned the complementarity, you mentioned the intent to have these systems amplify human agency, and with that stewardship of the system comes the expression of intent. And you know, you mentioned maybe even in the era before machine learning, the way to express intent was through a very explicitly written program and, you know, kind of machine learning for more narrow systems, it’s identifying labels for data. And now we have natural language as a means of specification, and you called it an imperfect means of specification. So can you just maybe take us a little deeper into that thought?

Kamar: Yeah. So we have been talking about what we are seeing in the latest models in GPT-4 as a phase transition. We haven’t arrived at the best possible model, and we haven’t arrived at the best possible way to communicate with that model. We are at this very specific point in our history where we are saying, “OK, our models are getting really capable and that communication channel has opened up.

Now I can talk to it in natural language.” I personally don’t think that this very noisy way of just communicating things in natural language as a way of prompts is the final product of how we are going to be talking to our AI systems. However, it is a way, and with iteration, we can become more precise. So let me tell you this.

Let’s say I want this AI system to write me an email to you. The simple prompt could be, “Write me an email to Ashley, and it should talk about this and this.” I can see the result. Immediately, I can see what I don’t like about it. Imagine I could say more specification, right, I can say, “Oh, don’t mention this; include this, as well. The tone should be this way and not that way.”

These are all additional specifications I may not think about when I’m just prompting the model, but over time, I may get better and better in terms of really specifying my preferences, my intent. So right now, we’re in this very noisy process of almost like trial and error. We are trying something, looking at the result; if we don’t like it, we come up with a correction. I think over time we can really compile these experiences—how people are specifying things into these models—and that can pave the way for much better communication methods. Again, we don’t have the answers yet, but I’m also, I’m also not thinking that these prompts are the best way to interact.

Llorens: And as I learn to specify my intent to a particular model, how much does that knowledge or that skill of prompting this model in an effective way translate when I pick up another model or maybe, you know, another iteration on the same model. Do I have to relearn it every time?

Kamar: Ideally not, because we all want to be consistent. Uh, we don’t want our experiences to go away just because we are starting over with a new model. Again, so far, a lot of the model developments have been guided by numbers—how big the models are, how accurate they are, how did they do on certain benchmarks. Now, as these models are enabling real systems for humans, we need to bring in other criteria that are human-centered, that can not only be explained by how well you predict the next word, but it is about what you said. How can I get consistency in the way I communicate with this model? How does this model learn better about me? How this model can capture the right context about me? So I think we are at the beginning of understanding those human-centered considerations we want to have in these models and somehow incorporate them into the way these models are trained.

Llorens: Earlier you mentioned responsibility, you know, that, that Microsoft, you know, has a responsibility, you know, when we put these systems out in the world. As researchers and engineers, um, we have some stewardship of that responsibility in the design process, and throughout the lifecycle. How has that manifested here, you know, for GPT-4 in the applications that you’ve worked on? How does that aspect of responsibility enter into the system design and engineering for you?

Kamar: In a very similar way to how we have been thinking about responsible AI for the last five, six years. It is a journey, and with every model, including GPT-4, the first step is understanding—understanding the capabilities, understanding the limitations, understanding what can go wrong and what we can do in the short term to keep those negative effects as small as possible.

So from the early days of Microsoft’s interaction with GPT-4, uh, me and many of my colleagues have been involved. We started playing with it. We started observing what it can do, what it cannot do, started documenting all of those capabilities. And now you need to take a step back and say, “OK, what can I say about the risks?” Because you observe the instances, but there are these higher-level risks that you should be considerate about. So it became obvious that hallucination was an issue. The other issue is something we call manipulation. The fact that these models don’t have a good understanding of what they don’t know, but at the same time, they can also not admit that they don’t have the right answer, and they may actually even try to convince you as the user that what they are providing is the right one.

So we started thinking what kind of mitigations we can bring in place to make these problems as little as possible. Of course, another consideration is offensive language, biases, content moderation. So that’s another, another factor that a lot of my colleagues have been involved with from the early days. And we worked closely across the company in terms of putting practices in place.

Sometimes this is content moderation modules. Sometimes this is prompt engineering to get hallucinations to be as low as possible. Sometimes it is really thinking about those high-level guidelines you can give to the systems to make these risks as low as possible. So we have been very heavily involved from the beginning, and we are also putting our ideas into publications to share with the wider world, because not everybody—we are aware that not everybody will have as much experience as we have with these models.

So how can we actually capture our experience and share with our academic colleagues so that we can all think about these problems together? So now I think we have some understanding. Again, now this is distilling the longer-term research questions and getting our teams to focus on those.

Llorens: You know, another important phase of the research lifecycle or the system lifecycle is the test and evaluation. So you design a system; you conceptualize it; you develop it. At some point, you know—put some mitigations in place, perhaps like the ones you suggested. Um, at some point, then you have to test it. How does that happen, uh, with these, with this kind of a system, this kind of general-purpose system?

Kamar: Yeah. So, you know, just thinking about traditional machine learning, testing was always a very core part of the way we built machine learning. You would collect some data, you would make part of that data training and you would have part of that data as test set, and then you would have a call to measure for every model you’re building from, from Day 1.

That is no longer the case with these generative models, especially as we get into this “prompt something and you have your application development” culture. There are really big questions about how we evaluate these models. The insight there is that because these models are generative, they can also be used for creating test data. So on the topic of hallucination, for example, we have been using GPT-4 for simulating dialogues fed by, um, queries, common queries, and also get the model to check if some certain risks like hallucinations are happening.

So this is giving us a partly automated, GPT-4–powered evaluation pipeline that, of course, needs to have human eyes on it because not everything the machine generates or validates is always correct. But this gives us a loop to be able to generate data at scale and do evaluation. But, of course, not all problems are equally vital for our society.

There are certain things that carry a lot more weight than others. For example, even on the topic of hallucinations, if a search engine is providing wrong guidance on a critical health query, that is a much bigger risk. So this is why another important part of the evaluation is red teaming. How can we bring human eyes onto the system in the most critical ways and actually get them to check what the systems are doing?

So again, we are at the early days of figuring out what evaluation is going to look like for this new generation of models. Again, human-AI partnership is going to play a key role in the way we evaluate these systems. We see that generative capabilities of these models are powerful for creating data. Human eyes are always going to be important as the final checkers of what is right and what is wrong.

And we just need to build these techniques and make them part of the way we build AI systems with these latest models.
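The partly automated evaluation loop described in this answer can be sketched as follows. It is a minimal sketch, not the pipeline Kamar’s team uses: `simulate`, `check`, and `is_critical` are hypothetical callables supplied by the caller (in practice the first two might be backed by a generative model), and the record fields are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class EvalRecord:
    query: str
    dialogue: str
    flagged: bool             # the automated checker suspects a problem
    needs_human_review: bool  # flagged, or the query is in a critical category


def run_eval(
    queries: Iterable[str],
    simulate: Callable[[str], str],      # hypothetical: generates a dialogue for a query
    check: Callable[[str], bool],        # hypothetical: True if a risk such as hallucination is suspected
    is_critical: Callable[[str], bool],  # hypothetical: True for high-stakes queries, e.g. health
) -> List[EvalRecord]:
    """Generate test dialogues at scale, check them automatically, and route to humans."""
    records = []
    for query in queries:
        dialogue = simulate(query)
        flagged = check(dialogue)
        records.append(EvalRecord(query, dialogue, flagged, flagged or is_critical(query)))
    return records


def flagged_rate(records: List[EvalRecord]) -> float:
    """Fraction of dialogues the automated checker flagged."""
    return sum(r.flagged for r in records) / len(records) if records else 0.0
```

Critical queries go to human review even when the automated check passes. This reflects two points from the answer: machine validation is not always correct, and some errors carry much more weight than others.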

Llorens: I want to ask you about a term, uh, the term agent. Um, you, you kind of referenced it earlier, but I want to come back to it, and I want to come back to it in the context of what your vision for the future is for, I’ll say, AI models and systems that we use, that we create from those models.

What is that vision, and what does that vision have to do with agents?

Kamar: You know, the word agent comes from agency, and the question is what does agency mean for an AI system? It is the fact that they are aware, they can act, and they can learn. So those are the three main capabilities we want to have in our AI systems. Just to go a bit deeper into this: being aware—again, we are building these agents not to act independently in the world. We are building them to partner with people and help people with their tasks. So when we talk about them being aware, we are talking about being aware of their users, being aware of their needs, being aware of their goals, and also being aware of the information in the world so that they don’t have to start from scratch. The other part is action—taking action on behalf of their users.

And here I think we are going to see a lot more interesting scenarios going forward in terms of what the AI systems can do in partnership with people. Right now, we are seeing writing documents, collecting information from the web, and presenting them, but in the future, what other creative things AI systems and humans can do together?

What other tasks are there that you just don’t want to do and you want the AI to take over, with your accountability and control, of course? So that’s the part of the acting we need to figure out. And the other part that is very important is learning. We talked about GitHub Copilot, which is a wonderful AI application that so many people in the world are getting value from.

At the same time, we are not only talking about GitHub Copilot getting better at code completion; we are talking about GitHub Copilot getting better in terms of providing value for people. So in terms of like getting better, we have to figure out what does that human-centered reward we can provide to these AI systems just in terms of the value people get—what has been good, what has been bad—and use that reward signal to teach the machine how to act better in the world. Those are all part of the framework we have for this AI agent. And just to reiterate, this is always going to have these very powerful models as a building block. But as you can imagine, we will need other components to get there.

[MUSIC]

Llorens: Thanks, Ece. Well, I’m certainly excited by the technologies we have today, and I’m excited for the vision that you’ve articulated for the future. So, yeah, really appreciate you sharing that vision with us today, and thanks for spending the time.

Kamar: Thank you.

The post AI Frontiers: Models and Systems with Ece Kamar appeared first on Microsoft Research.

Research Focus: Week of April 10, 2023

Microsoft Research Focus 13 edition, week of April 10, 2023

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

NEW RESEARCH

Snape: Reliable and Low-Cost Computing with Mixture of Spot and On-Demand VMs

To improve the utilization of computing resources, cloud providers often offer underutilized capacity at a discount, but with lower guarantees of availability. However, many customers hesitate to take full advantage of such offerings (such as spot virtual machines), even though they can provide scalability and lower costs for workloads that can handle interruptions.

In a new paper, Snape: Reliable and Low-Cost Computing with Mixture of Spot and On-Demand VMs, researchers from Microsoft propose an intelligent framework to optimize customer cost while maintaining resource availability by dynamically mixing on-demand VMs with spot VMs. Snape is composed of a reliable model for predicting the eviction rate of spot VMs from the production trace and an intelligent constrained reinforcement learning (CRL) framework for learning the best mixture policy, given the predicted eviction rate and other service signals.

This proactive design enables an online decision-making system for dynamically adjusting the mixture of on-demand and spot VMs and ensures that a more aggressive and cheaper policy is only adopted when the reliability is high (low predicted eviction rates of spot VMs). Experiments across different configurations show that Snape achieves 44% savings compared to the policy of using only on-demand VMs, and at the same time, maintains 99.96% availability, 2.77% higher than with a policy of using only spot VMs.
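To make the idea concrete, here is a minimal sketch of a mixture rule that becomes more spot-heavy only when the predicted eviction rate is low. It is not Snape's constrained reinforcement learning policy: a simple linear threshold rule is swapped in for illustration, and the names spot_fraction, plan_mixture, max_spot_fraction, and eviction_ceiling are all hypothetical.

```python
def spot_fraction(predicted_eviction_rate: float,
                  max_spot_fraction: float = 0.9,
                  eviction_ceiling: float = 0.2) -> float:
    """Share of VMs to request as spot, given a predicted eviction rate.

    Hypothetical rule (an assumption, not Snape's learned policy):
    use no spot VMs at or above the ceiling, and scale linearly up to
    max_spot_fraction as the predicted eviction rate approaches zero.
    """
    if not 0.0 <= predicted_eviction_rate <= 1.0:
        raise ValueError("predicted_eviction_rate must be in [0, 1]")
    if predicted_eviction_rate >= eviction_ceiling:
        return 0.0
    return max_spot_fraction * (1.0 - predicted_eviction_rate / eviction_ceiling)


def plan_mixture(total_vms: int, predicted_eviction_rate: float) -> dict:
    """Split a VM request into spot and on-demand counts."""
    spot = int(total_vms * spot_fraction(predicted_eviction_rate))
    return {"spot": spot, "on_demand": total_vms - spot}
```

In this sketch, a request for 100 VMs is served entirely by on-demand VMs when the predicted eviction rate is at or above the ceiling, and shifts toward spot VMs as the prediction improves.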

NEW RESEARCH 

Embracing Noise: How can systems be designed and created with and for noise? 

Noise—as a term used to describe data as not meaningful or useful to a system—is a helpful concept in fields like data science, machine learning, and AI. It can help make data manageable, for example by allowing “noisy” data points to be identified and removed so the data can be streamlined to fit a computational structure. But unlike computer systems, which operate with explicit definitions and discrete structures, people have varying boundaries and perceptions of what is meaningful. This presents choices that involve noise. For example, what specific input will we be expecting and what remaining potential input will be considered noise? What constitutes valid input, and what are the consequences of deciding that something is “invalid”? 

In a new paper, Embracing Data Noise, Microsoft researcher Ida Larsen-Ledet examines the conceptualization, acceptance, and use of noise, including what may be gained from viewing seemingly undesirable output as noise with potential.

When designing computing systems, removing or reducing noise can be the right choice – for example, in safety-critical environments. But noise shouldn’t be uncritically disregarded. If we look at noise in a nuanced way, we may be better able to apply it in useful ways.


NEW RESEARCH

DOTE: Rethinking (Predictive) WAN Traffic Engineering 

Uncertainty about future network traffic trends presents a crucial real-world challenge for routing, especially over wide-area networks where bandwidth is expensive, and applications have stringent quality-of-service requirements. In a new paper, DOTE: Rethinking (Predictive) WAN Traffic Engineering, researchers from Microsoft Research teamed up with researchers from the Hebrew University and the Technion to explore a new design point for traffic engineering on wide-area networks (WANs): directly optimizing traffic flow on the WAN using only historical data. 

The novel algorithmic framework of DOTE combines stochastic optimization and deep learning to identify appropriate routing using as input only historical traffic demands. Intrinsically, the technique picks up on patterns in traffic demands at the scale of large WANs, allowing it to identify high-quality routing without predicting future demands. The research shows this method provably converges to the global optimum in well-studied theoretical models and demonstrates the performance benefits through extensive analyses of empirical data from operational networks, including Microsoft’s backbone network.


OPPORTUNITY 

Predoctoral Research Assistant (contract) – Computational Social Science

Microsoft Research New York City seeks a recent college graduate for a contingent Predoctoral Research Assistant position in computational social science (CSS). Our Predoctoral Research Assistant program is aimed at candidates seeking research experience prior to pursuing a PhD in fields related to CSS. 

Our computational social science group is widely recognized as a leading center of CSS research. Our research lies at the intersection of computer science, statistics, and social sciences, and uses large-scale demographic, behavioral, and network data to investigate human activity and relationships. Apply by May 5 for a one-year assignment beginning in Summer 2023, with a possibility to extend to a total of 18 months. 

The post Research Focus: Week of April 10, 2023 appeared first on Microsoft Research.

Building toward more autonomous and proactive cloud technologies with AI

Vision of AIOps Research with four quadrants (starting in the top left and proceeding clockwise): Autonomous, Proactive, Manageable, Comprehensive

Cloud Intelligence/AIOps blog series

In the first blog post in this series, Cloud Intelligence/AIOps – Infusing AI into Cloud Computing Systems, we presented a brief overview of Microsoft’s research on Cloud Intelligence/AIOps (AIOps), which innovates AI and machine learning (ML) technologies to help design, build, and operate complex cloud platforms and services effectively and efficiently at scale. As cloud computing platforms have continued to emerge as one of the most fundamental infrastructures of our world, both their scale and complexity have grown considerably. In our previous blog post, we discussed the three major pillars of AIOps research: AI for Systems, AI for Customers, and AI for DevOps, as well as the four major research areas that constitute the AIOps problem space: detection, diagnosis, prediction, and optimization. We also envisioned the AIOps research roadmap as building toward creating more autonomous, proactive, manageable, and comprehensive cloud platforms. 

Vision of AIOps Research

  • Autonomous: Fully automate the operation of cloud systems to minimize system downtime and reduce manual efforts.
  • Proactive: Predict future cloud status, support proactive decision-making, and prevent bad things from happening.
  • Manageable: Introduce the notion of tiered autonomy for infusing autonomous routine operations and deep human expertise.
  • Comprehensive: Span AIOps to the full cloud stack for global optimization/management and extend to multi-cloud environments.

Starting with this blog post, we will take a deeper dive into Microsoft’s vision for AIOps research and the ongoing efforts to realize that vision. This blog post will focus on how our researchers leveraged state-of-the-art AIOps research to help make cloud technologies more autonomous and proactive. We will discuss our work to make the cloud more manageable and comprehensive in future blog posts.

Autonomous cloud

Motivation

Cloud platforms require numerous actions and decisions every second to ensure that computing resources are properly managed and failures are promptly addressed. In practice, those actions and decisions are either generated by rule-based systems constructed upon expert knowledge or made manually by experienced engineers. Still, as cloud platforms continue to grow in both scale and complexity, it is apparent that such solutions will be insufficient for the future cloud system. On one hand, rigid rule-based systems, while being knowledge empowered, often involve huge numbers of rules and require frequent maintenance for better coverage and adaptability. Still, in practice, it is often unrealistic to keep such systems up to date as cloud systems expand in both size and complexity, and even more difficult to guarantee consistency and avoid conflicts between all the rules. On the other hand, engineering efforts are very time-consuming, prone to errors, and difficult to scale.

To break the constraints on the coverage and scalability of the existing solutions and improve the adaptability and manageability of the decision-making systems, cloud platforms must shift toward a more autonomous management paradigm. Instead of relying solely on expert knowledge, we need suitable AI/ML models to fuse operational data and expert knowledge together to enable efficient, reliable, and autonomous management decisions. Still, it will take many research and engineering efforts to overcome various barriers for developing and deploying autonomous solutions to cloud platforms.

Toward an autonomous cloud

In the journey toward an autonomous cloud, there are two major challenges. The first challenge lies in the heterogeneity of cloud data. In practice, cloud platforms deploy a huge number of monitors to collect data in various formats, including telemetry signals, machine-generated log files, and human input from engineers and users. The patterns and distributions of those data generally exhibit a high degree of diversity and are subject to change over time. To ensure that the adopted AIOps solutions can function autonomously in such an environment, it is essential to empower the management system with robust and extendable AI/ML models capable of learning useful information from heterogeneous data sources and drawing the right conclusions in various scenarios.

The complex interaction between different components and services presents another major challenge in deploying autonomous solutions. While it can be easy to implement autonomous features for one or a few components/services, how to construct end-to-end systems capable of automatically navigating the complex dependencies in cloud systems presents the true challenge for both researchers and engineers. To address this challenge, it is important to leverage both domain knowledge and data to optimize the automation paths in application scenarios. Researchers and engineers should also implement reliable decision-making algorithms in every decision stage to improve the efficiency and stability of the whole end-to-end decision-making process.

Over the past few years, Microsoft research groups have developed many new models and methods for overcoming those challenges and improving the level of automation in various cloud application scenarios across the AIOps problem spaces. Notable examples include:

  • Detection: Gandalf and ATAD for the early detection of problematic deployments; HALO for hierarchical fault localization; and Onion for detecting incident-indicating logs.
  • Diagnosis: SPINE and UniParser for log parsing; Logic and Warden for regression and incident diagnosis; and CONAN for batch failure diagnosis.
  • Prediction: TTMPred for predicting time to mitigate incidents; LCS for predicting the low-capacity status in cloud servers; and Eviction Prediction for predicting the eviction of spot virtual machines.
  • Optimization: MLPS for optimizing the reallocation of containers; and RESIN for the management of memory leaks in cloud infrastructure.

These solutions not only improve service efficiency and reduce management time with more autonomous design, but also result in higher performance and reliability with fewer human errors. As an illustration of our work toward a more autonomous cloud, we will discuss our exploration for supporting automatic safe deployment services below.

Exemplary scenario: Automatic safe deployment

In online services, the continuous integration and continuous deployment (CI/CD) of new patches and builds are critical for the timely delivery of bug fixes and feature updates. Because new deployments with undetected bugs or incompatibility issues can cause severe service outages and create significant customer impact, cloud platforms enforce strict safe-deployment procedures before releasing each new deployment to the production environments. Such procedures typically involve multi-stage testing and verification in a sequence of canary environments with increasing scopes. When a deployment-related anomaly is identified in one of these stages, the responsible deployment is rolled back for further diagnosis and fixing. Owing to the challenges of identifying deployment-related anomalies with heterogeneous patterns and managing a huge number of deployments, safe-deployment systems administered manually can be extremely costly and error-prone.

To support automatic and reliable anomaly detection in safe deployment, we proposed a general methodology named ATAD for the effective detection of deployment-related anomalies in time-series signals. This method addresses the challenges of capturing changes with various patterns in time-series signals and the lack of labeled anomaly samples due to the heavy cost of labeling. Specifically, this method combines ideas from both transfer learning and active learning to make good use of the temporal information in the input signal and reduce the number of labeled samples required for model training. Our experiments have shown that ATAD can outperform other state-of-the-art anomaly detection approaches, even with only 1%-5% of labeled data.

At the same time, we collaborated with product teams in Azure to develop and deploy Gandalf, an end-to-end automatic safe deployment system that reduces deployment time and increases the accuracy of detecting bad deployments in Azure. As a data-driven system, Gandalf monitors a large array of information, including performance metrics, failure signals, and deployment records. It also detects anomalies in various patterns throughout the entire safe-deployment process. After detecting anomalies, Gandalf applies a vote-veto mechanism to reliably determine whether each detected anomaly is caused by a specific new deployment. Gandalf then automatically decides whether the relevant new deployment should be stopped for a fix or if it's safe enough to proceed to the next stage. After rolling out in Azure, Gandalf has been effective at helping to capture bad deployments, achieving more than 90% precision and nearly 100% recall in production over a period of 18 months.
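The exact rules of the vote-veto mechanism are not described in this post. As one hypothetical reading, the sketch below lets anomalous nodes that received the deployment "vote" to blame it, while anomalous nodes that did not receive it "veto" the blame, since they suggest another cause. NodeObservation, blame_deployment, min_votes, and max_veto_ratio are illustrative names and thresholds, not Gandalf's actual design.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class NodeObservation:
    node: str
    deployed: bool   # did this node receive the new deployment?
    anomalous: bool  # was an anomaly detected on this node?


def blame_deployment(observations: Iterable[NodeObservation],
                     min_votes: int = 3,
                     max_veto_ratio: float = 0.5) -> bool:
    """Return True if the anomalies are attributed to the deployment.

    Assumed rule: require at least min_votes anomalous nodes that
    received the deployment, and allow at most max_veto_ratio vetoes
    per vote from anomalous nodes that did not receive it.
    """
    obs = list(observations)
    votes = sum(1 for o in obs if o.anomalous and o.deployed)
    vetoes = sum(1 for o in obs if o.anomalous and not o.deployed)
    if votes < min_votes:
        return False
    return vetoes <= max_veto_ratio * votes
```

A decision step could then stop the rollout when blame_deployment returns True and let it proceed to the next stage otherwise.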

Flow of Automatic Safe Deployment System

Proactive cloud

Motivation

Traditional decision-making in the cloud focuses on optimizing immediate resource usage and addressing emerging issues. While this reactive design is not unreasonable in a relatively static system, it can lead to short-sighted decisions in a dynamic environment. In cloud platforms, both the demand and utilization of computing resources are undergoing constant changes, including regular periodical patterns, unexpected spikes, and gradual shifts in both temporal and spatial dimensions. To improve the long-term efficiency and reliability of cloud platforms, it is critical to adopt a proactive design that takes the future status of the system into account in the decision-making process.

A proactive design leverages data-driven models to predict the future status of cloud platforms and enable downstream proactive decision-making. Conceptually, a typical proactive decision-making system consists of two modules: a prediction module and a decision-making module, as displayed in the following diagram.

Cloud Platform Prediction Module

In the prediction module, historical data are collected and processed for training and fine-tuning the prediction model for deployment. The deployed prediction model takes in the online data stream and generates prediction results in real time. In the decision-making module, both the current system status and the predicted system status, along with other information such as domain knowledge and past decision history, are considered for making decisions that balance both present and future benefits.
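The two-module structure can be sketched in a few lines. This is a toy illustration under stated assumptions: a moving average stands in for a trained prediction model, and the decision rule simply provisions for the larger of the current and predicted load plus headroom. MovingAveragePredictor, decide_capacity, and headroom are hypothetical names.

```python
import math
from collections import deque


class MovingAveragePredictor:
    """Prediction module stand-in: averages the most recent observations."""

    def __init__(self, window: int = 3) -> None:
        self._history = deque(maxlen=window)  # keeps only the last `window` values

    def observe(self, value: float) -> None:
        """Consume one value from the online data stream."""
        self._history.append(value)

    def predict(self) -> float:
        """Predict the next value as the mean of the recent history."""
        if not self._history:
            raise ValueError("no observations yet")
        return sum(self._history) / len(self._history)


def decide_capacity(current_load: float, predicted_load: float,
                    headroom: float = 1.25) -> int:
    """Decision module stand-in: cover both present and predicted load."""
    return math.ceil(max(current_load, predicted_load) * headroom)
```

A real decision module would also weigh domain knowledge and past decision history, which this sketch leaves out.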

Toward proactive design

Proactive design, while creating new opportunities for improving the long-term efficiency and reliability of cloud systems, does expose the decision-making process to additional risks. On one hand, owing to the inherent randomness in the daily operation of cloud platforms, proactive decisions are always subject to the uncertainty risk from the stochastic elements in both running systems and the environments. On the other hand, the reliability of prediction models adds another layer of risk in making proactive decisions. Therefore, to guarantee the performance of proactive design, engineers must put mechanisms in place to address those risks.

To manage uncertainty risk, engineers need to reformulate the decision-making in proactive design to account for the uncertainty elements. They can often use methodological frameworks, such as prediction+optimization and optimization under chance constraints, to incorporate uncertainties into the target functions of optimization problems. Well-designed ML/AI models can also learn uncertainty from data for improving proactive decisions against uncertainty elements. As for risks associated with the prediction model, modules for improving data quality, including quality-aware feature engineering, robust data imputation, and data rebalancing, should be applied to reduce prediction errors. Engineers should also make continuous efforts to improve and update the robustness of prediction models. Moreover, safeguarding mechanisms are essential to prevent decisions that may cause harm to the cloud system.
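For readers unfamiliar with chance constraints, a generic textbook form is shown here. The symbols are assumptions introduced for illustration and do not come from any specific Microsoft system: x is the decision (for example, a placement of workloads), \xi is a random quantity such as future resource demand, c is the cost to minimize, g \le 0 expresses the requirement (for example, demand not exceeding capacity), and \epsilon is the tolerated probability of violating it.

```latex
\min_{x} \; c(x)
\quad \text{subject to} \quad
\Pr_{\xi}\bigl[\, g(x, \xi) \le 0 \,\bigr] \;\ge\; 1 - \epsilon
```

Instead of requiring the constraint to hold for every possible outcome of \xi, the decision only needs to satisfy it with probability at least 1 - \epsilon, which is how uncertainty enters the optimization problem directly.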

Microsoft’s AIOps research has pioneered the transition from reactive decision-making to proactive decision-making, especially in problem spaces of prediction and optimization. Our efforts not only lead to significant improvement in many application scenarios traditionally supported by reactive decision-making, but also create many new opportunities. Notable proactive design solutions include Narya and Nenya for hardware failure mitigation, UAHS and CAHS for the intelligent virtual machine provisioning, CUC for the predictive scheduling of workloads, and UCaC for bin packing optimization under chance constraints. In the discussion below, we will use hardware failure mitigation as an example to illustrate how proactive design can be applied in cloud scenarios.

Exemplary scenario: Proactive hardware failure mitigation

A key threat to cloud platforms is hardware failure, which can cause interruptions to the hosted services and significantly impact the customer experience. Traditionally, hardware failures are only resolved reactively after the failure occurs, which typically involves temporary interruptions of hosted virtual machines and the repair or replacement of impacted hardware. Such a solution provides limited help in reducing negative customer experiences.

Narya is a proactive disk-failure mitigation service capable of taking mitigation actions before failures occur. Specifically, Narya leverages ML models to predict potential disk failures, and then makes decisions accordingly. To control risks associated with uncertainty, Narya evaluates candidate mitigation actions based on the estimated impacts to customers and chooses actions with minimum impact. A feedback loop also exists for collecting follow-up assessments to improve prediction and decision modules.

Hardware failures in cloud systems are often highly interdependent. Therefore, to reduce the impact of prediction errors, Narya introduces a novel dependency-aware model to encode the dependency relationship between nodes to improve the failure prediction model. Narya also implements an adaptive approach that uses A/B testing and bandit modeling to improve the ability to estimate the impacts of actions. Several safeguarding mechanisms in different stages of Narya are also in place to eliminate the chance of taking unsafe mitigation actions. Implementation of Narya in Azure's production environment has reduced the node hardware interruption rate for virtual machines by more than 26%.
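As a rough illustration of how bandit modeling can refine impact estimates while still choosing low-impact actions, here is a minimal epsilon-greedy sketch. It is a generic bandit written for this explanation, not Narya's algorithm, and the class name ImpactBandit and the action names used with it are hypothetical.

```python
import random
from typing import Iterable, Optional


class ImpactBandit:
    """Epsilon-greedy bandit that prefers the action with the lowest mean impact."""

    def __init__(self, actions: Iterable[str], epsilon: float = 0.1,
                 seed: Optional[int] = None) -> None:
        self._actions = list(actions)
        self._epsilon = epsilon
        self._rng = random.Random(seed)
        self._counts = {a: 0 for a in self._actions}
        # Untried actions start at 0.0 impact, so they look attractive
        # and get tried at least once (optimistic initialization).
        self._mean_impact = {a: 0.0 for a in self._actions}

    def choose(self) -> str:
        """Explore with probability epsilon, otherwise pick the lowest estimate."""
        if self._rng.random() < self._epsilon:
            return self._rng.choice(self._actions)
        return min(self._actions, key=lambda a: self._mean_impact[a])

    def update(self, action: str, observed_impact: float) -> None:
        """Fold one observed customer impact into the running mean."""
        self._counts[action] += 1
        n = self._counts[action]
        self._mean_impact[action] += (observed_impact - self._mean_impact[action]) / n
```

Each follow-up assessment from the feedback loop would be passed to update, so the estimates improve as more mitigation outcomes are observed.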

Narya's Feedback loop

Our recent work, Nenya, is another example of proactive failure mitigation. Under a reinforcement learning framework, Nenya fuses prediction and decision-making modules into an end-to-end proactive decision-making system. It can weigh both mitigation costs and failure rates to better prioritize cost-effective mitigation actions against uncertainty. Moreover, traditional failure mitigation methods usually suffer from data imbalance issues: cases of failure form only a very small portion of all cases, most of which are healthy. Such data imbalance would introduce bias to both the prediction and decision-making process. To address this problem, Nenya adopts a cascading framework to ensure that mitigation decisions are not made with heavy costs. Experiments with Microsoft 365 data sets on database failure have shown that Nenya can reduce both mitigation costs and database failure rates compared with existing methods.

Future work

As management systems become more automated and proactive, it is important to pay special attention to both the safety of cloud systems and the responsibility to cloud customers. The autonomous and proactive decision system will depend heavily on advanced AI/ML models with little manual effort. How to ensure that the decisions made by those approaches are both safe and responsible is an essential question that future work should answer.

The autonomous and proactive cloud relies on effective data usage and feedback loops across all stages in the management and operation of cloud platforms. On one hand, high-quality data on the status of cloud systems are needed to enable downstream autonomous and proactive decision-making systems. On the other hand, it is important to monitor and analyze the impact of each decision on the entire cloud platform in order to improve the management system. Such feedback loops can exist simultaneously for many related application scenarios. Therefore, to better support an autonomous and proactive cloud, a unified data plane responsible for the processing and feedback loop can take a central role in the whole system design and should be a key area of investment.

As such, the future of cloud relies not only on adopting more autonomous and proactive solutions, but also on improving the manageability of cloud systems and the comprehensive infusion of AIOps technologies over all stacks of cloud systems. In future blog posts, we will discuss how to work toward a more manageable and comprehensive cloud.

Stay tuned!

The post Building toward more autonomous and proactive cloud technologies with AI appeared first on Microsoft Research.
