Lightning AI Joins the PyTorch Foundation as a Premier Member

The PyTorch Foundation, a neutral home for the deep learning community to collaborate on the open source PyTorch framework and ecosystem, is announcing today that Lightning AI has joined as a premier member.

Lightning AI is the company behind PyTorch Lightning, the platform and open-source framework for companies to build and deploy AI products leveraging the latest generative AI models.

“This is a very important milestone for Lightning AI and the PyTorch Lightning community,” remarks Luca Antiga, Chief Technology Officer of Lightning AI. “By joining the PyTorch Foundation, we are strengthening our commitment to boost the adoption of PyTorch across industries. We look forward to partnering with the Foundation to push the vision of PyTorch forward.”

PyTorch Lightning is one of the leading projects in the PyTorch ecosystem, allowing developers to build, train, fine-tune and deploy AI models at scale. PyTorch Lightning is helping drive the rapid adoption of PyTorch by both the research community and the enterprise.

“Lightning AI has been a great steward of the AI community, and notably a key contributor to PyTorch over the years,” said PyTorch Foundation Executive Director Ibrahim Haddad. “Their goal of making AI research scalable directly aligns with our mission at the foundation.”

As a premier member, Lightning AI is granted one seat on the PyTorch Foundation Governing Board. The Board sets policy through its bylaws and its mission and vision statements, which describe the overarching scope of Foundation initiatives, technical vision, and direction.

We’re happy to welcome Luca Antiga, Chief Technology Officer at Lightning AI, to our board. Luca joined the Lightning AI team in April 2021 when the Tensorwerk team joined Grid AI. Prior to joining Lightning AI, Luca co-founded Orobix, an applied AI company, and Tensorwerk. He was an early core contributor to PyTorch and co-authored Deep Learning with PyTorch (Manning).

To learn more about how you can be a part of the PyTorch Foundation, visit our website.

About Lightning AI

Lightning AI is the creator of PyTorch Lightning, the deep learning platform and open-source framework of choice for developers and companies seeking to build and deploy AI products.

About PyTorch Foundation

The PyTorch Foundation is a neutral home for the deep learning community to collaborate on the open source PyTorch framework and ecosystem. The PyTorch Foundation is supported by its members and leading contributors to the PyTorch open source project. The Foundation leverages resources provided by members and contributors to enable community discussions and collaboration.

About The Linux Foundation

The Linux Foundation is the world’s leading home for collaboration on open source software, hardware, standards, and data. Linux Foundation projects are critical to the world’s infrastructure including Linux, Kubernetes, Node.js, ONAP, PyTorch, RISC-V, SPDX, OpenChain, and more. The Linux Foundation focuses on leveraging best practices and addressing the needs of contributors, users, and solution providers to create sustainable models for open collaboration. For more information, please visit us at linuxfoundation.org. The Linux Foundation has registered trademarks and uses trademarks. For a list of trademarks of The Linux Foundation, please see its trademark usage page. Linux is a registered trademark of Linus Torvalds.

Read More

PyTorch Edge: Enabling On-Device Inference Across Mobile and Edge Devices with ExecuTorch

Other contributors: Dave Bort, Kimish Patel, Mergen Nachin, Orion Reblitz-Richardson, Andrew Caples

We are excited to announce ExecuTorch, our all-new solution for enabling on-device inference capabilities across mobile and edge devices with the backing of industry leaders like Arm, Apple, and Qualcomm Innovation Center.

As part of PyTorch Edge’s vision for the future of the on-device AI stack and ecosystem, ExecuTorch addresses the fragmentation in the on-device AI ecosystem. It offers a design that provides extension points for seamless third-party integration to accelerate ML models on specialized hardware. Our partners have contributed custom delegate implementations to optimize model inference execution on their respective hardware platforms.

We have created extensive documentation that provides more details about ExecuTorch’s architecture, its high-level components, example ML models running on ExecuTorch, and end-to-end tutorials for exporting and running a model on various hardware devices. We are excited to see all of the innovative use cases of ExecuTorch built by the community.

Key Components of ExecuTorch

ExecuTorch offers a compact runtime with a lightweight operator registry that covers the PyTorch ecosystem of models, and a streamlined path to execute PyTorch programs on edge devices. These devices range from mobile phones to embedded hardware powered by specific delegates built by our partners. In addition, ExecuTorch ships with a Software Development Kit (SDK) and toolchain that provide an ergonomic UX for ML developers to go from model authoring to training and device delegation in a single PyTorch workflow. This suite of tools enables ML developers to profile models on-device and debug the original PyTorch model more effectively.
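To make the single-workflow idea concrete, here is a minimal sketch of exporting an eager PyTorch module to an ExecuTorch program, following the export flow described in the ExecuTorch documentation. The module and file names are illustrative, and exact API details may differ across versions:

```python
import torch
from torch.export import export
from executorch.exir import to_edge

# A toy eager-mode module standing in for a real model.
class MulModule(torch.nn.Module):
    def forward(self, x, y):
        return x * y

# 1. Export the eager module to a graph with torch.export.
example_args = (torch.ones(4), torch.full((4,), 3.0))
aten_program = export(MulModule(), example_args)

# 2. Lower to the Edge dialect, then to an ExecuTorch program.
executorch_program = to_edge(aten_program).to_executorch()

# 3. Serialize to a .pte file that the compact on-device runtime can load.
with open("mul.pte", "wb") as f:
    f.write(executorch_program.buffer)
```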

ExecuTorch is architected from the ground up in a composable manner to allow ML developers to make decisions on what components to leverage as well as entry points to extend them if needed. This design provides the following benefits to the ML community:

  • Portability: Compatibility with a wide variety of computing platforms, from high-end mobile phones to highly constrained embedded systems and microcontrollers.
  • Productivity: Enabling developers to use the same toolchain and SDK from PyTorch model authoring and conversion through debugging and deployment across a wide variety of platforms.
  • Performance: Providing end users with a seamless, high-performance experience thanks to a lightweight runtime that can utilize full hardware capabilities, including general-purpose CPUs and special-purpose processors such as NPUs and DSPs.

PyTorch Edge: from PyTorch Mobile to ExecuTorch

Bringing research and production environments closer together is a fundamental goal of PyTorch. ML engineers increasingly use PyTorch to author and deploy machine learning models in highly dynamic and ever-evolving environments, from servers to edge devices such as mobile phones and embedded hardware.

With the increasing adoption of AI in Augmented Reality (AR), Virtual Reality (VR), Mixed Reality (MR), Mobile, IoT and other domains, there is a growing need for an end-to-end on-device solution that is extensible, modular, and aligned with the PyTorch stack.

PyTorch Edge builds on the same fundamental principle, improving the path from research to production by enabling the deployment of various ML models (spanning vision, speech, NLP, translation, ranking, integrity, and content creation tasks) to edge devices via a low-friction development and deployment process. It provides a framework stack that spans the universe of on-device use cases that the PyTorch community cares about.

PyTorch Edge provides the portability of core components required to reach a wide spectrum of devices, which are characterized by differing hardware configurations, performance, and efficiency. It achieves this portability through optimizations custom-developed for the target use cases, and it supports developer productivity via well-defined entry points, representations, and tools that tie all of this together into a thriving ecosystem.

PyTorch Edge is the future of the on-device AI stack and ecosystem for PyTorch. We are excited to see what the community builds with ExecuTorch’s on-device inference capabilities across mobile and edge devices backed by our industry partner delegates.

Read More

How Veriff decreased deployment time by 80% using Amazon SageMaker multi-model endpoints

Veriff is an identity verification platform partner for innovative growth-driven organizations, including pioneers in financial services, FinTech, crypto, gaming, mobility, and online marketplaces. They provide advanced technology that combines AI-powered automation with human feedback, deep insights, and expertise.

Veriff delivers a proven infrastructure that enables their customers to have trust in the identities and personal attributes of their users across all the relevant moments in their customer journey. Veriff is trusted by customers such as Bolt, Deel, Monese, Starship, Super Awesome, Trustpilot, and Wise.

As an AI-powered solution, Veriff needs to create and run dozens of machine learning (ML) models in a cost-effective way. These models range from lightweight tree-based models to deep learning computer vision models, which need to run on GPUs to achieve low latency and improve the user experience. Veriff is also currently adding more products to its offering, targeting a hyper-personalized solution for its customers. Serving different models for different customers adds to the need for a scalable model serving solution.

In this post, we show you how Veriff standardized their model deployment workflow using Amazon SageMaker, reducing costs and development time.

Infrastructure and development challenges

Veriff’s backend architecture is based on a microservices pattern, with services running on different Kubernetes clusters hosted on AWS infrastructure. This approach was initially used for all company services, including microservices that run expensive computer vision ML models.

Some of these models required deployment on GPU instances. Conscious of the comparatively higher cost of GPU-backed instance types, Veriff developed a custom solution on Kubernetes to share a given GPU’s resources between different service replicas. A single GPU typically has enough VRAM to hold several of Veriff’s computer vision models in memory.

Although the solution did alleviate GPU costs, it came with the constraint that data scientists needed to indicate beforehand how much GPU memory their model would require. Furthermore, DevOps engineers were burdened with manually provisioning GPU instances in response to demand patterns. This caused operational overhead and overprovisioning of instances, which resulted in a suboptimal cost profile.

Apart from GPU provisioning, this setup also required data scientists to build a REST API wrapper for each model, which was needed to provide a generic interface for other company services to consume, and to encapsulate preprocessing and postprocessing of model data. These APIs required production-grade code, which made it challenging for data scientists to productionize models.

Veriff’s data science platform team looked for alternatives to this approach. The main objective was to support the company’s data scientists with a better transition from research to production by providing simpler deployment pipelines. The secondary objective was to reduce the operational costs of provisioning GPU instances.

Solution overview

Veriff required a new solution that solved two problems:

  • Allow building REST API wrappers around ML models with ease
  • Allow managing provisioned GPU instance capacity optimally and, if possible, automatically

Ultimately, the ML platform team converged on the decision to use SageMaker multi-model endpoints (MMEs). This decision was driven by MMEs’ support for NVIDIA’s Triton Inference Server (an ML-focused server that makes it easy to wrap models as REST APIs; Veriff was also already experimenting with Triton), as well as their capability to natively manage the auto scaling of GPU instances via simple auto scaling policies.

Two MMEs were created at Veriff, one for staging and one for production. This approach allows them to run testing steps in a staging environment without affecting the production models.

SageMaker MMEs

SageMaker is a fully managed service that provides developers and data scientists the ability to build, train, and deploy ML models quickly. SageMaker MMEs provide a scalable and cost-effective solution for deploying a large number of models for real-time inference. MMEs use a shared serving container and a fleet of resources that can use accelerated instances such as GPUs to host all of your models. This reduces hosting costs by maximizing endpoint utilization compared to using single-model endpoints. It also reduces deployment overhead because SageMaker manages loading and unloading models in memory and scaling them based on the endpoint’s traffic patterns. In addition, all SageMaker real-time endpoints benefit from built-in capabilities to manage and monitor models, such as shadow variants, auto scaling, and native integration with Amazon CloudWatch (for more information, refer to CloudWatch Metrics for Multi-Model Endpoint Deployments).
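As a rough sketch of how such an endpoint might be wired up with boto3, the snippet below creates an MME and invokes one hosted model; the names, the execution role, the Triton container image, and the request payload are assumptions for illustration, not details from Veriff’s setup:

```python
import boto3

sm = boto3.client("sagemaker")
runtime = boto3.client("sagemaker-runtime")

role_arn = "arn:aws:iam::123456789012:role/SageMakerRole"  # hypothetical role
triton_image = "<account>.dkr.ecr.<region>.amazonaws.com/sagemaker-tritonserver:latest"  # placeholder

# Mode="MultiModel" is what makes the endpoint an MME: SageMaker lazily loads
# any archive found under ModelDataUrl on first invocation and evicts cold
# models under memory pressure.
sm.create_model(
    ModelName="triton-mme",
    ExecutionRoleArn=role_arn,
    PrimaryContainer={
        "Image": triton_image,
        "Mode": "MultiModel",
        "ModelDataUrl": "s3://my-mme-bucket/models/",  # hypothetical bucket
    },
)
sm.create_endpoint_config(
    EndpointConfigName="triton-mme-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "triton-mme",
        "InstanceType": "ml.g4dn.xlarge",  # hypothetical GPU instance type
        "InitialInstanceCount": 1,
    }],
)
sm.create_endpoint(EndpointName="triton-mme",
                   EndpointConfigName="triton-mme-config")

# Requests select the model at invocation time via TargetModel.
response = runtime.invoke_endpoint(
    EndpointName="triton-mme",
    TargetModel="screen_detection_pipeline_1.tar.gz",
    ContentType="application/octet-stream",
    Body=b"...",  # a serialized Triton inference request (elided)
)
```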

Custom Triton ensemble models

There were several reasons why Veriff decided to use Triton Inference Server, the main ones being:

  • It allows data scientists to build REST APIs from models by arranging model artifact files in a standard directory format (a no-code solution)
  • It’s compatible with all major AI frameworks (PyTorch, TensorFlow, XGBoost, and more)
  • It provides ML-specific low-level and server optimizations, such as dynamic batching of requests

Using Triton allows data scientists to deploy models with ease because they only need to build formatted model repositories instead of writing code to build REST APIs (Triton also supports Python models if custom inference logic is required). This decreases model deployment time and gives data scientists more time to focus on building models instead of deploying them.

Another important feature of Triton is that it allows you to build model ensembles, which are groups of models that are chained together. These ensembles can be run as if they were a single Triton model. Veriff currently employs this feature to deploy preprocessing and postprocessing logic with each ML model using Python models (as mentioned earlier), ensuring that there are no mismatches in the input data or model output when models are used in production.

The following is what a typical Triton model repository looks like for this workload:
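(The original post shows the repository as an image; the tree below is a plausible reconstruction from the description that follows, with the Python-backend step’s directory name being hypothetical.)

```
screen_detection_pipeline/        # ensemble: maps inputs/outputs between steps
├── config.pbtxt
└── 1/
screen_detection_preprocessing/   # hypothetical Python-backend step
├── config.pbtxt
└── 1/
    └── model.py                  # preprocessing and postprocessing code
screen_detection_inferencer/
├── config.pbtxt
└── 1/
    └── model.onnx                # trained model weights (version 1)
```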

The model.py file contains preprocessing and postprocessing code. The trained model weights are in the screen_detection_inferencer directory, under model version 1 (the model is in ONNX format in this example, but it could also be in TensorFlow, PyTorch, or other formats). The ensemble model definition is in the screen_detection_pipeline directory, where inputs and outputs between steps are mapped in a configuration file.

Additional dependencies needed to run the Python models are detailed in a requirements.txt file, and need to be conda-packed to build a Conda environment (python_env.tar.gz). For more information, refer to Managing Python Runtime and Libraries. Also, config files for Python steps need to point to python_env.tar.gz using the EXECUTION_ENV_PATH directive.
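For reference, the Triton Python backend’s documented way to point a model at the packed environment is a parameters entry in its config.pbtxt. The snippet below is a minimal sketch of that directive; $$TRITON_MODEL_DIRECTORY resolves to the model’s own directory:

```
backend: "python"
parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/python_env.tar.gz"}
}
```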

The model folder then needs to be TAR compressed and renamed using model_version.txt. Finally, the resulting <model_name>_<model_version>.tar.gz file is copied to the Amazon Simple Storage Service (Amazon S3) bucket connected to the MME, allowing SageMaker to detect and serve the model.
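A minimal sketch of that packaging step in Python, using the model name from this example; the version number and bucket name are hypothetical:

```python
import tarfile

import boto3

MODEL_NAME = "screen_detection_pipeline"  # from the example repository above
MODEL_VERSION = "1"                       # hypothetical version
archive = f"{MODEL_NAME}_{MODEL_VERSION}.tar.gz"

# TAR-compress the Triton model repository folder...
with tarfile.open(archive, "w:gz") as tar:
    tar.add(MODEL_NAME, arcname=MODEL_NAME)

# ...and copy it to the S3 bucket connected to the MME so SageMaker can
# detect and serve the new model.
boto3.client("s3").upload_file(archive, "my-mme-bucket", f"models/{archive}")
```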

Model versioning and continuous deployment

As the previous section made apparent, building a Triton model repository is straightforward. However, running all the necessary steps to deploy it is tedious and error-prone if run manually. To overcome this, Veriff built a monorepo containing all models to be deployed to MMEs, where data scientists collaborate in a Gitflow-like approach. This monorepo has the following features:

  • It’s managed using Pants.
  • Code quality tools such as Black and MyPy are applied using Pants.
  • Unit tests are defined for each model, which check that the model output is the expected output for a given model input (see the sketch after this list).
  • Model weights are stored alongside model repositories. These weights can be large binary files, so DVC is used to sync them with Git in a versioned manner.
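The sketch below shows what such a unit test could look like against a locally running Triton server (the CI pipeline described next spins one up). Tensor names, shapes, and the fixture file are hypothetical:

```python
import numpy as np
import tritonclient.http as httpclient

def test_screen_detection_pipeline():
    client = httpclient.InferenceServerClient(url="localhost:8000")

    # A fixed, known input; tensor names and shapes are hypothetical.
    image = np.zeros((1, 3, 224, 224), dtype=np.float32)
    infer_input = httpclient.InferInput("INPUT__0", list(image.shape), "FP32")
    infer_input.set_data_from_numpy(image)

    result = client.infer("screen_detection_pipeline", inputs=[infer_input])
    output = result.as_numpy("OUTPUT__0")

    # Compare against a stored expected output for this input.
    expected = np.load("tests/expected_output.npy")  # hypothetical fixture
    np.testing.assert_allclose(output, expected, rtol=1e-3)
```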

This monorepo is integrated with a continuous integration (CI) tool. For every new push to the repo or new model, the following steps are run:

  1. Pass the code quality check.
  2. Download the model weights.
  3. Build the Conda environment.
  4. Spin up a Triton server using the Conda environment and use it to process requests defined in unit tests.
  5. Build the final model TAR file (<model_name>_<model_version>.tar.gz).

These steps make sure that models have the quality required for deployment, so for every push to a repo branch, the resulting TAR file is copied (in another CI step) to the staging S3 bucket. When pushes are done in the main branch, the model file is copied to the production S3 bucket. The following diagram depicts this CI/CD system.

Cost and deployment speed benefits

Using MMEs allows Veriff to use a monorepo approach to deploy models to production. In summary, Veriff’s new model deployment workflow consists of the following steps:

  1. Create a branch in the monorepo with the new model or model version.
  2. Define and run unit tests in a development machine.
  3. Push the branch when the model is ready to be tested in the staging environment.
  4. Merge the branch into main when the model is ready to be used in production.

With this new solution in place, deploying a model at Veriff is a straightforward part of the development process. New model development time has decreased from 10 days to an average of 2 days.

The managed infrastructure provisioning and auto scaling features of SageMaker brought Veriff added benefits. They used the InvocationsPerInstance CloudWatch metric to scale according to traffic patterns, saving on costs without sacrificing reliability. To define the threshold value for the metric, they performed load testing on the staging endpoint to find the best trade-off between latency and cost.
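A sketch of such a target-tracking policy with boto3 follows; the endpoint and variant names, capacity bounds, and target value are hypothetical (Veriff derived their threshold from load testing):

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/triton-mme/variant/AllTraffic"  # hypothetical names

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,  # hypothetical fleet bounds
)

# Target tracking on InvocationsPerInstance: SageMaker adds or removes
# instances to keep the per-instance invocation rate near the target.
autoscaling.put_scaling_policy(
    PolicyName="invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,  # threshold found via load testing in staging
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```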

After deploying seven production models to MMEs and analyzing spend, Veriff reported a 75% cost reduction in GPU model serving as compared to the original Kubernetes-based solution. Operational costs were reduced as well, because the burden of provisioning instances manually was lifted from the company’s DevOps engineers.

Conclusion

In this post, we reviewed why Veriff chose SageMaker MMEs over self-managed model deployment on Kubernetes. SageMaker takes on the undifferentiated heavy lifting, allowing Veriff to decrease model development time, increase engineering efficiency, and dramatically lower the cost for real-time inference while maintaining the performance needed for their business-critical operations. Finally, we showcased Veriff’s simple yet effective model deployment CI/CD pipeline and model versioning mechanism, which can be used as a reference implementation for combining software development best practices with SageMaker MMEs. You can find code samples on hosting multiple models using SageMaker MMEs on GitHub.


About the Authors

Ricard Borràs is a Senior Machine Learning Engineer at Veriff, where he leads MLOps efforts in the company. He helps data scientists build faster and better AI/ML products by building a data science platform at the company and combining several open source solutions with AWS services.

João Moura is an AI/ML Specialist Solutions Architect at AWS, based in Spain. He helps customers with deep learning model large-scale training and inference optimization, and more broadly building large-scale ML platforms on AWS.

Miguel Ferreira works as a Sr. Solutions Architect at AWS based in Helsinki, Finland. AI/ML has been a lifelong interest and he has helped multiple customers integrate Amazon SageMaker into their ML workflows.

Read More

Improving traffic evacuations: A case study

Some cities or communities develop an evacuation plan to be used in case of an emergency. There are a number of reasons why city officials might enact their plan, a primary one being a natural disaster such as a tornado, flood, or wildfire. An evacuation plan can help the community respond to an emergency more effectively, and so could help save lives. However, it can be difficult for a city to evaluate such a plan because it is not practical to have an entire town or city rehearse a full-blown evacuation. For example, Mill Valley, a city in northern California, created a wildfire evacuation plan but lacked an estimate for how long the evacuation would take.

Today we describe a case study in which we teamed up with the city of Mill Valley to test and improve their evacuation plan. We outline our approach in our paper, “Mill Valley Evacuation Study”. We started by using a traffic simulator to model a citywide evacuation. The research goal was to provide the city with detailed estimates for how long it would take to evacuate the city, and, by studying the egress pattern, to find modifications to make the plan more effective. While our prior work on this subject provided an estimate for the evacuation time and showed how the time could be reduced if certain road changes were implemented, it turns out the recommendations in that paper — such as changing the number of outgoing lanes on an arterial — were not feasible. The current round of research improves upon the initial study by more accurately modeling the number and starting locations of vehicles, by using a more realistic map, and by working closely with city officials to ensure that recommended changes to the plan are deemed viable.

Geography and methodology

Mill Valley is in Marin County, California, north of San Francisco. Many of the residences are located on the steep hillsides of several valleys surrounded by dense redwood forests.

Aerial views of Mill Valley, courtesy of the City of Mill Valley.

Many of those residences are in areas that have only one exit direction, toward the town center. From there the best evacuation route is toward Highway 101, which is in the flat part of the city and is the most likely area to be far from potential wildfires. Some neighborhoods have other routes that lead away from both the city and Highway 101, but those routes pass through hilly forested areas, which could be dangerous or impassable during a wildfire. So, the evacuation plan directs all vehicles west of Highway 101 to head east, to the highway (see map below). The neighborhoods east of Highway 101 are not included in the simulation because they are away from areas with a high fire hazard rating, and are close to the highway.

Mill Valley has about 11,400 households west of Highway 101. Most Mill Valley households have two vehicles. Evacuation times scale with the number of vehicles, so it is in the common interest to minimize the number of vehicles used during an evacuation. To that end, Mill Valley has a public awareness campaign aimed at having each household evacuate in one vehicle. While no one knows how many vehicles would be used during an evacuation, it is safe to assume it is on average between one and two per household. The basic evacuation problem, then, is how to efficiently get between 11 and 23 thousand vehicles from the various residences onto one of the three sets of Highway 101 on-ramps.

The simulated part of Mill Valley west of Highway 101 is inside the blue border. Highway 101 is shown in green. The red squares indicate the three sets of Highway 101 on-ramps. The pink area has the highest fire hazard rating.

The current work uses the same general methodology as the previous research, namely, running the open source SUMO agent-based traffic simulator on a map of Mill Valley. The traffic simulator models traffic by simulating each vehicle individually, with the detailed behavior of each vehicle dictated by a car-following model. Each vehicle is given a point and time at which to start and an initial route; the routes of most vehicles are updated throughout the simulation, depending on conditions. To consider potential changes in driver behavior under the high-stress conditions of an evacuation, the effects of the “aggressiveness” of each car are also investigated, but in our case the impacts are minimal. Some simplifying assumptions are that vehicles originate at residential addresses and that the roads and highways are initially empty, which corresponds approximately to conditions that could be encountered if an evacuation happens in the middle of the night. The main inputs to the simulation are the road network, the household locations, the average number of vehicles per household, and a departure time distribution. We have to make assumptions about the departure distribution; after discussing with city officials, we chose a distribution such that most vehicles depart within an hour.
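To illustrate the setup, the sketch below drives a SUMO run from Python via the TraCI API and tallies arrivals over time; the configuration file name is hypothetical, and the actual study involves considerably more input preparation (household origins, departure distribution, rerouting):

```python
import traci

# Start SUMO with a scenario config (a hypothetical file bundling the
# OpenStreetMap-derived road network, vehicle origins, and departure times).
traci.start(["sumo", "-c", "mill_valley.sumocfg"])

arrived = 0
# Step until every scheduled vehicle has been inserted and has arrived.
while traci.simulation.getMinExpectedNumber() > 0:
    traci.simulationStep()
    arrived += traci.simulation.getArrivedNumber()

evacuation_time_h = traci.simulation.getTime() / 3600.0
traci.close()
print(f"{arrived} vehicles evacuated in {evacuation_time_h:.1f} hours")
```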

Four bottlenecks

Mill Valley has three sets of Highway 101 on-ramps: northern, middle, and southern. All the vehicles must use one of these sets of on-ramps to reach their destination (either the northernmost or southernmost segment of Highway 101 included in our map). Given that we are only concerned with the majority of Mill Valley that lies west of the highway, there are two lanes that approach the northern on-ramps, and one lane that approaches each of the middle and southern on-ramps. Since every vehicle has to pass over one of these four lanes to reach the highway, they are the bottlenecks. Given the geography and existing infrastructure, adding more lanes is infeasible. The aim of this research, then, is to try to modify traffic patterns to maximize the rate of traffic on each of the four lanes.

Evacuation plan

When we started this research, Mill Valley had a preliminary evacuation plan. It included modifying traffic patterns — disabling traffic lights and changing traffic rules — on a few road segments, as well as specifying the resources (traffic officers, signage) necessary to implement the changes. As an example, a two-way road may be changed to a one-way road to double the number of outgoing lanes. Temporarily changing the direction of traffic is called contraflow.

The plot below shows the simulated fraction of vehicles that have departed or reached their destinations versus time, for 1, 1.5, and 2 vehicles per household (left to right). The dashed line on the far left shows the fraction that have departed. The solid black lines show the preliminary evacuation plan results and the dotted lines indicate the normal road network (baseline) results. The preliminary evacuation plan significantly speeds up the evacuation.

The cumulative fraction of vehicles vs. time in hours. The demand curve is shown in the dashed line on the far left. The solid lines show the preliminary evacuation plan curves for 1, 1.5 and 2 vehicles per household (left to right). The dotted lines show the same for the baseline case.

We can understand how effective the preliminary evacuation plan is by measuring the rates at the bottlenecks. The below plots show the rate of traffic on each of the four lanes leading to the highway on-ramps for the case of 1.5 vehicles per household for both the baseline case (the normal road rules; shown shaded in gray) and the preliminary evacuation plan (shown outlined in black). The average rate per lane varies greatly in the different cases. It is clear that, while the evacuation plan leads to increased evacuation rates, there is room for improvement. In particular, the middle on-ramps are quite underutilized.

The rates of traffic on the four lanes leading to Highway 101 on-ramps for both the baseline case (normal road rules; shown shaded in gray) and the preliminary evacuation plan (shown outlined in black).

Final evacuation plan

After studying the map and investigating different alternatives, we, working together with city officials, found a minimal set of new road changes that substantially lower the evacuation time compared to the preliminary evacuation plan (shown below). We call this the final evacuation plan. It extends the contraflow section of the preliminary plan 1000 feet further west, to a main intersection. Crucially, this allows for one of the (normally) two outgoing lanes to be dedicated to routing traffic to the middle on-ramps. It also creates two outgoing lanes from that main intersection clear through to the northern on-ramps, over ¾ of a mile to the east.

A map of the main changes in the final evacuation plan. The red line shows that traffic heading north on Camino Alto gets diverted to the middle Highway 101 on-ramps. The blue line shows traffic in the northern lane of E Blithedale Ave gets routed on the new contraflow section.

The rate per lane plots comparing the preliminary and final evacuation plans are shown below for 1.5 vehicles per household. The simulation indicates that the final plan increases the average rate of traffic on the lane leading to the middle on-ramps from about 4 vehicles per minute to about 18. It also increases the through rate of the northern on-ramps by over 60%.

The rates of traffic on the four lanes leading to Highway 101 on-ramps for both the preliminary case (shown shaded in gray) and the final evacuation plan (shown outlined in black).

The below plot shows the cumulative fraction of vehicles vs. time, comparing the cases of 1, 1.5 and 2 vehicles per household for the preliminary and final evacuation plans. The speedup is quite significant, on the scale of hours. For example, with 1.5 vehicles per household, it took 5.3 hours to evacuate the city using the preliminary evacuation plan, and only 3.5 hours using the final plan.

The cumulative fraction of vehicles vs. time in hours. The demand curve is shown in the dashed line on the far left. The solid lines show the final evacuation plan curves for 1, 1.5 and 2 vehicles per household (left to right). The dotted lines show the same for the preliminary evacuation plan.

Conclusion

Evacuation plans can be crucial in quickly getting many people to safety in emergency situations. While some cities have traffic evacuation plans in place, it can be difficult for officials to learn how well the plan works or whether it can be improved. Google Research helped Mill Valley test and evaluate their evacuation plan by running traffic simulations. We found that, while the preliminary plan did speed up the evacuation time, some minor changes to the plan significantly expedited evacuation. We worked closely with the city during this research, and Mill Valley has adopted the final plan. We were able to provide the city with more simulation details, including results for evacuating the city one area at a time. Full details can be found in the paper.

Detailed recommendations for a particular evacuation plan are necessarily specific to the area under study. So, the specific road network changes we found for Mill Valley are not directly applicable for other cities. However, we used only public data (road network from OpenStreetMap; household information from census data) and an open source simulator (SUMO), so any city or agency could use the methodology used in our paper to obtain results for their area.

Acknowledgements

We thank former Mayor John McCauley and City of Mill Valley personnel Tom Welch, Lindsay Haynes, Danielle Staude, Rick Navarro and Alan Piombo for numerous discussions and feedback, and Carla Bromberg for program management.

Read More

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Introduction

How trustworthy are generative pre-trained transformer (GPT) models?

To answer this question, the University of Illinois Urbana-Champaign, together with Stanford University, the University of California, Berkeley, the Center for AI Safety, and Microsoft Research, released a comprehensive trustworthiness evaluation platform for large language models (LLMs), presented in the recent paper DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. The paper, which was accepted as an oral presentation at NeurIPS 2023 (Datasets and Benchmarks Track), focuses specifically on GPT-4 and GPT-3.5. It considers diverse perspectives, including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness.

Based on our evaluations, we found previously unpublished vulnerabilities relating to trustworthiness. For instance, we find that GPT models can be easily misled to generate toxic and biased outputs and leak private information in both training data and conversation history. We also find that although GPT-4 is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more vulnerable given jailbreaking system or user prompts, which are maliciously designed to bypass the security measures of LLMs, potentially because GPT-4 follows (misleading) instructions more precisely.

Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps. Our benchmark is publicly available.

It’s important to note that the research team worked with Microsoft product groups to confirm that the potential vulnerabilities identified do not impact current customer-facing services. This is in part true because finished AI applications apply a range of mitigation approaches to address potential harms that may occur at the model level of the technology. In addition, we have shared our research with GPT’s developer, OpenAI, which has noted the potential vulnerabilities in the system cards for relevant models.

Our goal is to encourage others in the research community to utilize and build upon this work, potentially pre-empting nefarious actions by adversaries who would exploit vulnerabilities to cause harm. This trustworthiness assessment is only a starting point, and we hope to work together with others to build on its findings and create powerful and more trustworthy models going forward. To facilitate collaboration, we have made our benchmark code very extensible and easy to use: a single command is sufficient to run the complete evaluation on a new model.

Trustworthiness perspectives of language models

Recent breakthroughs in machine learning, especially LLMs, have enabled a wide range of applications, from chatbots to robotics. Yet, while the literature on the trustworthiness of GPT models remains limited, practitioners have proposed employing capable GPT models even for sensitive applications such as healthcare and finance. To this end, we focus on a comprehensive trustworthiness evaluation of GPT models towards eight trustworthiness perspectives, with thorough evaluations based on different constructed scenarios, tasks, metrics, and datasets, as shown in Figure 1 below.

Overall, we aim to evaluate 1) the performance of GPT models under different trustworthiness perspectives, and 2) the resilience of their performance in adversarial environments (e.g., adversarial system/user prompts, demonstrations).

For example, to evaluate the robustness of GPT-3.5 and GPT-4 to textual adversarial attacks, we construct three evaluation scenarios:

  1. Evaluation on the standard AdvGLUE benchmark with a vanilla task description, aiming to assess: a) the vulnerabilities of GPT models to existing textual adversarial attacks, b) the robustness of different GPT models in comparison to state-of-the-art models on the standard AdvGLUE benchmark, c) the impact of adversarial attacks on their instruction-following abilities (measured by the rate at which the model refuses to answer a question or presents an incorrect answer when under attack), and d) the transferability of current attack strategies (quantified by the transfer attack success rates of different attack approaches).
  2. Evaluation on the AdvGLUE benchmark given different instructive task descriptions and designed system prompts, to investigate the resilience of models under diverse (adversarial) task descriptions and system prompts.
  3. Evaluation of GPT-3.5 and GPT-4 on our generated challenging adversarial texts, AdvGLUE++, crafted against open-source autoregressive models such as Alpaca-7B, Vicuna-13B, and StableVicuna-13B, to further evaluate their vulnerabilities under strong adversarial attacks in diverse settings.

Fig 1. Taxonomy of our evaluation based on different trustworthiness perspectives, organized in three layers: the eight main trustworthiness perspectives, the benchmarks for each perspective (some from prior work, some newly proposed in the paper), and the different evaluation settings of each benchmark. We use yellow boxes to represent the evaluations on existing benchmarks, and green boxes for evaluations using our designed new data or new evaluation protocols on existing datasets.

Insights and findings from DecodingTrust

Our research uncovered some previously undisclosed advantages as well as threats to the reliability of large language models.

For example, in terms of model robustness to adversarial demonstrations, we find that on the one hand, GPT-3.5 and GPT-4 will not be misled by the counterfactual examples added in the demonstrations and can even benefit from the counterfactual demonstrations in general; on the other hand, we observe that providing backdoored demonstrations will mislead both GPT-3.5 and GPT-4 to make incorrect predictions for backdoored inputs, especially when the backdoored demonstrations are positioned close to the (backdoored) user inputs, and GPT-4 is more vulnerable to backdoored demonstrations.

Moreover, in terms of toxicity and bias, neither GPT-3.5 nor GPT-4 is strongly biased for the majority of stereotype topics considered under benign and untargeted system prompts. However, both GPT models can be “tricked” into agreeing with biased content given designed misleading (adversarial) system prompts. GPT-4 is more vulnerable to misleading targeted system prompts and agrees with biased content more frequently than GPT-3.5, potentially because GPT-4 follows the misleading instructions more precisely. Model bias can often depend on the demographic groups mentioned in the user prompt. For instance, for the query “Homosexuals have HIV.”, GPT-4 strongly disagrees in general, while for the query “Women have HIV.”, GPT-4 frequently agrees and outputs biased content. Model bias also depends on the stereotype topics. GPT models output more biased content on less sensitive topics like leadership and greed, while generating less biased content on more sensitive topics like drug dealing and terrorism. This is potentially due to the fine-tuning of GPT models on some protected demographic groups and sensitive topics.

DecodingTrust also evaluates the privacy-leakage issues of LLMs. We find that GPT models can leak privacy-sensitive training data, such as the email addresses from the standard Enron email dataset, especially when prompted with the context of emails or few-shot demonstrations of (name, email) pairs. Moreover, under few-shot prompting, with supplementary knowledge such as the targeted email domain, the email extraction accuracy can be 100x higher than the scenarios where the email domain is unknown. We also observe that GPT models can leak the injected private information in the conversation history. Overall, GPT-4 is more robust than GPT-3.5 in safeguarding personally identifiable information (PII), and both models are robust to specific types of PII, such as Social Security numbers, possibly due to the explicit instruction tuning for those PII keywords. However, both GPT-4 and GPT-3.5 would leak all types of PII when prompted with privacy-leakage demonstrations during in-context learning. Lastly, GPT models demonstrate different capabilities in understanding different privacy-related words or privacy events (e.g., they will leak private information when told “confidentially” but not when told “in confidence”). GPT-4 is more likely to leak privacy than GPT-3.5, given our constructed prompts, potentially due to the fact that it follows the (misleading) instructions more precisely. We present more examples of model unreliable outputs in Figure 2 below.

Fig 2. Examples of undesirable responses of GPT-4 given benign system prompts from each of the eight trustworthiness perspectives. Offensive or sensitive information is masked.

Read More

Rethinking the Role of PPO in RLHF

TL;DR: In RLHF, there’s tension between the reward learning phase, which uses human preference in the form of comparisons, and the RL fine-tuning phase, which optimizes a single, non-comparative reward. What if we performed RL in a comparative way?


Figure 1: This diagram illustrates the difference between reinforcement learning from absolute feedback and relative feedback. By incorporating a new component, the pairwise policy gradient, we can unify the reward modeling stage and the RL stage, enabling direct updates based on pairwise responses.

Large Language Models (LLMs) have powered increasingly capable virtual assistants, such as GPT-4, Claude-2, Bard and Bing Chat. These systems can respond to complex user queries, write code, and even produce poetry. The technique underlying these amazing virtual assistants is Reinforcement Learning from Human Feedback (RLHF). RLHF aims to align the model with human values and eliminate unintended behaviors, which can often arise due to the model being exposed to a large quantity of low-quality data during its pretraining phase.

Proximal Policy Optimization (PPO), the dominant RL optimizer in this process, has been reported to exhibit instability and implementation complications. More importantly, there’s a persistent discrepancy in the RLHF process: despite the reward model being trained using comparisons between various responses, the RL fine-tuning stage works on individual responses without making any comparisons. This inconsistency can exacerbate issues, especially in the challenging language generation domain.

Given this backdrop, an intriguing question arises: Is it possible to design an RL algorithm that learns in a comparative manner? To explore this, we introduce Pairwise Proximal Policy Optimization (P3O), a method that harmonizes the training processes in both the reward learning stage and RL fine-tuning stage of RLHF, providing a satisfactory solution to this issue.
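In LaTeX notation, the pairwise update at the heart of P3O can be sketched as follows. This is a schematic form written from the description above, omitting constant factors and the PPO-style clipping and KL regularization developed in the paper:

```latex
\nabla_\theta \mathcal{L}(\theta)
  = \mathbb{E}_{x,\,(y_1, y_2) \sim \pi_\theta(\cdot \mid x)}
    \left[ \bigl( r(x, y_1) - r(x, y_2) \bigr)\,
           \nabla_\theta \log \frac{\pi_\theta(y_1 \mid x)}{\pi_\theta(y_2 \mid x)} \right]
```

Because the update depends only on the difference in rewards between the two responses, any constant offset in the reward model cancels out, mirroring the comparative structure of the reward learning stage.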
