Content moderation design patterns with AWS managed AI services

User-generated content (UGC) is growing exponentially, and with it the requirements and cost of keeping content and online communities safe and compliant. Modern web and mobile platforms, from startups to large organizations, fuel businesses and drive user engagement through social features. Online community members expect safe and inclusive experiences where they can freely consume and contribute images, videos, text, and audio. The ever-increasing volume, variety, and complexity of UGC make traditional human moderation workflows challenging to scale to protect users. These limitations force customers into inefficient, expensive, and reactive mitigation processes that carry unnecessary risk for users and the business. The result is a poor, harmful, and non-inclusive community experience that disengages users, negatively impacting community and business objectives.

The solution is scalable content moderation workflows that rely on artificial intelligence (AI), machine learning (ML), deep learning (DL), and natural language processing (NLP) technologies. These technologies translate, transcribe, recognize, detect, mask, and redact content, and strategically bring human talent into the moderation workflow to run the actions needed to keep users safe and engaged while increasing accuracy, improving process efficiency, and lowering operational costs.

This post reviews how to build content moderation workflows using AWS AI services. To learn more about business needs, impact, and cost reductions that automated content moderation brings to social media, gaming, e-commerce, and advertising industries, see Utilize AWS AI services to automate content moderation and compliance.

Solution overview

You don’t need ML expertise or a data science team to implement these workflows, and you can tailor these patterns to your specific business needs! AWS delivers these capabilities through fully managed services that remove operational complexity and undifferentiated heavy lifting.

In this post, we demonstrate how to efficiently moderate spaces where customers discuss and review products using text, audio, images, video, and even PDF files. The following diagram illustrates the solution architecture.

Abstract diagram showing how AWS AI services come together.

Prerequisites

These patterns follow a serverless methodology by default, so you only pay for what you use. You continue paying for compute resources, such as AWS Fargate containers, and storage, such as Amazon Simple Storage Service (Amazon S3), until you delete those resources. The AWS AI services discussed here also follow a consumption-based, per-operation pricing model.

You can test each of these patterns in non-production environments within the AWS Free Tier, assuming your account is eligible.

Moderate plain text

First, you need to implement content moderation for plain text. This procedure serves as the foundation for more sophisticated media types and entails two high-level steps:

  1. Translate the text.
  2. Analyze the text.

Global customers want to collaborate with social platforms in their native language. Meeting this expectation can add complexity because design teams must construct a workflow or steps for each language. Instead, you can use Amazon Translate to convert text to over 70 languages and variants in over 15 regions. This capability enables you to write analysis rules for a single language and apply those rules across the global online community.

Amazon Translate is a neural machine translation service that delivers fast, high-quality, affordable, and customizable language translation. You can integrate it into your workflows to detect the dominant language and translate the text. The following diagram illustrates the workflow.

State machine for normalizing text

The APIs operate as follows:

  • The DetectDominantLanguage API examines the input text and identifies the dominant language with a confidence score.
  • The TranslateText API translates the input text from the source language to the target language.
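As a concrete illustration, the following is a minimal boto3 sketch of this normalization step; the normalize_text helper and the English target language are assumptions for this example, not part of the AWS APIs.

```python
# A hedged sketch: detect the dominant language of a piece of UGC, then
# translate it to a single analysis language. AWS credentials are assumed
# to be configured in the environment.
import boto3

comprehend = boto3.client("comprehend")
translate = boto3.client("translate")

def normalize_text(text: str, target_language: str = "en") -> str:
    """Detect the dominant language, then translate to the target language."""
    detection = comprehend.detect_dominant_language(Text=text)
    # Languages come back with confidence scores; take the most likely one.
    source_language = detection["Languages"][0]["LanguageCode"]
    if source_language == target_language:
        return text
    result = translate.translate_text(
        Text=text,
        SourceLanguageCode=source_language,
        TargetLanguageCode=target_language,
    )
    return result["TranslatedText"]
```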

Next, you can use NLP to uncover connections in text, like discovering key phrases, analyzing sentiment, and detecting personally identifiable information (PII). Amazon Comprehend APIs extract those valuable insights and pass them into custom function handlers.

Running those handlers inside AWS Lambda functions elastically scales your code without thinking about servers or clusters. Alternatively, you can process insights from Amazon Comprehend with microservices architecture patterns. Regardless of the runtime, your code focuses on using the results, not parsing text.

The following diagram illustrates the workflow.

State machine for moderating text

Lambda functions interact with the following APIs:

  • The DetectEntities API discovers and groups the names of real-world objects such as people and places in the text. You can use a custom vocabulary to redact inappropriate and business-specific entity types.
  • The DetectSentiment API identifies the overall sentiment of the text as positive, negative, or neutral. You can train custom classifiers to recognize industry-specific situations of interest and extract the text’s conceptual meaning.
  • The DetectPIIEntities API identifies PII in your text, such as address, bank account number, or phone number. The output contains the type of PII entity and its corresponding location.
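The following is a hedged boto3 sketch of how these three calls might feed a custom handler; the moderate_text helper and the asterisk-based masking are illustrative assumptions, not part of the Amazon Comprehend API.

```python
# A minimal sketch: run entity, sentiment, and PII detection on normalized
# text and return a redacted copy for downstream business rules.
import boto3

comprehend = boto3.client("comprehend")

def moderate_text(text: str, language: str = "en") -> dict:
    entities = comprehend.detect_entities(Text=text, LanguageCode=language)
    sentiment = comprehend.detect_sentiment(Text=text, LanguageCode=language)
    pii = comprehend.detect_pii_entities(Text=text, LanguageCode=language)

    # Mask each detected PII span, working right to left so the offsets
    # reported by the service stay valid while the string is edited.
    redacted = text
    for entity in sorted(pii["Entities"], key=lambda e: e["BeginOffset"], reverse=True):
        begin, end = entity["BeginOffset"], entity["EndOffset"]
        redacted = redacted[:begin] + "*" * (end - begin) + redacted[end:]

    return {
        "entities": entities["Entities"],
        "sentiment": sentiment["Sentiment"],  # e.g. POSITIVE | NEGATIVE | NEUTRAL
        "redacted_text": redacted,
    }
```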

Moderate audio files

To moderate audio files, you must transcribe the file to text and then analyze it. This process has two variants depending on whether you’re processing individual files (batch) or live audio streams (real time). Batch workflows are ideal for complete files, with the caller receiving one complete response. In contrast, audio streams require continuous sampling and produce multiple transcription results.

Amazon Transcribe is an automatic speech recognition service that uses ML models to convert audio to text. You can integrate it into batch workflows by starting an asynchronous transcription job and periodically querying the job’s status. After the job is complete, you can analyze the output using the plain text moderation workflow from the previous section.

The following diagram illustrates the workflow.

State machine for transcribing audio files

The APIs operate as follows:

  • The StartTranscriptionJob API starts an asynchronous job to transcribe speech to text.
  • The GetTranscriptionJob API returns information about a transcription job. To see the status of the job, check the TranscriptionJobStatus field. If the status property is COMPLETED, you can find the results at the location specified in the TranscriptFileUri field. If you enable content redaction, the redacted transcript appears in RedactedTranscriptFileUri.
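A minimal boto3 sketch of this start-and-poll pattern follows; the job name, media URI, and sleep-based polling loop are illustrative (in the state machine above, Step Functions would handle the waiting).

```python
# A hedged sketch: start an asynchronous transcription job, poll until it
# finishes, and return the transcript location for text moderation.
import time
import boto3

transcribe = boto3.client("transcribe")

def transcribe_audio(job_name: str, media_uri: str) -> str:
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={"MediaFileUri": media_uri},  # e.g. "s3://my-bucket/review.mp3"
        IdentifyLanguage=True,
    )
    while True:
        job = transcribe.get_transcription_job(TranscriptionJobName=job_name)
        status = job["TranscriptionJob"]["TranscriptionJobStatus"]
        if status in ("COMPLETED", "FAILED"):
            break
        time.sleep(10)
    if status == "FAILED":
        raise RuntimeError(job["TranscriptionJob"].get("FailureReason", "unknown"))
    # Fetch this URI to obtain the transcript JSON, then reuse the plain
    # text moderation workflow on its contents.
    return job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"]
```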

Live audio streams need a different pattern that supports a real-time delivery model. Streaming can include pre-recorded media, such as movies, music, and podcasts, and real-time media, such as live news broadcasts. You can transcribe audio chunks instantaneously using Amazon Transcribe streaming over HTTP/2 and WebSockets protocols. After posting a chunk to the service, you receive one or more transcription result objects describing the partial and complete transcription segments. Segments that require moderation can reuse the plain text workflow from the previous section. The following diagram illustrates this process.

Flow diagram for moderating real-time audio streams

The StartStreamTranscription API starts a bidirectional HTTP/2 stream over which your application sends audio to Amazon Transcribe, and Amazon Transcribe streams transcription results back to your application.
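A hedged sketch of the streaming variant, using the open-source amazon-transcribe Python SDK, might look like the following; the sample rate, encoding, region, and audio source are assumptions for this example.

```python
# A minimal sketch: stream PCM audio chunks to Amazon Transcribe and hand
# each completed segment to the plain text moderation workflow.
import asyncio

from amazon_transcribe.client import TranscribeStreamingClient
from amazon_transcribe.handlers import TranscriptResultStreamHandler

class ModerationHandler(TranscriptResultStreamHandler):
    async def handle_transcript_event(self, transcript_event):
        for result in transcript_event.transcript.results:
            if not result.is_partial:  # skip partial segments
                segment = result.alternatives[0].transcript
                print("moderate:", segment)  # call the text workflow here

async def stream_audio(chunks):
    client = TranscribeStreamingClient(region="us-east-1")
    stream = await client.start_stream_transcription(
        language_code="en-US",
        media_sample_rate_hz=16000,
        media_encoding="pcm",
    )
    handler = ModerationHandler(stream.output_stream)

    async def writer():
        async for chunk in chunks:  # e.g. a microphone or broadcast feed
            await stream.input_stream.send_audio_event(audio_chunk=chunk)
        await stream.input_stream.end_stream()

    await asyncio.gather(writer(), handler.handle_events())
```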

Moderate images and photos

Moderating images requires detecting inappropriate, unwanted, or offensive content, such as nudity, suggestiveness, and violence, in images and photos.

Amazon Rekognition enables you to streamline or automate your image and video moderation workflows without requiring ML expertise. Amazon Rekognition returns a hierarchical taxonomy of moderation-related labels, which makes it easy to define granular business rules per your standards and practices, user safety requirements, and compliance guidelines. Amazon Rekognition can also detect and read the text in an image and return bounding boxes for each word found. Text detection supports English, Arabic, Russian, German, French, Italian, Portuguese, and Spanish!

You can use the machine predictions to automate specific moderation tasks entirely. This capability enables human moderators to focus on higher-order work. In addition, Amazon Rekognition can quickly review millions of images or thousands of videos using ML and flag the subset of assets requiring further action. Prefiltering helps provide comprehensive yet cost-effective moderation coverage while reducing the amount of content that human teams moderate.

The following diagram illustrates the workflow.

State machine for moderating images

The APIs operate as follows:

  • The DetectModerationLabels API detects unsafe content in specified JPEG or PNG formatted images. Use DetectModerationLabels to moderate pictures depending on your requirements. For example, you might want to filter images that contain nudity but not images containing suggestive content.
  • The DetectText API detects text in the input image and converts it into machine-readable text.
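A minimal boto3 sketch of these two calls follows; the bucket, key, confidence threshold, and blocked top-level categories are assumptions you would replace with your own standards and practices.

```python
# A hedged sketch: flag an image if any moderation label falls under a
# blocked top-level category, and extract any words it contains.
import boto3

rekognition = boto3.client("rekognition")

BLOCKED_CATEGORIES = {"Explicit Nudity", "Violence"}  # illustrative policy

def moderate_image(bucket: str, key: str) -> dict:
    image = {"S3Object": {"Bucket": bucket, "Name": key}}
    labels = rekognition.detect_moderation_labels(Image=image, MinConfidence=80)
    text = rekognition.detect_text(Image=image)

    # The taxonomy is hierarchical; ParentName is empty for top-level labels.
    flagged = any(
        (label["ParentName"] or label["Name"]) in BLOCKED_CATEGORIES
        for label in labels["ModerationLabels"]
    )
    words = [d["DetectedText"] for d in text["TextDetections"] if d["Type"] == "WORD"]
    return {"flagged": flagged, "words": words}
```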

Moderate rich text documents

Next, you can use Amazon Textract to extract printed and handwritten text and data from scanned documents. This process begins with invoking the StartDocumentAnalysis action to parse Adobe PDF files and document images. You can monitor the job’s progress with the GetDocumentAnalysis action.

The analysis result specifies each uncovered page, paragraph, table, and key-value pair in the document. For example, suppose a health provider must mask patient names in only the claim description field. In that case, the analysis report can power intelligent document processing pipelines that moderate and redact the specific data field. The following diagram illustrates the pipeline.

State machine for moderating rich text documents

The APIs operate as follows:

  • The StartDocumentAnalysis API starts the asynchronous analysis of an input document for relationships between detected items such as key-value pairs, tables, and selection elements.
  • The GetDocumentAnalysis API gets the results of an Amazon Textract asynchronous operation that analyzes text in a document.
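The following hedged boto3 sketch shows the start-and-poll pattern; in production, the state machine above (or an Amazon SNS completion notification) would replace the sleep loop.

```python
# A minimal sketch: start asynchronous document analysis, wait for the job,
# and collect every result block (lines, tables, key-value sets).
import time
import boto3

textract = boto3.client("textract")

def analyze_document(bucket: str, key: str) -> list:
    start = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES", "FORMS"],  # key-value pairs come from FORMS
    )
    job_id = start["JobId"]
    while True:
        result = textract.get_document_analysis(JobId=job_id)
        if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
            break
        time.sleep(5)

    # Page through the remaining result blocks.
    blocks = result.get("Blocks", [])
    while "NextToken" in result:
        result = textract.get_document_analysis(
            JobId=job_id, NextToken=result["NextToken"]
        )
        blocks.extend(result["Blocks"])
    return blocks
```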

Moderate videos

A standard approach to video content moderation is through a frame sampling procedure. Many use cases don’t need to check every frame, and selecting one every 15–30 seconds is sufficient. Sampled video frames can reuse the state machine to moderate images from the previous section. Similarly, the existing process to moderate audio can support the file’s audible content. The following diagram illustrates this workflow.

State machine for moderating video files

The Invoke API runs a Lambda function and synchronously waits for the response.
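As an illustration of the sampling step, here is a minimal sketch using OpenCV; the 15-second interval and JPEG encoding are assumptions, and the resulting bytes can feed the image moderation state machine above via Image={"Bytes": ...}.

```python
# A hedged sketch: yield one JPEG-encoded frame every interval_seconds of
# video, suitable for passing to DetectModerationLabels as raw bytes.
import cv2

def sample_frames(video_path: str, interval_seconds: int = 15):
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 30  # fall back if metadata is missing
    step = int(fps * interval_seconds)
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break  # end of stream
        if index % step == 0:
            ok, jpeg = cv2.imencode(".jpg", frame)
            if ok:
                yield jpeg.tobytes()
        index += 1
    capture.release()
```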

Suppose the media file is an entire movie with multiple scenes. In that case, you can use the Amazon Rekognition Segment API, a composite API that detects technical cues and shot changes. Next, you can use these time offsets to process each segment in parallel with the previous video moderation pattern, as shown in the following diagram.

State machine for moderating video segments

The APIs operate as follows:

  • The StartSegmentDetection API starts asynchronous detection of segments in a stored video.
  • The GetSegmentDetection API gets the segment detection results of an Amazon Rekognition Video analysis started by StartSegmentDetection.
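A hedged boto3 sketch of this segment workflow follows; the segment types and polling loop are illustrative, and each returned time range could fan out to a parallel branch of the video moderation pattern above.

```python
# A minimal sketch: detect technical cues and shots in a stored video and
# return their time offsets for parallel processing.
import time
import boto3

rekognition = boto3.client("rekognition")

def detect_segments(bucket: str, key: str) -> list:
    start = rekognition.start_segment_detection(
        Video={"S3Object": {"Bucket": bucket, "Name": key}},
        SegmentTypes=["TECHNICAL_CUE", "SHOT"],
    )
    job_id = start["JobId"]
    while True:
        result = rekognition.get_segment_detection(JobId=job_id)
        if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
            break
        time.sleep(10)
    return [
        (s["StartTimestampMillis"], s["EndTimestampMillis"])
        for s in result.get("Segments", [])
    ]
```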

Extracting individual frames from the movie doesn’t require fetching the object from Amazon S3 multiple times. A naïve solution involves reading the video into memory and paginating to the end. This pattern is ideal for short clips and for assessments that aren’t time-sensitive.

Another strategy entails moving the file once to Amazon Elastic File System (Amazon EFS), a fully managed, scalable, shared file system for other AWS services, such as Lambda. With Amazon EFS for Lambda, you can efficiently distribute data across function invocations. Each invocation handles a small chunk, unlocking the potential for massively parallel processing and faster processing times.

Clean up

After you experiment with the methods in this post, you should delete any content in S3 buckets to avoid future costs. If you implemented these patterns with provisioned compute resources like Amazon Elastic Compute Cloud (Amazon EC2) or Amazon Elastic Container Service (Amazon ECS), you should stop those instances to avoid further charges.

Conclusion

User-generated content and its value to gaming, social media, e-commerce, and financial and health services organizations will continue to grow. Still, startups and large organizations alike need efficient moderation processes to protect users, information, and the business while lowering operational costs. This solution demonstrates how AI, ML, and NLP technologies can help you moderate content efficiently at scale. You can customize AWS AI services to address your specific moderation needs! These fully managed capabilities remove operational complexity and give you the flexibility to strategically integrate contextual insights and human talent into your moderation processes.

For additional information, resources, and to get started for free today, visit the AWS content moderation homepage.


About the Authors

Nate Bachmeier is an AWS Senior Solutions Architect who nomadically explores New York, one cloud integration at a time. He specializes in migrating and modernizing applications. Besides this, Nate is a full-time student and has two kids.

Ram Pathangi is a Solutions Architect at Amazon Web Services in the San Francisco Bay Area. He has helped customers in agriculture, insurance, banking, retail, healthcare and life sciences, hospitality, and hi-tech verticals run their businesses successfully on the AWS Cloud. He specializes in databases, analytics, and machine learning.

Roop Bains is a Solutions Architect at AWS focusing on AI/ML. He is passionate about helping customers innovate and achieve their business objectives using artificial intelligence and machine learning. In his spare time, Roop enjoys reading and hiking.

Read More

More Freedom on the Freeway: AI Lifts Malaysia’s Toll Barriers

Working as an aerospace engineer in Malaysia, Chee How Lim dreamed of building a startup that could really take off. Today his company, Tapway, is riding a wave of computer vision and AI adoption in Southeast Asia.

A call for help in 2019 with video analytics led to the Kuala Lumpur-based company’s biggest project to date.

Malaysia’s largest operator of toll highways, PLUS, wanted to reduce congestion for its more than 1.5 million daily travelers. A national plan called for enabling car, taxi, bus and truck traffic to flow freely across multiple lanes — but that posed several big challenges.

Unsnarling Traffic Jams

The highways charge five classes of tolls depending on vehicle type. Drivers pay using four different systems, and often enter the highway using one payment system, then exit using another, making it hard to track vehicles.

Dedicated lanes for different vehicle classes forced users to stop so toll booth operators could identify each vehicle, slowing traffic. Even then, some drivers scammed the system, exchanging cards on the highway to get lower tolls.

“We showed them how with computer vision — just a camera and AI — you could solve all that,” said Lim.

AI Smooths the Flow

Using NVIDIA GPUs and software, Tapway trained and ran AI models that could read a vehicle’s license plate and detect its class, make and color in just 50 milliseconds, about one tenth of one eye blink — even if it’s traveling at up to 40 kilometers/hour while approaching a toll plaza.

Tapway’s VehicleTrack software works in all light and weather conditions with a consistent 97 percent accuracy. And thanks in part to NVIDIA Triton Inference Server, a single GPU can manage up to 50 simultaneous video streams.

PLUS has installed 577 cameras so far, and plans to expand to nearly 900 in 92 toll plazas to meet its goal of freely flowing traffic.

Inside a Computer Vision System

Under the hood, the system depends on smart AI models trained in the cloud on a network of NVIDIA A100 and V100 Tensor Core GPUs.

They use a dataset of up to 100,000 images to prepare a new model for a Tapway customer in a few hours, a huge improvement on a CPU-based system that used to take several days, Lim said.

But the real magic comes with inference, running those models in production to process up to 28,800 images a minute on edge servers using NVIDIA A10, A30 and T4 GPUs.

Software Makes it Sing

Tapway uses the NVIDIA DeepStream software development kit to build its computer vision apps, NVIDIA TensorRT to keep its AI models lean and fast, and Triton to play traffic cop, directing AI inference jobs.

“Triton is a real lifesaver for us,” said Lim. “We had some scaling problems doing inference and multithreading on our own and couldn’t scale beyond 12 video streams in a server, but with Triton we easily handle 20 and we’ve tested it on up to 50 simultaneous streams,” he said.

In February, Tapway officially became an NVIDIA Metropolis partner. The program gives companies in intelligent video analytics early access to technology and expertise.

“We had to pass stress tests in areas like multistreaming and security — that helped us strengthen our product offering — and from a business perspective it’s a way to be recognized and further establish ourselves as an AI expert in the region,” Lim said.

AI Covers the Waterfront

Since its start in 2014, Tapway has deployed 3,000 sensors in 500 locations throughout Malaysia and Singapore. Off the road, they help malls and retailers understand customer shopping habits, and now the company is gearing up to help manufacturers like the region’s car makers and palm oil producers inspect products for quality control.

“The demand has never been better, there are a lot of vision challenges in the world, and quite a few exciting projects we hope to land soon,” he said.

To learn more, watch Lim’s talk at GTC (free with registration). And download this free e-book to learn how NVIDIA Metropolis is helping build smarter and safer spaces around the world.

 

The post More Freedom on the Freeway: AI Lifts Malaysia’s Toll Barriers appeared first on NVIDIA Blog.

Read More

Q&A: Chris Rackauckas on the equations at the heart of practically everything

Some people pass the time with hobbies like crossword puzzles or Sudoku. When Chris Rackauckas has a spare moment, he often uses it to answer questions about numerical differential equations that people have posed online. Rackauckas — previously an MIT applied mathematics instructor, now an MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) research affiliate and the co-principal investigator of the MIT Julia Lab — has already posted thousands of these answers, and if you have a question, the odds are that he has already addressed it. His research, unsurprisingly, revolves around differential equations and computational methods — using AI and other techniques — for solving them quickly and efficiently.

During his graduate studies in mathematics at the University of California at Irvine, which earned him a PhD in 2018, Rackauckas focused on medical and pharmacological applications of his work. In fact, he developed the core software and techniques for Pumas-AI — a Baltimore-based firm that provides software for pharmaceutical modeling and simulation purposes — when he was still a graduate student. He now serves as the company’s director of scientific research.

Since coming to MIT in 2019, Rackauckas has found a much wider range of applications for his “accelerated” differential equation solvers, including global climate modeling and building heating, ventilation, and air conditioning (HVAC) systems. He took time from his efforts to find ever-more rapid ways of attacking differential equations to talk about this work, which has earned him numerous honors, including the 2020 United States Air Force Artificial Intelligence Accelerator Scientific Excellence Award.

Q: How did you get into what you’re doing today?

A: As an undergraduate math major at Oberlin College, I mostly focused on the “methods courses” in scientific domains — statistical methods in psychology, time series econometrics, computational modeling in physics, and so forth. I didn’t have a well-thought-out game plan. I just wanted to understand how science is really done and how we know when our scientific approaches are giving us a correct model of a given system. Fortuitously, that path turned out to be a good one for someone in my current line of work.

In graduate school, I went into biology — specifically combining differential equation solvers with systems biology. The goal there was to make predictive models of how the randomness of a chemical, and its concentration, changes in the body, although at the time I was working with zebrafish. It turns out that systems biology is very close to systems pharmacology. You basically replace fish with humans.

Q: Why are differential equations so important in the world around us?

A: The way I like to describe it is that all scientific experiments are measuring how something changes. How do I go from an understanding of how things change to a prediction of what will happen? That’s what the process of solving a differential equation is all about. Simulations, which are experiments that we carry out on computers, can involve solving thousands upon thousands of differential equations.

Such a simulation might tell you, for instance, not only how a drug concentration changes over time but also how the effects of the drug on the body change. It’s not the same for every person, so you have to adapt the equations for individuals, depending on their age, weight, etc.

Q: Given your focus on “accelerated” equation solvers, where can you find the best opportunities for speeding things up?

A: The clinical trials for a new drug have a set period of time; you can’t just make the human element faster. But in the preclinical domain, there’s always a period of analysis. Developing a new drug could cost $10 billion, so before you start something like that, you want to know the probability that a drug will work on its target population, as well as the optimal dose for an individual. That’s the purpose of preclinical analysis and quantitative systems pharmacology. Suppose that you typically spend three months on analysis and six months on clinical trials. If you can shorten that analysis from three months to a day — roughly a 100-fold acceleration — you will have cut the time to release a drug by a third.

Then there’s clinical pharmacology, where if you can understand how to get the first dose correct you might be able to save time on repeating elements of the trials. It turns out that my Pumas colleagues and I have already achieved a 175-fold acceleration in preclinical analyses carried out for Pfizer. Moderna also publicly used Pumas and our clinical analysis methods in its clinical analysis of the Covid-19 vaccine and other drugs.

Here’s another opportunity for time and cost savings: Mitsubishi has a facility in Japan for testing HVAC systems. You have to build the entire system and then test it in a building. Each experiment can cost millions of dollars. We’re now working with them to test out, say, 10 different ideas on a computer in order to pick out the one out of those 10 options that they ought to select for a prototype and subsequent experiments.

Q: Can you discuss some other examples of how your work is used?

A: The SciML.ai website keeps a (woefully incomplete) showcase of the amazing ways people have used these methods. CliMA — an Earth system model developed by scientists at Caltech, MIT, and other institutions — relies on the differential equation solvers that I wrote. Recently I was at an applied math conference where a group, independent of me, reported that they had used my software tools to make NASA launch simulations run 15,000 times faster.

Q: What are your plans for the future?

A: There are a lot of things in the pipeline. One application I’ve just started to pursue is predicting the flow of wildfires; another is to predict transient cardiac events like heart attacks, strokes, and arrhythmias. A third area I’m moving into is the realm of neuropsychopharmacology — trying to predict things like individualized biosignals in bipolar disorder, depression, and schizophrenia in order to design drugs that are better suited for treating these disorders. This is an area of dire need, where better models could lead to much more effective treatments.

In between these projects, I might take a moment to answer the odd question about differential equations. You’ve got to relax sometime.

Read More

Streaming On-Device Detection of Device Directed Speech from Voice and Touch-Based Invocation

When interacting with smart devices such as mobile phones or wearables, the user typically invokes a virtual assistant (VA) by saying a keyword or by pressing a button on the device. However, in many cases, the VA can accidentally be invoked by keyword-like speech or an accidental button press, which may have implications on user experience and privacy. To this end, we propose an acoustic false-trigger-mitigation (FTM) approach for on-device device-directed speech detection that simultaneously handles the voice-trigger and touch-based invocation. To facilitate the model deployment…Apple Machine Learning Research

Bilingual End-to-End ASR with Byte-Level Subwords

In this paper, we investigate how the output representation of an end-to-end neural network affects multilingual automatic speech recognition (ASR). We study different representations including character-level, byte-level, byte pair encoding (BPE), and byte-level byte pair encoding (BBPE) representations, and analyze their strengths and weaknesses. We focus on developing a single end-to-end model to support utterance-based bilingual ASR, where speakers do not alternate between two languages in a single utterance but may change languages across utterances. We conduct our experiments on…Apple Machine Learning Research

Utilizing Imperfect Synthetic Data to Improve Speech Recognition

With recent advances in speech synthesis, synthetic data is becoming a viable alternative to real data for training speech recognition models. However, machine learning with synthetic data is not trivial due to the gap between the synthetic and the real data distributions. Synthetic datasets may contain artifacts that do not exist in real data such as structured noise, content errors, or unrealistic speaking styles. Moreover, the synthesis process may introduce a bias due to uneven sampling of the data manifold. We propose two novel techniques during training to mitigate the problems due to…Apple Machine Learning Research

An Experimental Design Perspective on Model-Based Reinforcement Learning

$$\newcommand{\statespace}{\mathcal{S}} \newcommand{\actionspace}{\mathcal{A}} \newcommand{\Rbb}{\mathbb{R}} \newcommand{\Ebb}{\mathbb{E}} \newcommand{\Hbb}{\mathbb{H}} \DeclareMathOperator{\EIG}{EIG}$$

Reinforcement learning (RL) has achieved astonishing successes in domains where the environment is easy to simulate. For example, in games like Go or those in the Atari library, agents can play millions of games in the course of days to explore the environment and find superhuman policies [1]. However, transferring these advances to broader real-world applications is challenging because the cost of exploration in many important domains is high. For example, while RL-based solutions for controlling plasmas in nuclear fusion power generation are promising, there is only one operating tokamak in the United States and its resources are in extremely high demand. Even the most data-efficient RL algorithms typically take thousands of samples to solve even moderately complex problems [2, 3], which is infeasible in plasma control and many other applications.

In contrast to conventional machine learning settings where the data is given to the decision maker, an RL agent can choose data to learn from. A natural idea for reducing data requirements is to choose data wisely such that a smaller amount of data is sufficient to perform well on a task. In this post, we describe a practical implementation of this idea. Specifically, we offer an answer to the following question: “If we were to collect one additional datapoint from anywhere in the state-action space to best improve our solution to the task, which one would it be?”. This question is related to a more fundamental idea in the design of intelligent agents with limited resources: such agents should be able to understand what information about the world is the most useful to help them accomplish their task. We see this work as a small step towards this bigger goal.

In our recent ICLR paper, An Experimental Design Perspective on Model-Based Reinforcement Learning, we derive an acquisition function that guides an agent in choosing data for the most successful learning. In doing this, we draw a connection between model-based reinforcement learning and Bayesian optimal experimental design (BOED) and evaluate data prospectively in the context of the task reward function and the current uncertainty about the dynamics. Our approach can be efficiently implemented under a conventional assumption of a Gaussian Process (GP) prior on the dynamics function. Typically in BOED, acquisition functions are used to sequentially design experiments that are maximally informative about some quantity of interest by repeatedly choosing the maximizer, running the experiment, and recomputing the acquisition function with the new data. Generalizing this procedure, we propose a simple algorithm that is able to solve a wide variety of control tasks, often using orders of magnitude less data than competitor methods to reach similar asymptotic performance.

Preliminaries

In this work, we consider an RL agent that operates in an environment with unknown dynamics. This is a general RL model for decision-making problems in which the agent starts without any knowledge of how its actions impact the world. The agent can then query different state-action pairs to explore the environment and find a behavior policy that results in the best reward. For example, in the plasma control task, the states are various physical configurations of the plasma and possible actions include injecting power and changing the current. The agent does not have prior knowledge of how its actions impact the conditions of the plasma. Thus, it needs to quickly explore the space to ensure efficient and safe operation of the physical system—a requirement captured in the corresponding reward function.

Given that each observation of a state-action pair is costly, an agent needs to query as few state-action pairs as possible and in this work we develop an algorithm that informs the agent about which queries to make.

We operate under a setting we call transition query reinforcement learning (TQRL). In this setting, an agent can sequentially query the dynamics at arbitrary states and actions to learn a good policy, essentially teleporting between states as it wishes. Traditionally, in the rollout setting, agents must simply choose actions and execute entire episodes to collect data. TQRL is therefore a slightly more informative form of access to the real environment.

Precise Definition of the Setup

More precisely: we address finite-horizon discrete time Markov Decision Processes (MDPs), which consist of a tuple \(\langle \statespace, \actionspace, T, r, p_0, H\rangle\) where:

  • \(\statespace\) is the state space
  • \(\actionspace\) is the action space
  • \(T\) (dynamics) is the stochastic transition function that maps (state, action) pairs \(\statespace \times \actionspace\) to a probability distribution over states \(\statespace\)
  • \(r: \statespace\times\actionspace \to \Rbb\) is a reward function
  • \(p_0\) is a start distribution over states \(\statespace\)
  • \(H\) is an integer-valued horizon, that is, the number of steps the agent will perform in the environment

We assume that all of these parameters are known besides the dynamics \(T\). The key quantity that defines the behaviour of the agent is its policy \(\pi: \statespace \to \actionspace\), which tells the agent what action to take in a given state. Thus, the overall goal of the agent is to find a policy that maximizes the cumulative reward over the trajectory \(\tau \sim p(\tau\mid \pi, T)\) followed by the agent. Formally, a trajectory is simply a sequence of (state, action) pairs \(\tau = [s_0, a_0, \dots, a_{H-1}, s_H]\), where \(a_i = \pi(s_i)\) is the action taken by the agent at step \(i\) and \(s_i \sim T(s_{i-1}, a_{i-1})\) is the state the agent was in at time \(i\). Denoting the cumulative reward over trajectory \(\tau\) as \(R(\tau)\), the agent needs to solve the following optimization problem:

$$\max_\pi \Ebb_{\tau \sim p(\tau\mid \pi, T)}\left[R(\tau)\right].$$

We denote an optimal policy for given dynamics \(T\) by \(\pi^*_T\). As we know the other parts of the MDP, to solve the optimization problem we need to learn a model \(\hat{T}\) of the transition function.

Main Idea

Inspired by BOED and Bayesian algorithm execution (BAX) [11], we use ideas from information theory to motivate our method to effectively choose data points. Our goal is to sequentially choose queries \((s, a)\) such that our agent quickly finds a good policy. We observe that to perform the task successfully, we do not need to approximate the optimal policy \(\pi^*\) everywhere in the state space. Indeed, there could be regions of the state space that are unlikely to be visited by the optimal policy. Thus, we only need to approximate the optimal policy in the regions of the state space that are visited by the optimal policy.

Therefore, we choose to learn about \(\tau^*\)—the optimal trajectory governed by the optimal policy \(\pi^*\). This objective only requires data about the areas we believe \(\pi^*\) will visit as it solves the task, so intuitively we should not “waste” samples on irrelevant regions in the state-action space. In plasma control, this idea might look like designing experiments in certain areas of the state and action space that will teach us the most about controlling plasma in the target regimes we need to maintain fusion.

We thus define our acquisition function to be the expected information gain about \(\tau^*\) from sampling a point \(T(s, a)\) given a dataset \(D\): $$\EIG_{\tau^*}(s, a) = \Hbb[\tau^* \mid D] - \Ebb_{s'\sim T(s, a)}\left[\Hbb[\tau^*\mid D\cup \{(s, a, s')\}]\right].$$ Intuitively, this quantity measures how much the additional data is expected to reduce the uncertainty (here given by Shannon entropy, denoted \(\Hbb\)) about the optimal trajectory.

At a high level, following methods related to the InfoBAX algorithm [11], we can approximate this acquisition function by a three-step procedure:

  • First, we sample many possible dynamics functions from our posterior (e.g., functions that describe plasma evolution).
  • Second, we find optimal trajectories on each of the sampled dynamics functions without taking new data from the environment, as we can simulate controls on these dynamics.
  • Third, we compute predictive entropies of our model at \((s, a)\) and of our model with additional data taken from each optimal trajectory. We can then subtract the trajectory-conditioned entropy from the original entropy.

This final step allows us to estimate the mutual information between \(T(s, a)\mid D\) and \(\tau^*\), which is precisely the quantity we want. We give a more precise description of this below.

Given a task of regulating plasma to the goal conditions (green dot), we compute posterior samples of the optimal trajectory (paths in color). We can then estimate the point (red circle) with maximal mutual information with these optimal trajectories and query the dynamics at that point.

Computing \(\EIG_{\tau^*}\) via posterior function sampling

More formally: as \(\tau^*\) is a high-dimensional object that implicitly assumes access to an optimal decision-making policy, it is not obvious that the entropies involved in computing it will be easy to estimate. However, by making two additional assumptions and leveraging properties of mutual information, we can derive a practical method for estimating \(\EIG_{\tau^*}\). In particular, we need to assume that:

  • The dynamics \(T\) are drawn from a GP prior \(P(T)\), a fairly mild assumption since GPs are universal approximators [10].
  • \(\pi_T \approx \pi^*\) for an MDP with known rewards and transition function \(T\), i.e., that a model-predictive control (MPC) policy using known dynamics will be close to optimal on those dynamics. This is not true in all settings, and we investigate how crucial this assumption is to the performance of the algorithm in our experiments section.

We know from information theory that $$\EIG_{\tau^*}(s,a) = I(\tau^*; T(s, a)) = \Hbb[T(s, a)\mid D] - \Ebb_{\tau^*\sim P(\tau^* \mid D)}\left[\Hbb[T(s, a)\mid D\cup \tau^*]\right],$$ where \(I\) refers to the mutual information. This expression is much easier to deal with, given a GP. We can use the fact that, given a GP prior, the marginal posterior at any point in the domain is a Gaussian in closed form to exactly compute the left term \(\Hbb[T(s, a)\mid D]\). We compute the right term \(\Ebb_{\tau^*\sim P(\tau^* \mid D)}\left[\Hbb[T(s, a)\mid D\cup \tau^*]\right]\) via a Monte Carlo approximation, sampling \(T' \sim P(T'\mid D)\) (doable efficiently due to [6]) and then sampling trajectories \(\tau\sim P(\tau \mid T', \pi_{T'})\) by executing MPC using \(T'\) as both the dynamics model used for planning and the dynamics function of the MDP used to sample transitions in \(\tau\). As \(\tau\) is a sequence of state-action transitions, it is essentially more data for our estimate of the transition model. So it is straightforward to compute the model posterior \(P(T(s, a)\mid D\cup\tau^*)\) (which again must be Gaussian) and read off the entropy of the prediction. The full Monte Carlo estimator is $$\EIG_{\tau^*}(s, a) \approx \Hbb[T(s, a)\mid D] - \frac{1}{n}\sum_{i \in [n]}\Hbb[T(s, a)\mid D\cup \tau_i]$$ for \(\tau_i\) sampled as described above.

In summary, we can estimate our acquisition function via the following procedure, which is subject to the two assumptions listed above:

  1. Sample many functions \(T_i\sim P(T\mid D)\).
  2. Sample trajectories \(\tau_i \sim P(\tau\mid T_i, \pi_{T_i})\) by executing the MPC policy \(\pi_{T_i}\) on the dynamics \(T_i\).
  3. Compute the entropies \(\Hbb[T(s, a)\mid D]\) and \(\Hbb[T(s, a)\mid D\cup \tau_i]\), for all \(i\), using standard GP techniques.
  4. Compute the acquisition function using our Monte Carlo estimator.
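As a rough illustration, the estimator can be written in a few lines of Python; the gp.posterior_std interface below is a hypothetical stand-in for a GP library, not an API from the paper.

```python
# A schematic sketch of the Monte Carlo estimator above: EIG is the entropy
# of the GP prediction at (s, a) under the current data, minus the average
# entropy after conditioning on each sampled optimal trajectory tau_i.
import numpy as np

def gaussian_entropy(std):
    # Entropy of a univariate Gaussian: 0.5 * log(2 * pi * e * sigma^2).
    return 0.5 * np.log(2 * np.pi * np.e * np.asarray(std) ** 2)

def eig_tau_star(gp, point, dataset, sampled_trajectories):
    """EIG(s, a) ~= H[T(s,a) | D] - (1/n) * sum_i H[T(s,a) | D u tau_i]."""
    # gp.posterior_std(point, data) is a hypothetical method returning the
    # GP predictive standard deviation at `point` conditioned on `data`.
    prior_entropy = gaussian_entropy(gp.posterior_std(point, dataset))
    conditioned = [
        gaussian_entropy(gp.posterior_std(point, dataset + tau))
        for tau in sampled_trajectories  # tau_i ~ P(tau | T_i, pi_{T_i})
    ]
    return prior_entropy - np.mean(conditioned, axis=0)
```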

Inspired by the main ideas of BOED and active learning, we give a simple greedy procedure which we call BARL (Bayesian Active Reinforcement Learning) for using our acquisition function to acquire data given some initial dataset:

  1. Compute \(\EIG_{\tau^*}(s, a)\) given the dataset for a large random set of state-action pairs. Samples of \(\tau^*\) can be reused between these points.
  2. Sample \(s' \sim T(s, a)\) for the \((s, a)\) that was found to maximize the acquisition function and add \((s, a, s')\) to the dataset.
  3. Repeat steps 1-2 until the query budget is exhausted. The evaluation policy is simply MPC on the GP posterior mean.
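Putting the pieces together, the loop itself is short; in this schematic sketch, sample_candidates, sample_trajectories, and query_dynamics are hypothetical placeholders for the components described above, and eig_tau_star is the estimator sketched earlier.

```python
# A schematic sketch of the BARL loop: score random candidate queries by
# expected information gain about tau*, query the best one, and repeat.
import numpy as np

def barl(gp, dataset, budget, n_candidates=1000):
    for _ in range(budget):
        candidates = sample_candidates(n_candidates)     # random (s, a) pairs
        trajectories = sample_trajectories(gp, dataset)  # reused across candidates
        scores = [eig_tau_star(gp, c, dataset, trajectories) for c in candidates]
        s, a = candidates[int(np.argmax(scores))]
        s_next = query_dynamics(s, a)                    # one transition query
        dataset = dataset + [(s, a, s_next)]
    return dataset  # evaluation policy: MPC on the GP posterior mean
```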

Does BARL reduce the data requirements of RL?

We evaluate BARL in the TQRL setting on five environments that span a variety of reward function types, dimensionalities, and amounts of required data. In this evaluation, we estimate the minimum amount of data an algorithm needs to learn a controller. The evaluation environments include the standard underactuated pendulum swing-up task, a cartpole swing-up task, the standard 2-DOF reacher task, a navigation problem where the agent must find a path across pools of lava, and a simulated nuclear fusion control problem where the agent is tasked with modulating the power injected into the plasma to achieve a target pressure.

To assess the performance of BARL in solving MDPs quickly, we assembled a group of reinforcement learning algorithms that represent the state of the art in solving continuous MDPs. We compare against model-based algorithms PILCO [7], PETS [2], model-predictive control with a GP (MPC), and uncertainty sampling with a GP (\(\EIG_T\)), as well as model-free algorithms SAC [3], TD3 [8], and PPO [9]. Besides the uncertainty sampling (which operates in the TQRL setting and is directly comparable to BARL), these methods rely on the rollout setting for RL and are somewhat disadvantaged relative to BARL.

BARL clearly outperforms each of the comparison methods in data efficiency on nearly every problem. We see that simpler methods like \(\EIG_T\) and MPC perform well on lower-dimensional problems like Lava Path, Pendulum, and Beta Tracking, but struggle with the higher-dimensional Reacher problem. Model-free methods like SAC, TD3, and PPO are notably sample-hungry.

Sample Complexity: Median number of samples across 5 seeds required to reach the performance of MPC on the ground truth dynamics, averaged across 5 trials on our control environments. We record N/A when the median run is unable to solve the problem by the end of training.

After further investigation, we also find that models that used data chosen by BARL are more accurate on the datapoints required to solve the problem and less accurate on a randomly chosen test set of points than models using data chosen via \(\EIG_T\). This implies that BARL is choosing the ‘right data’. Since the same model is used in both of these methods, there will inevitably be areas of the input space where each of the methods performs better than the other. BARL performs better on the areas that are needed to solve the problem.

We compare BARL and an uncertainty sampling baseline (\(\EIG_T\)) on three criteria. In the left chart, we plot control performance as queries are made. \(\pi_T\) is the performance of MPC with a perfect model. In the middle, we plot modeling errors for BARL vs \(\EIG_T\) on the points where the model is queried in order to plan actions. On the right, we plot modeling errors on a uniform test set. BARL models the dynamics well on the points required to plan the optimal actions (middle) while not learning the dynamics well in general (right). This focus on choosing relevant datapoints allows BARL to solve the task quickly (left).

Conclusion

We believe \(\EIG_{\tau^*}\) is an important first step towards agents that think proactively about the data that they will acquire in the future. Though we are encouraged by the strong performance we have seen so far, there is substantial future work to be done. In particular, we are currently working to extend BARL to the rollout setting by planning actions that will lead to maximum information in the future. We also aim to scale these ideas to the high-dimensional state and action spaces that are necessary for many real-world problems.

References

[1] Mastering the game of Go with deep neural networks and tree search, Silver et al, Nature 2016

[2] Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models, Chua et al, NeurIPS 2018

[3] Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, Haarnoja et al, ICML 2018

[4] Cross-Entropy Randomized Motion Planning, Kobilarov et al, RSS 2008

[5] Sample-efficient Cross-Entropy Method for Real-time Planning, Pinneri et al, CoRL 2020

[6] Efficiently Sampling Functions from Gaussian Process Posteriors, Wilson et al, ICML 2020

[7] PILCO: A Model-Based and Data-Efficient Approach to Policy Search, Deisenroth & Rasmussen, ICML 2011

[8] Addressing Function Approximation Error in Actor-Critic Methods, Fujimoto et al, ICML 2018

[9] Proximal Policy Optimization Algorithms, Schulman et al, 2017

[10] Universal Kernels, Micchelli et al, JMLR 2006

[11] Bayesian Algorithm Execution: Estimating Computable Properties of Black-box Functions Using Mutual Information, Neiswanger et al, ICML 2021

Read More