Simulated Spotify Listening Experiences for Reinforcement Learning with TensorFlow and TF-Agents

Simulated Spotify Listening Experiences for Reinforcement Learning with TensorFlow and TF-Agents

Posted by Surya Kanoria, Joseph Cauteruccio, Federico Tomasi, Kamil Ciosek, Matteo Rinaldi, and Zhenwen Dai – Spotify

Introduction

Many of our music recommendation problems involve providing users with ordered sets of items that satisfy users’ listening preferences and intent at that point in time. We base current recommendations on previous interactions with our application and, in the abstract, are faced with a sequential decision making process as we continually recommend content to users.

Reinforcement Learning (RL) is an established tool for sequential decision making that can be leveraged to solve sequential recommendation problems. We decided to explore how RL could be used to craft listening experiences for users. Before we could start training Agents, we needed to pick a RL library that allowed us to easily prototype, test, and potentially deploy our solutions.

At Spotify we leverage TensorFlow and the extended TensorFlow Ecosystem (TFX, TensorFlow Serving, and so on) as part of our production Machine Learning Stack. We made the decision early on to leverage TensorFlow Agents as our RL Library of choice, knowing that integrating our experiments with our production systems would be vastly more efficient down the line.

One missing bit of technology we required was an offline Spotify environment we could use to prototype, analyze, explore, and train Agents offline prior to online testing. The flexibility of the TF-Agents library, coupled with the broader advantages of TensorFlow and its ecosystem, allowed us to cleanly design a robust and extendable offline Spotify simulator.

We based our simulator design on TF-Agents Environment primitives and using this simulator we developed, trained and evaluated sequential models for item recommendations, vanilla RL Agents (PPG, DQN) and a modified deep Q-Network, which we call the Action-Head DQN (AH-DQN), that addressed the specific challenges imposed by the large state and action space of our RL formulation.

Through live experiments we were able to show that our offline performance estimates were strongly correlated with online results. This then opened the door for large scale experimentation and application of Reinforcement Learning across Spotify, enabled by the technological foundations unlocked by TensorFlow and TF-Agents.

In this post we’ll provide more details about our RL problem and how we used TF-Agents to enable this work end to end.

The RL Loop and Simulated Users

Reinforcement Learning loop
In RL, Agents interact with the environment continuously. At a given time step the Agent consumes an observation from the environment and, using this observation, produces an action given its policy at time t. The environment then processes the action and emits both a reward and the next observation (note that although typically used interchangeably, State is the complete information required to summarize the environment post action, Observation is the portion of this information actually exposed to the Agent).

In our case the reward emitted from the environment is the response of a user to music recommendations driven by the Agent’s action. In the absence of a simulator we would need to expose real users to Agents to observe rewards. We utilize a model-based RL approach to avoid letting an untrained Agent interact with real users (with the potential of hurting user satisfaction in the training process).

In this model-based RL formulation the Agent is not trained online against real users. Instead, it makes use of a user model that predicts responses to a list of tracks derived via the Agent’s action. Using this model we optimize actions in such a way as to maximize a (simulated) user satisfaction metric. During the training phase the environment makes use of this user model to return a predicted user response to the action recommended by the Agent.

We use Keras to design and train our user model. The serialized user model is then unpacked by the simulator and used to calculate rewards during Agent training and evaluation.

Simulator Design

In the abstract, what we needed to build was clear. We needed a way to simulate user listening sessions for the Agent. Given a simulated user and some content, instantiate a listening session and let the Agent drive recommendations in that session. Allow the simulated user to “react” to these recommendations and let the Agent adjust its strategy based on this result to drive some expected cumulative reward.

The TensorFlow Agents environment design guided us in developing the modular components of our system, each of which was responsible for different parts of the overall simulation.

In our codebase we define an environment abstraction that requires the following be defined for every concrete instantiation:

class AbstractEnvironment(ABC):
_user_model: AbstractUserModel = None
_track_sampler: AbstractTrackSampler = None
_episode_tracker: EpisodeTracker = None
_episode_sampler: AbstractEpisodeSampler = None

    @abstractmethod
    def reset(self) -> List[float]:
pass

    @abstractmethod
    def step(self, action: float) -> (List[float], float, bool):
pass

    def observation_space(self) -> Dict:
pass

    @abstractmethod
    def action_space(self) -> Dict:
pass

Set-Up

At the start of Agent training we need to instantiate a simulation environment that has representations of hypothetical users and the content we’re looking to recommend to them. We base these instantiations on both real and hypothetical Spotify listening experiences. The critical information that defines these instantiations is passed to the environment via _episode_sampler. As mentioned, we also need to provide the simulator with a trained user model, in this case via _user_model.
Flow chart of agent training set up

Actions and Observations

Just like any Agent environment, our simulator requires that we specify the action_spec and observation_spec. Actions in our case may be continuous or discrete depending both on our Agent selection and how we propose to translate an Agent’s action into actual recommendations. We typically recommend ordered lists of items drawn from a pool of potential items. Formulating this action space directly would lead to it being combinatorially complex. We also assume the user will interact with multiple items, and as such previous work in this area that relies on single choice assumptions doesn’t apply.

In the absence of a discrete action space consisting of item collections we need to provide the simulator with a method for turning the Agent’s action into actual recommendations. This logic is contained in the via _track_sampler. The “example play modes” proposed by the episode sampler contains information on items that can be presented to the simulated user. The track sampler consumes these and the agent’s action and returns actual item recommendations.
Flow chart of Agent actions_spec and observation_spec combining to create a recommendation

Termination and Reset

We also need to handle the episode termination dynamics. In our simulator, the reset rules are set by the model builder and based on empirical investigations of interaction data relevant to a specific music listening experience. As a hypothetical, we may determine that 92% of listening sessions terminate after 6 sequential track skips and we’d construct our simulation termination logic to match. It also requires that we design abstractions in our simulator that allow us to check if the episode should be terminated after each step.

When the episode is reset the simulator will sample a new hypothetical user listening session pair and begin the next episode.

Episode Steps

As with standard TF Agents Environments we need to define the step dynamics for our simulation. We have optional dynamics of the simulation that we need to make sure are enforced at each step. For example, we may desire that the same item cannot be recommended more than once. If the Agent’s action indicates a recommendation of an item that was previously recommended we need to build in the functionality to pick the next best item based on this action.

We also need to call the termination (and other supporting functions) mentioned above as needed at each step.

Episode Storage and Replay

The functionality mentioned up until this point collectively created a very complex simulation setup. While the TF Agents replay buffer provided us with the functionality required to store episodes for Agent training and evaluation, we quickly realized the need to be able to store more episode data for debugging purposes, and more detailed evaluations specific to our simulation distinct from standard Agent performance measures.

We thus allowed for the inclusion of an expanded _episode_tracker that would store additional information about the user model predictions, information noting the sampled users/content pairs, and more.

Creating TF-Agent Environments

Our environment abstraction gives us a template that matches that of a standard TF-Agents Environment class. Some inputs to our environment need to be resolved before we can actually create the concrete TF-Agents environment instance. This happens in three steps.

First we define a specific simulation environment that conforms to our abstraction. For example:

class PlaylistEnvironment(AbstractEnvironment):
def __init__(
self,
user_model: AbstractUserModel,
track_sampler: AbstractTrackSampler,
episode_tracker: EpisodeTracker,
episode_sampler: AbstractEpisodeSampler,
....
):

...

Next we use an Environment Builder Class that takes as input a user model, track sampler, etc. and an environment class like PlaylistEnvironment. The builder creates a concrete instance of this environment:

self.playlist_env: PlaylistEnvironment = environment_ctor(
user_model=user_model,
track_sampler=track_sampler,
episode_tracker=episode_tracker,
episode_sampler=self._eps_sampler,
)

Lastly, we utilize a conversion class that constructs a TF-Agents Environment from a concrete instance of ours:

class TFAgtPyEnvironment(py_environment.PyEnvironment):
    def __init__(self, environment: AbstractEnvironment):
  super().__init__()
  self.env = environment

This is then executed internally to our Environment Builder:

class EnvironmentBuilder(AbstractEnvironmentBuilder):

      def __init__(self, ...):
      ...

      def get_tf_env(self):
       ...
      tf_env: TFAgtPyEnvironment = TFAgtPyEnvironment(
    self.playlist_env
      )
      return tf_env

The resulting TensorFlow Agents environment can then be used for Agent training.
Flow chart showing simulator design
This simulator design allows us to easily create and manage multiple environments with a variety of different configurations as needed.

We next discuss how we used our simulator to train RL Agents to generate Playlists.

A Customized Agent for Playlist Generation

As mentioned, Reinforcement Learning provides us with a method set that naturally accommodates the sequential nature of music listening; allowing us to adapt to users’ ever evolving preferences as sessions progress.

One specific problem we can attempt to use RL to solve is that of automatic music playlist generation. Given a (large) set of tracks, we want to learn how to create one optimal playlist to recommend to the user in order to maximize satisfaction metrics. Our use case is different from standard slate recommendation tasks, where usually the target is to select at most one item in the sequence. In our case, we assume we have a user-generated response for multiple items in the slate, making slate recommendation systems not directly applicable. Another complication is that the set of tracks from which recommendations are drawn is ever changing.

We designed a DQN variant capable of handling these constraints that we called an Action Head DQN (AHDQN).
Moving image of AH-DQN network creating recommendations based on changing variables
The AH-DQN network takes as input the current state and an available action to produce a single Q value for the input action. This process is repeated for every possible item in the input. Finally, the item with the highest Q value is selected and added to the slate, and the process continues until the slate is full.

Experiments In Brief

We tested our approach both offline and online at scale to assess the ability of the Agent to power our real-world recommender systems. In addition to testing the Agent itself we were also keen to assess the extent to which our offline performance estimates for various policies returned by our simulator matched (or at least directionally aligned) with our online results.
Graph measuring simulated performance assessment by scaled online reward for different policies

We observed this directional alignment for numerous naive, heuristic, model driven, and RL policies.

Please refer to our KDD paper for more information on the specifics of our model-based RL approach and Agent design.

Federico Tomasi, Joseph Cauteruccio, Surya Kanoria, Kamil Ciosek, Matteo Rinaldi, and Zhenwen Dai
KDD 2023

Acknowledgements

We’d like to thank all our Spotify teammates past and present who contributed to this work. Particularly, we’d like to thank Mehdi Ben Ayed for his early work in helping to develop our RL codebase. We’d also like to thank the TensorFlow Agents team for their support and encouragement throughout this project (and for the library that made it possible).

Read More

Defect detection in high-resolution imagery using two-stage Amazon Rekognition Custom Labels models

Defect detection in high-resolution imagery using two-stage Amazon Rekognition Custom Labels models

High-resolution imagery is very prevalent in today’s world, from satellite imagery to drones and DLSR cameras. From this imagery, we can capture damage due to natural disasters, anomalies in manufacturing equipment, or very small defects such as defects on printed circuit boards (PCBs) or semiconductors. Building anomaly detection models using high-resolution imagery can be challenging because modern computer vision models typically resize images to a lower resolution to fit into memory for training and running inference. Reducing the image resolution significantly means that visual information relating to the defect is degraded or completely lost.

One approach to overcome these challenges is to build two-stage models. Stage 1 models detect a region of interest, and Stage 2 models detect defects on the cropped region of interest, thereby maintaining sufficient resolution for small detects.

In this post, we go over how to build an effective two-stage defect detection system using Amazon Rekognition Custom Labels and compare results for this specific use case with one-stage models. Note that several one-stage models are effective even at lower or resized image resolutions, and others may accommodate large images in smaller batches.

Solution overview

For our use case, we use a dataset of images of PCBs with synthetically generated missing hole pins, as shown in the following example.

We use this dataset to demonstrate that a one-stage approach using object detection results in subpar detection performance for the missing hole pin defects. A two-step model is preferred, in which we use Rekognition Custom Labels first for object detection to identify the pins and then a second-stage model to classify cropped images of the pins into pins with missing holes or normal pins.

The training process for a Rekognition Custom Labels model consists of several steps, as illustrated in the following diagram.

First, we use Amazon Simple Storage Service (Amazon S3) to store the image data. The data is ingested in Amazon Sagemaker Jupyter notebooks, where typically a data scientist will inspect the images and preprocess them, removing any images that are of poor quality such as blurred images or poor lighting conditions, and resize or crop the images. Then data is split into training and test sets, and Amazon SageMaker Ground Truth labeling jobs are run to label the sets of images and output a train and test manifest file. The manifest files are used by Rekognition Custom Labels for training.

One-stage model approach

The first approach we take to identifying missing holes on the PCB is to label the missing holes and train an object detection model to identify the missing holes. The following is an image example from the dataset.

We train a model with a dataset with 95 images used as training and 20 images used for testing. The following table summarizes our results.

Evaluation Results
F1 Score Average Precision Overall Recall
0.468 0.750 0.340
Training Time Training Dataset Testing Dataset
Trained in 1.791 hours 1 label, 95 images 1 label, 20 images
Per Label Performance
Label Name F1 Score Test Images Precision Recall Assumed Threshold
missing_hole 0.468 20 0.750 0.340 0.053

The resulting model has high precision but low recall, meaning that when we localize a region for a missing hole, we’re usually correct, but we’re missing a lot of missing holes that are present on the PCB. To build an effective defect detection system, we need to improve recall. The low performance of this model may be due to the defects being small on this high-resolution image of the PCB, so the model has no reference of a healthy pin.

Next, we explore splitting the image into four or six crops depending on the PCB size and labeling both healthy and missing holes. The following is an example of the resulting cropped image.

We train a model with 524 images used as training and 106 images used for testing. We maintain the same PCBs used in train and test as the full board model. The results for cropped healthy pins vs. missing holes are shown in the following table.

Evaluation Results
F1 Score Average Precision Overall Recall
0.967 0.989 0.945
Training Time Training Dataset Testing Dataset
Trained in 2.118 hours 2 labels, 524 images 2 labels, 106 images
Per Label Performance
Label Name F1 Score Test Images Precision Recall Assumed Threshold
missing_hole 0.949 42 0.980 0.920 0.536
pin 0.984 106 0.998 0.970 0.696

Both precision and recall have improved significantly. Training the model with zoomed-in cropped images and a reference to the model for healthy pins helped. However, recall is still at 92%, meaning that we would still miss 8% of the missing holes and let defects go by unnoticed.

Next, we explore a two-stage model approach in which we can improve the model performance further.

Two-stage model approach

For the two-stage model, we train two models: one for detecting pins and one for detecting if the pin is missing or not on zoomed-in cropped images of the pin. The following is an image from the pin detection dataset.

The data is similar to our previous experiment, in which we cropped the PCB into four or six cropped images. This time, we label all pins and don’t make any distinctions if the pin has a missing hole or not. We train this model with 522 images and test with 108 images, maintaining the same train/test split as previous experiments. The results are shown in the following table.

Evaluation Results
F1 Score Average Precision Overall Recall
1.000 0.999 1.000
Training Time Training Dataset Testing Dataset
Trained in 1.581 hours 1 label, 522 images 1 label, 108 images
Per Label Performance
Label Name F1 Score Test Images Precision Recall Assumed Threshold
pin 1.000 108 0.999 1.000 0.617

The model detects the pins perfectly on this synthetic dataset.

Next, we build the model to make the distinction for missing holes. We use cropped images of the holes to train the second stage of the model, as shown in the following examples. This model is separate from the previous models because it’s a classification model and will be focused on the narrow task of determining if the pin has a missing hole.

We train this second-stage model on 16,624 images and test on 3,266, maintaining the same train/test splits as the previous experiments. The following table summarizes our results.

Evaluation Results
F1 Score Average Precision Overall Recall
1.000 1.000 1.000
Training Time Training Dataset Testing Dataset
Trained in 6.660 hours 2 labels, 16,624 images 2 labels, 3,266 images
Per Label Performance
Label Name F1 Score Test Images Precision Recall Assumed Threshold
anomaly 1.000 88 1.000 1.000 0.960
normal 1.000 3,178 1.000 1.000 0.996

Again, we receive perfect precision and recall on this synthetic dataset. Combining the previous pin detection model with this second-stage missing hole classification model, we can build a model that outperforms any single-stage model.

The following table summarizes the experiments we conducted.

Experiment Type Description F1 Score Precision Recall
1 One-stage model Object detection model to detect missing holes on full images 0.468 0.75 0.34
2 One-stage model Object detection model to detect healthy pins and missing holes on cropped images 0.967 0.989 0.945
3 Two-stage model Stage 1: Object detection on all pins 1.000 0.999 1.000
Stage 2: Image classification of healthy pin or missing holes 1.000 1.000 1.000
End-to-end average 1.000 0.9995 1.000

Inference pipeline

You can use the following architecture to deploy the one-stage and two-stage models that we described in this post. The following main components are involved:

For one-stage models, you can send an input image to the API Gateway endpoint, followed by Lambda for any basic image preprocessing, and route to the Rekognition Custom Labels trained model endpoint. In our experiments, we explored one-stage models that can detect only missing holes, and missing holes and healthy pins.

For two-stage models, you can similarly send an image to the API Gateway endpoint, followed by Lambda. Lambda acts as an orchestrator that first calls the object detection model (trained using Rekognition Custom Labels), which generates the region of interest. The original image is then cropped in the Lambda function, and sent to another Rekognition Custom Labels classification model for detecting defects in each cropped image.

Conclusion

In this post, we trained one- and two-stage models to detect missing holes in PCBs using Rekognition Custom Labels. We reported results for various models; in our case, two-stage models outperformed other variants. We encourage customers with high-resolution imagery from other domains to test model performance with one- and two-stage models. Additionally, consider the following ways to expand the solution:

  • Sliding window crops for your actual datasets
  • Reusing your object detection models in the same pipeline
  • Pre-labeling workflows using bounding box predictions

About the authors

Andreas Karagounis is a Data Science Manager at Accenture. He holds a masters in Computer Science from Brown University. He has a background in computer vision and works with customers to solve their business challenges using data science and machine learning.

Yogesh Chaturvedi is a Principal Solutions Architect at AWS with a focus in computer vision. He works with customers to address their business challenges using cloud technologies. Outside of work, he enjoys hiking, traveling, and watching sports.

Shreyas Subramanian is a Principal Data Scientist, and helps customers by using machine learning to solve their business challenges using the AWS platform. Shreyas has a background in large-scale optimization and machine learning, and in the use of machine learning and reinforcement learning for accelerating optimization tasks.

Selimcan “Can” Sakar is a cloud-first developer and Solutions Architect at AWS Accenture Business Group with a focus on emerging technologies such as GenAI, ML, and blockchain. When he isn’t watching models converge, he can be seen biking or playing the clarinet.

Read More

Automatically redact PII for machine learning using Amazon SageMaker Data Wrangler

Automatically redact PII for machine learning using Amazon SageMaker Data Wrangler

Customers increasingly want to use deep learning approaches such as large language models (LLMs) to automate the extraction of data and insights. For many industries, data that is useful for machine learning (ML) may contain personally identifiable information (PII). To ensure customer privacy and maintain regulatory compliance while training, fine-tuning, and using deep learning models, it’s often necessary to first redact PII from source data.

This post demonstrates how to use Amazon SageMaker Data Wrangler and Amazon Comprehend to automatically redact PII from tabular data as part of your machine learning operations (ML Ops) workflow.

Problem: ML data that contains PII

PII is defined as any representation of information that permits the identity of an individual to whom the information applies to be reasonably inferred by either direct or indirect means. PII is information that either directly identifies an individual (name, address, social security number or other identifying number or code, telephone number, email address, and so on) or information that an agency intends to use to identify specific individuals in conjunction with other data elements, namely, indirect identification.

Customers in business domains such as financial, retail, legal, and government deal with PII data on a regular basis. Due to various government regulations and rules, customers have to find a mechanism to handle this sensitive data with appropriate security measures to avoid regulatory fines, possible fraud, and defamation. PII redaction is the process of masking or removing sensitive information from a document so it can be used and distributed, while still protecting confidential information.

Businesses need to deliver delightful customer experiences and better business outcomes by using ML. Redaction of PII data is often a key first step to unlock the larger and richer data streams needed to use or fine-tune generative AI models, without worrying about whether their enterprise data (or that of their customers) will be compromised.

Solution overview

This solution uses Amazon Comprehend and SageMaker Data Wrangler to automatically redact PII data from a sample dataset.

Amazon Comprehend is a natural language processing (NLP) service that uses ML to uncover insights and relationships in unstructured data, with no managing infrastructure or ML experience required. It provides functionality to locate various PII entity types within text, for example names or credit card numbers. Although the latest generative AI models have demonstrated some PII redaction capability, they generally don’t provide a confidence score for PII identification or structured data describing what was redacted. The PII functionality of Amazon Comprehend returns both, enabling you to create redaction workflows that are fully auditable at scale. Additionally, using Amazon Comprehend with AWS PrivateLink means that customer data never leaves the AWS network and is continuously secured with the same data access and privacy controls as the rest of your applications.

Similar to Amazon Comprehend, Amazon Macie uses a rules-based engine to identify sensitive data (including PII) stored in Amazon Simple Storage Service (Amazon S3). However, its rules-based approach relies on having specific keywords that indicate sensitive data located close to that data (within 30 characters). In contrast, the NLP-based ML approach of Amazon Comprehend uses sematic understanding of longer chunks of text to identify PII, making it more useful for finding PII within unstructured data.

Additionally, for tabular data such as CSV or plain text files, Macie returns less detailed location information than Amazon Comprehend (either a row/column indicator or a line number, respectively, but not start and end character offsets). This makes Amazon Comprehend particularly helpful for redacting PII from unstructured text that may contain a mix of PII and non-PII words (for example, support tickets or LLM prompts) that is stored in a tabular format.

Amazon SageMaker provides purpose-built tools for ML teams to automate and standardize processes across the ML lifecycle. With SageMaker MLOps tools, teams can easily prepare, train, test, troubleshoot, deploy, and govern ML models at scale, boosting productivity of data scientists and ML engineers while maintaining model performance in production. The following diagram illustrates the SageMaker MLOps workflow.

SageMaker Pipelines

SageMaker Data Wrangler is a feature of Amazon SageMaker Studio that provides an end-to-end solution to import, prepare, transform, featurize, and analyze datasets stored in locations such as Amazon S3 or Amazon Athena, a common first step in the ML lifecycle. You can use SageMaker Data Wrangler to simplify and streamline dataset preprocessing and feature engineering by either using built-in, no-code transformations or customizing with your own Python scripts.

Using Amazon Comprehend to redact PII as part of a SageMaker Data Wrangler data preparation workflow keeps all downstream uses of the data, such as model training or inference, in alignment with your organization’s PII requirements. You can integrate SageMaker Data Wrangler with Amazon SageMaker Pipelines to automate end-to-end ML operations, including data preparation and PII redaction. For more details, refer to Integrating SageMaker Data Wrangler with SageMaker Pipelines. The rest of this post demonstrates a SageMaker Data Wrangler flow that uses Amazon Comprehend to redact PII from text stored in tabular data format.

This solution uses a public synthetic dataset along with a custom SageMaker Data Wrangler flow, available as a file in GitHub. The steps to use the SageMaker Data Wrangler flow to redact PII are as follows:

  1. Open SageMaker Studio.
  2. Download the SageMaker Data Wrangler flow.
  3. Review the SageMaker Data Wrangler flow.
  4. Add a destination node.
  5. Create a SageMaker Data Wrangler export job.

This walkthrough, including running the export job, should take 20–25 minutes to complete.

Prerequisites

For this walkthrough, you should have the following:

Open SageMaker Studio

To open SageMaker Studio, complete the following steps:

  1. On the SageMaker console, choose Studio in the navigation pane.
  2. Choose the domain and user profile
  3. Choose Open Studio.

To get started with the new capabilities of SageMaker Data Wrangler, it’s recommended to upgrade to the latest release.

Download the SageMaker Data Wrangler flow

You first need to retrieve the SageMaker Data Wrangler flow file from GitHub and upload it to SageMaker Studio. Complete the following steps:

  1. Navigate to the SageMaker Data Wrangler redact-pii.flow file on GitHub.
  2. On GitHub, choose the download icon to download the flow file to your local computer.
  3. In SageMaker Studio, choose the file icon in the navigation pane.
  4. Choose the upload icon, then choose redact-pii.flow.

Upload Data Wrangler Flow

Review the SageMaker Data Wrangler flow

In SageMaker Studio, open redact-pii.flow. After a few minutes, the flow will finish loading and show the flow diagram (see the following screenshot). The flow contains six steps: an S3 Source step followed by five transformation steps.

Data Wrangler Flow steps

On the flow diagram, choose the last step, Redact PII. The All Steps pane opens on the right and shows a list of the steps in the flow. You can expand each step to view details, change parameters, and potentially add custom code.

Data Wrangler Flow step details

Let’s walk through each step in the flow.

Steps 1 (S3 Source) and 2 (Data types) are added by SageMaker Data Wrangler whenever data is imported for a new flow. In S3 Source, the S3 URI field points to the sample dataset, which is a CSV file stored in Amazon S3. The file contains roughly 116,000 rows, and the flow sets the value of the Sampling field to 1,000, which means that SageMaker Data Wrangler will sample 1,000 rows to display in the user interface. Data types sets the data type for each column of imported data.

Step 3 (Sampling) sets the number of rows SageMaker Data Wrangler will sample for an export job to 5,000, via the Approximate sample size field. Note that this is different from the number of rows sampled to display in the user interface (Step 1). To export data with more rows, you can increase this number or remove Step 3.

Steps 4, 5, and 6 use SageMaker Data Wrangler custom transforms. Custom transforms allow you to run your own Python or SQL code within a Data Wrangler flow. The custom code can be written in four ways:

  • In SQL, using PySpark SQL to modify the dataset
  • In Python, using a PySpark data frame and libraries to modify the dataset
  • In Python, using a pandas data frame and libraries to modify the dataset
  • In Python, using a user-defined function to modify a column of the dataset

The Python (pandas) approach requires your dataset to fit into memory and can only be run on a single instance, limiting its ability to scale efficiently. When working in Python with larger datasets, we recommend using either the Python (PySpark) or Python (user-defined function) approach. SageMaker Data Wrangler optimizes Python user-defined functions to provide performance similar to an Apache Spark plugin, without needing to know PySpark or Pandas. To make this solution as accessible as possible, this post uses a Python user-defined function written in pure Python.

Expand Step 4 (Make PII column) to see its details. This step combines different types of PII data from multiple columns into a single phrase that is saved in a new column, pii_col. The following table shows an example row containing data.

customer_name customer_job billing_address customer_email
Katie Journalist 19009 Vang Squares Suite 805 hboyd@gmail.com

This is combined into the phrase “Katie is a Journalist who lives at 19009 Vang Squares Suite 805 and can be emailed at hboyd@gmail.com”. The phrase is saved in pii_col, which this post uses as the target column to redact.

Step 5 (Prep for redaction) takes a column to redact (pii_col) and creates a new column (pii_col_prep) that is ready for efficient redaction using Amazon Comprehend. To redact PII from a different column, you can change the Input column field of this step.

There are two factors to consider to efficiently redact data using Amazon Comprehend:

  • The cost to detect PII is defined on a per-unit basis, where 1 unit = 100 characters, with a 3-unit minimum charge for each document. Because tabular data often contains small amounts of text per cell, it’s generally more time- and cost-efficient to combine text from multiple cells into a single document to send to Amazon Comprehend. Doing this avoids the accumulation of overhead from many repeated function calls and ensures that the data sent is always greater than the 3-unit minimum.
  • Because we’re doing redaction as one step of a SageMaker Data Wrangler flow, we will be calling Amazon Comprehend synchronously. Amazon Comprehend sets a 100 KB (100,000 character) limit per synchronous function call, so we need to ensure that any text we send is under that limit.

Given these factors, Step 5 prepares the data to send to Amazon Comprehend by appending a delimiter string to the end of the text in each cell. For the delimiter, you can use any string that doesn’t occur in the column being redacted (ideally, one that is as few characters as possible, because they’re included in the Amazon Comprehend character total). Adding this cell delimiter allows us to optimize the call to Amazon Comprehend, and will be discussed further in Step 6.

Note that if the text in any individual cell is longer than the Amazon Comprehend limit, the code in this step truncates it to 100,000 characters (roughly equivalent to 15,000 words or 30 single-spaced pages). Although this amount of text is unlikely to be stored in in a single cell, you can modify the transformation code to handle this edge case another way if needed.

Step 6 (Redact PII) takes a column name to redact as input (pii_col_prep) and saves the redacted text to a new column (pii_redacted). When you use a Python custom function transform, SageMaker Data Wrangler defines an empty custom_func that takes a pandas series (a column of text) as input and returns a modified pandas series of the same length. The following screenshot shows part of the Redact PII step.

Data Wrangler custom function redaction code

The function custom_func contains two helper (inner) functions:

  • make_text_chunks – This function does the work of concatenating text from individual cells in the series (including their delimiters) into longer strings (chunks) to send to Amazon Comprehend.
  • redact_pii– This function takes text as input, calls Amazon Comprehend to detect PII, redacts any that is found, and returns the redacted text. Redaction is done by replacing any PII text with the type of PII found in square brackets, for example John Smith would be replaced with [NAME]. You can modify this function to replace PII with any string, including the empty string (“”) to remove it. You also could modify the function to check the confidence score of each PII entity and only redact if it’s above a specific threshold.

After the inner functions are defined, custom_func uses them to do the redaction, as shown in the following code excerpt. When the redaction is complete, it converts the chunks back into original cells, which it saves in the pii_redacted column.

# concatenate text from cells into longer chunks
chunks = make_text_chunks(series, COMPREHEND_MAX_CHARS)

redacted_chunks = []
# call Comprehend once for each chunk, and redact
for text in chunks:
  redacted_text = redact_pii(text)
  redacted_chunks.append(redacted_text)
  
# join all redacted chunks into one text string
redacted_text = ''.join(redacted_chunks)

# split back to list of the original rows
redacted_rows = redacted_text.split(CELL_DELIM)

Add a destination node

To see the result of your transformations, SageMaker Data Wrangler supports exporting to Amazon S3, SageMaker Pipelines, Amazon SageMaker Feature Store, and Python code. To export the redacted data to Amazon S3, we first need to create a destination node:

  1. In the SageMaker Data Wrangler flow diagram, choose the plus sign next to the Redact PII step.
  2. Choose Add destination, then choose Amazon S3.
  3. Provide an output name for your transformed dataset.
  4. Browse or enter the S3 location to store the redacted data file.
  5. Choose Add destination.

You should now see the destination node at the end of your data flow.

Create a SageMaker Data Wrangler export job

Now that the destination node has been added, we can create the export job to process the dataset:

  1. In SageMaker Data Wrangler, choose Create job.
  2. The destination node you just added should already be selected. Choose Next.
  3. Accept the defaults for all other options, then choose Run.

This creates a SageMaker Processing job. To view the status of the job, navigate to the SageMaker console. In the navigation pane, expand the Processing section and choose Processing jobs. Redacting all 116,000 cells in the target column using the default export job settings (two ml.m5.4xlarge instances) takes roughly 8 minutes and costs approximately $0.25. When the job is complete, download the output file with the redacted column from Amazon S3.

Clean up

The SageMaker Data Wrangler application runs on an ml.m5.4xlarge instance. To shut it down, in SageMaker Studio, choose Running Terminals and Kernels in the navigation pane. In the RUNNING INSTANCES section, find the instance labeled Data Wrangler and choose the shutdown icon next to it. This shuts down the SageMaker Data Wrangler application running on the instance.

Conclusion

In this post, we discussed how to use custom transformations in SageMaker Data Wrangler and Amazon Comprehend to redact PII data from your ML dataset. You can download the SageMaker Data Wrangler flow and start redacting PII from your tabular data today.

For other ways to enhance your MLOps workflow using SageMaker Data Wrangler custom transformations, check out Authoring custom transformations in Amazon SageMaker Data Wrangler using NLTK and SciPy. For more data preparation options, check out the blog post series that explains how to use Amazon Comprehend to react, translate, and analyze text from either Amazon Athena or Amazon Redshift.


About the Authors

Tricia JamisonTricia Jamison is a Senior Prototyping Architect on the AWS Prototyping and Cloud Acceleration (PACE) Team, where she helps AWS customers implement innovative solutions to challenging problems with machine learning, internet of things (IoT), and serverless technologies. She lives in New York City and enjoys basketball, long distance treks, and staying one step ahead of her children.

Neelam Koshiya is an Enterprise Solutions Architect at AWS. With a background in software engineering, she organically moved into an architecture role. Her current focus is helping enterprise customers with their cloud adoption journey for strategic business outcomes with the area of depth being AI/ML. She is passionate about innovation and inclusion. In her spare time, she enjoys reading and being outdoors.

Adeleke Coker is a Global Solutions Architect with AWS. He works with customers globally to provide guidance and technical assistance in deploying production workloads at scale on AWS. In his spare time, he enjoys learning, reading, gaming and watching sport events.

Read More

English learners can now practice speaking on Search

English learners can now practice speaking on Search

Learning a language can open up new opportunities in a person’s life. It can help people connect with those from different cultures, travel the world, and advance their career. English alone is estimated to have 1.5 billion learners worldwide. Yet proficiency in a new language is difficult to achieve, and many learners cite a lack of opportunity to practice speaking actively and receiving actionable feedback as a barrier to learning.

We are excited to announce a new feature of Google Search that helps people practice speaking and improve their language skills. Within the next few days, Android users in Argentina, Colombia, India (Hindi), Indonesia, Mexico, and Venezuela can get even more language support from Google through interactive speaking practice in English — expanding to more countries and languages in the future. Google Search is already a valuable tool for language learners, providing translations, definitions, and other resources to improve vocabulary. Now, learners translating to or from English on their Android phones will find a new English speaking practice experience with personalized feedback.

A new feature of Google Search allows learners
to practice speaking words in context.

Learners are presented with real-life prompts and then form their own spoken answers using a provided vocabulary word. They engage in practice sessions of 3-5 minutes, getting personalized feedback and the option to sign up for daily reminders to keep practicing. With only a smartphone and some quality time, learners can practice at their own pace, anytime, anywhere.

Activities with personalized feedback, to supplement existing learning tools

Designed to be used alongside other learning services and resources, like personal tutoring, mobile apps, and classes, the new speaking practice feature on Google Search is another tool to assist learners on their journey.

We have partnered with linguists, teachers, and ESL/EFL pedagogical experts to create a speaking practice experience that is effective and motivating. Learners practice vocabulary in authentic contexts, and material is repeated over dynamic intervals to increase retention — approaches that are known to be effective in helping learners become confident speakers. As one partner of ours shared:

“Speaking in a given context is a skill that language learners often lack the opportunity to practice. Therefore this tool is very useful to complement classes and other resources.” – Judit Kormos, Professor, Lancaster University

We are also excited to be working with several language learning partners to surface content they are helping create and to connect them with learners around the world. We look forward to expanding this program further and working with any interested partner.

Personalized real-time feedback

Every learner is different, so delivering personalized feedback in real time is a key part of effective practice. Responses are analyzed to provide helpful, real-time suggestions and corrections.

The system gives semantic feedback, indicating whether their response was relevant to the question and may be understood by a conversation partner. Grammar feedback provides insights into possible grammatical improvements, and a set of example answers at varying levels of language complexity give concrete suggestions for alternative ways to respond in this context.

The feedback is composed of three elements: Semantic analysis, grammar correction, and example answers.

Contextual translation

Among the several new technologies we developed, contextual translation provides the ability to translate individual words and phrases in context. During practice sessions, learners can tap on any word they don’t understand to see the translation of that word considering its context.

Example of contextual translation feature.

This is a difficult technical task, since individual words in isolation often have multiple alternative meanings, and multiple words can form clusters of meaning that need to be translated in unison. Our novel approach translates the entire sentence, then estimates how the words in the original and the translated text relate to each other. This is commonly known as the word alignment problem.

Example of a translated sentence pair and its word alignment. A deep learning alignment model connects the different words that create the meaning to suggest a translation.

The key technology piece that enables this functionality is a novel deep learning model developed in collaboration with the Google Translate team, called Deep Aligner. The basic idea is to take a multilingual language model trained on hundreds of languages, then fine-tune a novel alignment model on a set of word alignment examples (see the figure above for an example) provided by human experts, for several language pairs. From this, the single model can then accurately align any language pair, reaching state-of-the-art alignment error rate (AER, a metric to measure the quality of word alignments, where lower is better). This single new model has led to dramatic improvements in alignment quality across all tested language pairs, reducing average AER from 25% to 5% compared to alignment approaches based on Hidden Markov models (HMMs).

Alignment error rates (lower is better) between English (EN) and other languages.

This model is also incorporated into Google’s translation APIs, greatly improving, for example, the formatting of translated PDFs and websites in Chrome, the translation of YouTube captions, and enhancing Google Cloud’s translation API.

Grammar feedback

To enable grammar feedback for accented spoken language, our research teams adapted grammar correction models for written text (see the blog and paper) to work on automatic speech recognition (ASR) transcriptions, specifically for the case of accented speech. The key step was fine-tuning the written text model on a corpus of human and ASR transcripts of accented speech, with expert-provided grammar corrections. Furthermore, inspired by previous work, the teams developed a novel edit-based output representation that leverages the high overlap between the inputs and outputs that is particularly well-suited for short input sentences common in language learning settings.

The edit representation can be explained using an example:

  • Input: I1 am2 so3 bad4 cooking5
  • Correction: I1 am2 so3 bad4 at5 cooking6
  • Edits: (‘at’, 4, PREPOSITION, 4)

In the above, “at” is the word that is inserted at position 4 and “PREPOSITION” denotes this is an error involving prepositions. We used the error tag to select tag-dependent acceptance thresholds that improved the model further. The model increased the recall of grammar problems from 4.6% to 35%.

Some example output from our model and a model trained on written corpora:

    Example 1     Example 2
User input (transcribed speech)

I live of my profession. I need a efficient card and reliable.
Text-based grammar model

I live by my profession. I need an efficient card and a reliable.
New speech-optimized model

I live off my profession. I need an efficient and reliable card.

Semantic analysis

A primary goal of conversation is to communicate one’s intent clearly. Thus, we designed a feature that visually communicates to the learner whether their response was relevant to the context and would be understood by a partner. This is a difficult technical problem, since early language learners’ spoken responses can be syntactically unconventional. We had to carefully balance this technology to focus on the clarity of intent rather than correctness of syntax.

Our system utilizes a combination of two approaches:

  1. Sensibility classification: Large language models like LaMDA or PaLM are designed to give natural responses in a conversation, so it’s no surprise that they do well on the reverse: judging whether a given response is contextually sensible.
  2. Similarity to good responses: We used an encoder architecture to compare the learner’s input to a set of known good responses in a semantic embedding space. This comparison provides another useful signal on semantic relevance, further improving the quality of feedback and suggestions we provide.
The system provides feedback about whether the response was relevant to the prompt, and would be understood by a communication partner.

ML-assisted content development

Our available practice activities present a mix of human-expert created content, and content that was created with AI assistance and human review. This includes speaking prompts, focus words, as well as sets of example answers that showcase meaningful and contextual responses.

A list of example answers is provided when the learner receives feedback and when they tap the help button.

Since learners have different levels of ability, the language complexity of the content has to be adjusted appropriately. Prior work on language complexity estimation focuses on text of paragraph length or longer, which differs significantly from the type of responses that our system processes. Thus, we developed novel models that can estimate the complexity of a single sentence, phrase, or even individual words. This is challenging because even a phrase composed of simple words can be hard for a language learner (e.g., “Let’s cut to the chase”). Our best model is based on BERT and achieves complexity predictions closest to human expert consensus. The model was pre-trained using a large set of LLM-labeled examples, and then fine-tuned using a human expert–labeled dataset.

Mean squared error of various approaches’ performance estimating content difficulty on a diverse corpus of ~450 conversational passages (text / transcriptions). Top row: Human raters labeled the items on a scale from 0.0 to 5.0, roughly aligned to the CEFR scale (from A1 to C2). Bottom four rows: Different models performed the same task, and we show the difference to the human expert consensus.

Using this model, we can evaluate the difficulty of text items, offer a diverse range of suggestions, and most importantly challenge learners appropriately for their ability levels. For example, using our model to label examples, we can fine-tune our system to generate speaking prompts at various language complexity levels.

Vocabulary focus words, to be elicited by the questions
    guitar     apple     lion
Simple     What do you like to play?     Do you like fruit?     Do you like big cats?
Intermediate     Do you play any musical instruments?     What is your favorite fruit?     What is your favorite animal?
Complex     What stringed instrument do you enjoy playing?     Which type of fruit do you enjoy eating for its crunchy texture and sweet flavor?     Do you enjoy watching large, powerful predators?

Furthermore, content difficulty estimation is used to gradually increase the task difficulty over time, adapting to the learner’s progress.

Conclusion

With these latest updates, which will roll out over the next few days, Google Search has become even more helpful. If you are an Android user in India (Hindi), Indonesia, Argentina, Colombia, Mexico, or Venezuela, give it a try by translating to or from English with Google.

We look forward to expanding to more countries and languages in the future, and to start offering partner practice content soon.

Acknowledgements

Many people were involved in the development of this project. Among many others, we thank our external advisers in the language learning field: Jeffrey Davitz, Judit Kormos, Deborah Healey, Anita Bowles, Susan Gaer, Andrea Revesz, Bradley Opatz, and Anne Mcquade.

Read More

What’s Your Story: Ranveer Chandra

What’s Your Story: Ranveer Chandra

MSR Podcast

In this new Microsoft Research Podcast series What’s Your Story, Lab Director Johannes Gehrke explores the who behind the technical and scientific advancements helping to reshape the world. He talks to members of the research community at Microsoft about what motivates their work and how they got where they are today. 

Ranveer Chandra is Managing Director of Research for Industry and CTO of Agri-Food. He is also head of Networking Research at Microsoft Research Redmond. His work in systems and networking is helping to bring more internet connectivity to more people and is yielding tools designed to help farmers increase food production more affordably and sustainably. In this episode, he shares what it was like growing up in Jamshedpur, India; why he focuses his efforts in the areas he does; and where the joy in his work comes from.

The post What’s Your Story: Ranveer Chandra appeared first on Microsoft Research.

Read More

Coming in Clutch: Stream ‘Counter-Strike 2’ From the Cloud for Highest Frame Rates

Coming in Clutch: Stream ‘Counter-Strike 2’ From the Cloud for Highest Frame Rates

Rush to the cloud — stream Counter-Strike 2 on GeForce NOW for the highest frame rates. Members can play through the newest chapter of Valve’s elite, competitive, first-person shooter from the cloud.

It’s all part of an action-packed GFN Thursday, with 22 more games joining the cloud gaming platform’s library, including Hot Wheels Unleashed 2 – Turbocharged.

“Rush B! Rush B!”

 

Counter-Strike 2 is the long-awaited upgrade to one of the most recognizable competitive first-person shooters in the world.

Building on the legacy of Counter-Strike: Global Offensive, the latest iteration brings the action to Valve’s long-anticipated Source 2 video game engine, promising enhanced graphical fidelity with a physically based rendering system for more realistic textures and materials, dynamic lighting, reflections and more.

Smoke grenades are now dynamic volumetric objects that can interact with their surroundings by reacting to lighting and other environmental effects. And smoke particles work with the unified lighting system, allowing for more realistic light and color.

Even better: GeForce NOW Ultimate members can take full advantage of NVIDIA Reflex for ultra-low-latency gameplay streaming from the cloud. Rush the objective with the squad on Counter-Strike 2’s remastered maps at up to 240 frames per second — a first for cloud gaming. Upgrade today for the Ultimate Counter-Strike experience.

Vroom, Vroom!

We’re going turbo.

There’s more action around every turn of the GeForce NOW library. Put the pedal to the metal in Hot Wheels Unleashed 2 – Turbocharged, one of 22 newly supported games joining this week:

  • Wizard With a Gun (New release on Steam, Oct. 17)
  • Alaskan Road Truckers (New release on Steam, Oct. 18)
  • Hellboy: Web of Wyrd (New release on Steam, Oct. 18)
  • AirportSim (New release on Steam, Oct. 19)
  • Eternal Threads (New release on Epic Games Store, Oct. 19)
  • Hot Wheels Unleashed 2 – Turbocharged (New release on Steam, Oct. 19)
  • Laika Aged Through Blood (New release on Steam, Oct. 19)
  • Battle Chasers: Nightwar (Xbox, available on Microsoft Store)
  • Black Skylands (Xbox, available on Microsoft Store)
  • Blair Witch (Xbox, available on Microsoft Store)
  • Chicory: A Colorful Tale (Xbox and available on PC Game Pass)
  • Dead by Daylight (Xbox and available on PC Game Pass)
  • Dune: Spice Wars (Xbox and available on PC Game Pass)
  • Everspace 2 (Xbox and available on PC Game Pass)
  • EXAPUNKS (Xbox and available on PC Game Pass)
  • Gungrave G.O.R.E (Xbox and available on PC Game Pass)
  • Railway Empire 2 (Xbox and available on PC Game Pass)
  • Techtonica (Xbox and available on PC Game Pass)
  • Teenage Mutant Ninja Turtles: Shredder’s Revenge (Xbox and available on PC Game Pass)
  • Torchlight III (Xbox and available on PC Game Pass)
  • Trine 5: A Clockwork Conspiracy (Epic Games Store)
  • Vampire Survivors (Xbox, available on PC Game Pass)

What are you planning to play this weekend? Let us know on Twitter or in the comments below.

Read More

Evaluating social and ethical risks from generative AI

Evaluating social and ethical risks from generative AI

Generative AI systems are already being used to write books, create graphic designs, assist medical practitioners, and are becoming increasingly capable. To ensure these systems are developed and deployed responsibly requires carefully evaluating the potential ethical and social risks they may pose.In our paper, we propose a three-layered framework for evaluating the social and ethical risks of AI systems. This framework includes evaluations of AI system capability, human interaction, and systemic impacts.We also map the current state of safety evaluations and find three main gaps: context, specific risks, and multimodality. To help close these gaps, we call for repurposing existing evaluation methods for generative AI and for implementing a comprehensive approach to evaluation, as in our case study on misinformation. This approach integrates findings like how likely the AI system is to provide factually incorrect information, with insights on how people use that system, and in what context. Multi-layered evaluations can draw conclusions beyond model capability and indicate whether harm — in this case, misinformation — actually occurs and spreads. To make any technology work as intended, both social and technical challenges must be solved. So to better assess AI system safety, these different layers of context must be taken into account. Here, we build upon earlier research identifying the potential risks of large-scale language models, such as privacy leaks, job automation, misinformation, and more — and introduce a way of comprehensively evaluating these risks going forward.Read More

Measurement-induced entanglement phase transitions in a quantum circuit

Measurement-induced entanglement phase transitions in a quantum circuit

Quantum mechanics allows many phenomena that are classically impossible: a quantum particle can exist in a superposition of two states simultaneously or be entangled with another particle, such that anything you do to one seems to instantaneously also affect the other, regardless of the space between them. But perhaps no aspect of quantum theory is as striking as the act of measurement. In classical mechanics, a measurement need not affect the system being studied. But a measurement on a quantum system can profoundly influence its behavior. For example, when a quantum bit of information, called a qubit, that is in a superposition of both “0” and “1” is measured, its state will suddenly collapse to one of the two classically allowed states: it will be either “0” or “1,” but not both. This transition from the quantum to classical worlds seems to be facilitated by the act of measurement. How exactly it occurs is one of the fundamental unanswered questions in physics.

In a large system comprising many qubits, the effect of measurements can cause new phases of quantum information to emerge. Similar to how changing parameters such as temperature and pressure can cause a phase transition in water from liquid to solid, tuning the strength of measurements can induce a phase transition in the entanglement of qubits.

Today in “Measurement-induced entanglement and teleportation on a noisy quantum processor”, published in Nature, we describe experimental observations of measurement-induced effects in a system of 70 qubits on our Sycamore quantum processor. This is, by far, the largest system in which such a phase transition has been observed. Additionally, we detected “quantum teleportation” — when a quantum state is transferred from one set of qubits to another, detectable even if the details of that state are unknown — which emerged from measurements of a random circuit. We achieved this breakthrough by implementing a few clever “tricks” to more readily see the signatures of measurement-induced effects in the system.

Background: Measurement-induced entanglement

Consider a system of qubits that start out independent and unentangled with one another. If they interact with one another , they will become entangled. You can imagine this as a web, where the strands represent the entanglement between qubits. As time progresses, this web grows larger and more intricate, connecting increasingly disparate points together.

A full measurement of the system completely destroys this web, since every entangled superposition of qubits collapses when it’s measured. But what happens when we make a measurement on only a few of the qubits? Or if we wait a long time between measurements? During the intervening time, entanglement continues to grow. The web’s strands may not extend as vastly as before, but there are still patterns in the web.

There is a balancing point between the strength of interactions and measurements, which compete to affect the intricacy of the web. When interactions are strong and measurements are weak, entanglement remains robust and the web’s strands extend farther, but when measurements begin to dominate, the entanglement web is destroyed. We call the crossover between these two extremes the measurement-induced phase transition.

In our quantum processor, we observe this measurement-induced phase transition by varying the relative strengths between interactions and measurement. We induce interactions by performing entangling operations on pairs of qubits. But to actually see this web of entanglement in an experiment is notoriously challenging. First, we can never actually look at the strands connecting the qubits — we can only infer their existence by seeing statistical correlations between the measurement outcomes of the qubits. So, we need to repeat the same experiment many times to infer the pattern of the web. But there’s another complication: the web pattern is different for each possible measurement outcome. Simply averaging all of the experiments together without regard for their measurement outcomes would wash out the webs’ patterns. To address this, some previous experiments used “post-selection,” where only data with a particular measurement outcome is used and the rest is thrown away. This, however, causes an exponentially decaying bottleneck in the amount of “usable” data you can acquire. In addition, there are also practical challenges related to the difficulty of mid-circuit measurements with superconducting qubits and the presence of noise in the system.

How we did it

To address these challenges, we introduced three novel tricks to the experiment that enabled us to observe measurement-induced dynamics in a system of up to 70 qubits.

Trick 1: Space and time are interchangeable

As counterintuitive as it may seem, interchanging the roles of space and time dramatically reduces the technical challenges of the experiment. Before this “space-time duality” transformation, we would have had to interleave measurements with other entangling operations, frequently checking the state of selected qubits. Instead, after the transformation, we can postpone all measurements until after all other operations, which greatly simplifies the experiment. As implemented here, this transformation turns the original 1-spatial-dimensional circuit we were interested in studying into a 2-dimensional one. Additionally, since all measurements are now at the end of the circuit, the relative strength of measurements and entangling interactions is tuned by varying the number of entangling operations performed in the circuit.

Exchanging space and time. To avoid the complication of interleaving measurements into our experiment (shown as gauges in the left panel), we utilize a space-time duality mapping to exchange the roles of space and time. This mapping transforms the 1D circuit (left) into a 2D circuit (right), where the circuit depth (T) now tunes the effective measurement rate.

Trick 2: Overcoming the post-selection bottleneck

Since each combination of measurement outcomes on all of the qubits results in a unique web pattern of entanglement, researchers often use post-selection to examine the details of a particular web. However, because this method is very inefficient, we developed a new “decoding” protocol that compares each instance of the real “web” of entanglement to the same instance in a classical simulation. This avoids post-selection and is sensitive to features that are common to all of the webs. This common feature manifests itself into a combined classical–quantum “order parameter”, akin to the cross-entropy benchmark used in the random circuit sampling used in our beyond-classical demonstration.

This order parameter is calculated by selecting one of the qubits in the system as the “probe” qubit, measuring it, and then using the measurement record of the nearby qubits to classically “decode” what the state of the probe qubit should be. By cross-correlating the measured state of the probe with this “decoded” prediction, we can obtain the entanglement between the probe qubit and the rest of the (unmeasured) qubits. This serves as an order parameter, which is a proxy for determining the entanglement characteristics of the entire web.

In the decoding procedure we choose a “probe” qubit (pink) and classically compute its expected value, conditional on the measurement record of the surrounding qubits (yellow). The order parameter is then calculated by the cross correlation between the measured probe bit and the classically computed value.

Trick 3: Using noise to our advantage

A key feature of the so-called “disentangling phase” — where measurements dominate and entanglement is less widespread — is its insensitivity to noise. We can therefore look at how the probe qubit is affected by noise in the system and use that to differentiate between the two phases. In the disentangling phase, the probe will be sensitive only to local noise that occurs within a particular area near the probe. On the other hand, in the entangling phase, any noise in the system can affect the probe qubit. In this way, we are turning something that is normally seen as a nuisance in experiments into a unique probe of the system.

What we saw

We first studied how the order parameter was affected by noise in each of the two phases. Since each of the qubits is noisy, adding more qubits to the system adds more noise. Remarkably, we indeed found that in the disentangling phase the order parameter is unaffected by adding more qubits to the system. This is because, in this phase, the strands of the web are very short, so the probe qubit is only sensitive to the noise of its nearest qubits. In contrast, we found that in the entangling phase, where the strands of the entanglement web stretch longer, the order parameter is very sensitive to the size of the system, or equivalently, the amount of noise in the system. The transition between these two sharply contrasting behaviors indicates a transition in the entanglement character of the system as the “strength” of measurement is increased.

Order parameter vs. gate density (number of entangling operations) for different numbers of qubits. When the number of entangling operations is low, measurements play a larger role in limiting the entanglement across the system. When the number of entangling operations is high, entanglement is widespread, which results in the dependence of the order parameter on system size (inset).

In our experiment, we also demonstrated a novel form of quantum teleportation that arises in the entangling phase. Typically, a specific set of operations are necessary to implement quantum teleportation, but here, the teleportation emerges from the randomness of the non-unitary dynamics. When all qubits, except the probe and another system of far away qubits, are measured, the remaining two systems are strongly entangled with each other. Without measurement, these two systems of qubits would be too far away from each other to know about the existence of each other. With measurements, however, entanglement can be generated faster than the limits typically imposed by locality and causality. This “measurement-induced entanglement” between the qubits (that must also be aided with a classical communications channel) is what allows for quantum teleportation to occur.

Proxy entropy vs. gate density for two far separated subsystems (pink and black qubits) when all other qubits are measured. There is a finite-size crossing at ~0.9. Above this gate density, the probe qubit is entangled with qubits on the opposite side of the system and is a signature of the teleporting phase.

Conclusion

Our experiments demonstrate the effect of measurements on a quantum circuit. We show that by tuning the strength of measurements, we can induce transitions to new phases of quantum entanglement within the system and even generate an emergent form of quantum teleportation. This work could potentially have relevance to quantum computing schemes, where entanglement and measurements both play a role.

Acknowledgements

This work was done while Jesse Hoke was interning at Google from Stanford University. We would like to thank Katie McCormick, our Quantum Science Communicator, for helping to write this blog post.

Read More