Why detecting damage is so tricky at Amazon’s scale, and how researchers are training robots to help with that gargantuan task.
Keeping Learning-Based Control Safe by Regulating Distributional Shift
To regulate the distribution shift experienced by learning-based controllers, we seek a mechanism for constraining the agent to regions of high data density throughout its trajectory (left). Here, we present an approach which achieves this goal by combining features of density models (middle) and Lyapunov functions (right).
To use machine learning and reinforcement learning in controlling real-world systems, we must design algorithms which not only achieve good performance, but also interact with the system in a safe and reliable manner. Most prior work on safety-critical control focuses on maintaining the safety of the physical system, e.g., keeping legged robots from falling over or autonomous vehicles from colliding with obstacles. However, for learning-based controllers, there is another source of safety concern: because machine learning models are only optimized to output correct predictions on the training data, they are prone to outputting erroneous predictions when evaluated on out-of-distribution inputs. Thus, if an agent visits a state or takes an action that is very different from those in the training data, a learning-enabled controller may “exploit” the inaccuracies in its learned component and output actions that are suboptimal or even dangerous.
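To make the idea of "regions of high data density" concrete, here is a minimal sketch of flagging out-of-distribution states with a density threshold. It is not the approach described in this post, which combines density models with Lyapunov functions; it only illustrates the density half, using a plain Gaussian kernel density estimate as a stand-in. The function names, the bandwidth, the threshold value, and the toy training set are all assumptions made for illustration.

```python
import math

def kde_log_density(x, data, bandwidth=0.5):
    """Log of a Gaussian kernel density estimate at point x.

    x is a tuple of floats; data is a list of such tuples (the training states).
    """
    dim = len(x)
    norm = (2 * math.pi * bandwidth ** 2) ** (-dim / 2)
    total = 0.0
    for p in data:
        sq_dist = sum((xi - pi) ** 2 for xi, pi in zip(x, p))
        total += math.exp(-sq_dist / (2 * bandwidth ** 2))
    density = norm * total / len(data)
    return math.log(density) if density > 0 else float("-inf")

def in_distribution(x, data, log_threshold, bandwidth=0.5):
    """True if the estimated log density at x clears the threshold."""
    return kde_log_density(x, data, bandwidth) >= log_threshold

# Hypothetical training data: states on a small grid around the origin.
train_states = [(0.1 * i, 0.1 * j) for i in range(-5, 6) for j in range(-5, 6)]
LOG_THRESHOLD = -4.0  # assumed cutoff; in practice this would be tuned

near_state = (0.2, -0.1)  # close to the training data
far_state = (5.0, 5.0)    # far from anything seen in training

print(in_distribution(near_state, train_states, LOG_THRESHOLD))
print(in_distribution(far_state, train_states, LOG_THRESHOLD))
```

A check like this only says whether the current state looks familiar. It says nothing about whether the agent will stay in familiar states along its future trajectory, which is the gap the Lyapunov-style component is meant to address.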
Protecting maternal health in Rwanda
The world is facing a maternal health crisis. According to the World Health Organization, approximately 810 women die each day due to preventable causes related to pregnancy and childbirth. Two-thirds of these deaths occur in sub-Saharan Africa. In Rwanda, one of the leading causes of maternal mortality is infected Cesarean section wounds.
An interdisciplinary team of doctors and researchers from MIT, Harvard University, and Partners in Health (PIH) in Rwanda has proposed a solution to address this problem. They have developed a mobile health (mHealth) platform that uses artificial intelligence and real-time computer vision to predict infection in C-section wounds with roughly 90 percent accuracy.
“Early detection of infection is an important issue worldwide, but in low-resource areas such as rural Rwanda, the problem is even more dire due to a lack of trained doctors and the high prevalence of bacterial infections that are resistant to antibiotics,” says Richard Ribon Fletcher ’89, SM ’97, PhD ’02, research scientist in mechanical engineering at MIT and technology lead for the team. “Our idea was to employ mobile phones that could be used by community health workers to visit new mothers in their homes and inspect their wounds to detect infection.”
This summer, the team, which is led by Bethany Hedt-Gauthier, a professor at Harvard Medical School, was awarded the $500,000 first-place prize in the NIH Technology Accelerator Challenge for Maternal Health.
“The lives of women who deliver by Cesarean section in the developing world are compromised by both limited access to quality surgery and postpartum care,” adds Fredrick Kateera, a team member from PIH. “Use of mobile health technologies for early identification, plausible accurate diagnosis of those with surgical site infections within these communities would be a scalable game changer in optimizing women’s health.”
Training algorithms to detect infection
The project’s inception was the result of several chance encounters. In 2017, Fletcher and Hedt-Gauthier bumped into each other on the Washington Metro during an NIH investigator meeting. Hedt-Gauthier, who had been working on research projects in Rwanda for five years at that point, was seeking a solution for the gap in Cesarean care she and her collaborators had encountered in their research. Specifically, she was interested in exploring the use of cell phone cameras as a diagnostic tool.
Fletcher, who leads a group of students in Professor Sanjay Sarma’s AutoID Lab and has spent decades applying phones, machine learning algorithms, and other mobile technologies to global health, was a natural fit for the project.
“Once we realized that these types of image-based algorithms could support home-based care for women after Cesarean delivery, we approached Dr. Fletcher as a collaborator, given his extensive experience in developing mHealth technologies in low- and middle-income settings,” says Hedt-Gauthier.
During that same trip, Hedt-Gauthier serendipitously sat next to Audace Nakeshimana ’20, who was a new MIT student from Rwanda and would later join Fletcher’s team at MIT. With Fletcher’s mentorship, during his senior year, Nakeshimana founded Insightiv, a Rwandan startup that is applying AI algorithms for analysis of clinical images, and was a top grant awardee at the annual MIT IDEAS competition in 2020.
The first step in the project was gathering a database of wound images taken by community health workers in rural Rwanda. They collected over 1,000 images of both infected and non-infected wounds and then trained an algorithm using that data.
A central problem emerged with this first dataset, collected between 2018 and 2019. Many of the photographs were of poor quality.
“The quality of wound images collected by the health workers was highly variable and it required a large amount of manual labor to crop and resample the images. Since these images are used to train the machine learning model, the image quality and variability fundamentally limits the performance of the algorithm,” says Fletcher.
To solve this issue, Fletcher turned to tools he used in previous projects: real-time computer vision and augmented reality.
Improving image quality with real-time image processing
To encourage community health workers to take higher-quality images, Fletcher and the team revised the wound screener mobile app and paired it with a simple paper frame. The frame contained a printed calibration color pattern and another optical pattern that guides the app’s computer vision software.
Health workers are instructed to place the frame over the wound and open the app, which provides real-time feedback on the camera placement. The app uses augmented reality to display a green check mark when the phone is in the proper range. Once the phone is in range, other parts of the computer vision software automatically balance the color, crop the image, and apply transformations to correct for parallax.
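The color-balancing and cropping steps can be illustrated with a small sketch. This is not the team's software: it assumes the calibration pattern includes a single neutral gray patch of known color, represents an image as nested lists of RGB tuples, and skips parallax correction entirely. All function names and values are hypothetical.

```python
def channel_gains(measured_patch, reference_patch):
    """Per-channel gains that map the photographed patch to its known color."""
    return tuple(ref / max(meas, 1e-6)
                 for meas, ref in zip(measured_patch, reference_patch))

def apply_gains(pixel, gains):
    """Scale one RGB pixel and clamp each channel to the 0-255 range."""
    return tuple(min(255, max(0, round(c * g))) for c, g in zip(pixel, gains))

def balance_image(image, gains):
    """Apply the same gains to every pixel of a row-major image."""
    return [[apply_gains(px, gains) for px in row] for row in image]

def crop(image, top, left, height, width):
    """Return the height x width window whose corner is at (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]

# Assumed values: the gray patch is printed as (128, 128, 128) but was
# photographed under warm light as (160, 128, 100).
REFERENCE_GRAY = (128, 128, 128)
measured_gray = (160, 128, 100)
gains = channel_gains(measured_gray, REFERENCE_GRAY)

# A tiny 3x4 test image filled with one color.
image = [[(200, 100, 50) for _ in range(4)] for _ in range(3)]
balanced = balance_image(image, gains)
window = crop(balanced, top=1, left=1, height=2, width=2)
```

Applying the gains to the photographed patch itself recovers the reference gray, which is a quick sanity check that the correction is doing what it should.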
“By using real-time computer vision at the time of data collection, we are able to generate beautiful, clean, uniform color-balanced images that can then be used to train our machine learning models, without any need for manual data cleaning or post-processing,” says Fletcher.
Using convolutional neural net (CNN) machine learning models, along with a method called transfer learning, the software has been able to successfully predict infection in C-section wounds with roughly 90 percent accuracy within 10 days of childbirth. Women who are predicted to have an infection through the app are then given a referral to a clinic where they can receive diagnostic bacterial testing and can be prescribed life-saving antibiotics as needed.
The app has been well received by women and community health workers in Rwanda.
“The trust that women have in community health workers, who were a big promoter of the app, meant the mHealth tool was accepted by women in rural areas,” adds Anne Niyigena of PIH.
Using thermal imaging to address algorithmic bias
One of the biggest hurdles to scaling this AI-based technology to a more global audience is algorithmic bias. When trained on a relatively homogenous population, such as that of rural Rwanda, the algorithm performs as expected and can successfully predict infection. But when images of patients of varying skin colors are introduced, the algorithm is less effective.
To tackle this issue, Fletcher used thermal imaging. Simple thermal camera modules, designed to attach to a cell phone, cost approximately $200 and can be used to capture infrared images of wounds. Algorithms can then be trained using the heat patterns of infrared wound images to predict infection. A study published last year showed over a 90 percent prediction accuracy when these thermal images were paired with the app’s CNN algorithm.
While more expensive than simply using the phone’s camera, the thermal image approach could be used to scale the team’s mHealth technology to a more diverse, global population.
“We’re giving the health staff two options: in a homogenous population, like rural Rwanda, they can use their standard phone camera, using the model that has been trained with data from the local population. Otherwise, they can use the more general model which requires the thermal camera attachment,” says Fletcher.
While the current generation of the mobile app uses a cloud-based algorithm to run the infection prediction model, the team is now working on a stand-alone mobile app that does not require internet access, and also looks at all aspects of maternal health, from pregnancy to postpartum.
In addition to developing the library of wound images used in the algorithms, Fletcher is working closely with former student Nakeshimana and his team at Insightiv on the app’s development, and using the Android phones that are locally manufactured in Rwanda. PIH will then conduct user testing and field-based validation in Rwanda.
As the team looks to develop the comprehensive app for maternal health, privacy and data protection are a top priority.
“As we develop and refine these tools, a closer attention must be paid to patients’ data privacy. More data security details should be incorporated so that the tool addresses the gaps it is intended to bridge and maximizes user’s trust, which will eventually favor its adoption at a larger scale,” says Niyigena.
Members of the prize-winning team include: Bethany Hedt-Gauthier from Harvard Medical School; Richard Fletcher from MIT; Robert Riviello from Brigham and Women’s Hospital; Adeline Boatin from Massachusetts General Hospital; Anne Niyigena, Fredrick Kateera, Laban Bikorimana, and Vincent Cubaka from PIH in Rwanda; and Audace Nakeshimana ’20, founder of Insightiv.ai.
Google at Interspeech 2022
This week, the 23rd Annual Conference of the International Speech Communication Association (INTERSPEECH 2022) is being held in Incheon, South Korea, representing one of the world’s most extensive conferences on research and technology of spoken language understanding and processing. Over 2,000 experts in speech-related research fields gather to take part in oral presentations and poster sessions and to collaborate with streamed events across the globe.
We are excited to be a Diamond Sponsor of INTERSPEECH 2022, where we will be showcasing nearly 50 research publications and supporting a number of workshops, special sessions and tutorials. We welcome in-person attendees to drop by the Google booth to meet our researchers and participate in Q&As and demonstrations of some of our latest speech technologies, which help to improve accessibility and provide convenience in communication for billions of users. In addition, online attendees are encouraged to visit our virtual booth in GatherTown where you can get up-to-date information on research and opportunities at Google. You can also learn more about the Google research being presented at INTERSPEECH 2022 below (Google affiliations in bold).
Organizing Committee
Industry Liaisons include: Bhuvana Ramabhadran
Area Chairs include: John Hershey, Heiga Zen, Shrikanth Narayanan, Bastiaan Kleijn
ISCA Fellows
Include: Tara Sainath, Heiga Zen
Production Federated Keyword Spotting via Distillation, Filtering, and Joint Federated-Centralized Training
Andrew Hard, Kurt Partridge, Neng Chen, Sean Augenstein, Aishanee Shah, Hyun Jin Park, Alex Park, Sara Ng, Jessica Nguyen, Ignacio Lopez Moreno, Rajiv Mathews, Françoise Beaufays
Leveraging Unsupervised and Weakly-Supervised Data to Improve Direct Speech-to-Speech Translation
Ye Jia, Yifan Ding, Ankur Bapna, Colin Cherry, Yu Zhang, Alexis Conneau, Nobu Morioka
Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition
W. Ronny Huang, Cal Peyser, Tara N. Sainath, Ruoming Pang, Trevor Strohman, Shankar Kumar
UserLibri: A Dataset for ASR Personalization Using Only Text
Theresa Breiner, Swaroop Ramaswamy, Ehsan Variani, Shefali Garg, Rajiv Mathews, Khe Chai Sim, Kilol Gupta, Mingqing Chen, Lara McConnaughey
SNRi Target Training for Joint Speech Enhancement and Recognition
Yuma Koizumi, Shigeki Karita, Arun Narayanan, Sankaran Panchapagesan, Michiel Bacchiani
Turn-Taking Prediction for Natural Conversational Speech
Shuo-Yiin Chang, Bo Li, Tara Sainath, Chao Zhang, Trevor Strohman, Qiao Liang, Yanzhang He
Streaming Intended Query Detection Using E2E Modeling for Continued Conversation
Shuo-Yiin Chang, Guru Prakash, Zelin Wu, Tara Sainath, Bo Li, Qiao Liang, Adam Stambler, Shyam Upadhyay, Manaal Faruqui, Trevor Strohman
Improving Distortion Robustness of Self-Supervised Speech Processing Tasks with Domain Adaptation
Kuan Po Huang, Yu-Kuan Fu, Yu Zhang, Hung-yi Lee
XLS-R: Self-Supervised Cross-Lingual Speech Representation Learning at Scale
Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli
Extracting Targeted Training Data from ASR Models, and How to Mitigate It
Ehsan Amid, Om Thakkar, Arun Narayanan, Rajiv Mathews, Françoise Beaufays
Detecting Unintended Memorization in Language-Model-Fused ASR
W. Ronny Huang, Steve Chien, Om Thakkar, Rajiv Mathews
AVATAR: Unconstrained Audiovisual Speech Recognition
Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid
End-to-End Multi-talker Audio-Visual ASR Using an Active Speaker Attention Module
Richard Rose, Olivier Siohan
Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-person Video
Dmitriy Serdyuk, Otavio Braga, Olivier Siohan
Unsupervised Data Selection via Discrete Speech Representation for ASR
Zhiyun Lu, Yongqiang Wang, Yu Zhang, Wei Han, Zhehuai Chen, Parisa Haghani
Non-parallel Voice Conversion for ASR Augmentation
Gary Wang, Andrew Rosenberg, Bhuvana Ramabhadran, Fadi Biadsy, Jesse Emond, Yinghui Huang, Pedro J. Moreno
Ultra-Low-Bitrate Speech Coding with Pre-trained Transformers
Ali Siahkoohi, Michael Chinen, Tom Denton, W. Bastiaan Kleijn, Jan Skoglund
Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification
Chao Zhang, Bo Li, Tara Sainath, Trevor Strohman, Sepand Mavandadi, Shuo-Yiin Chang, Parisa Haghani
Improving Deliberation by Text-Only and Semi-supervised Training
Ke Hu, Tara N. Sainath, Yanzhang He, Rohit Prabhavalkar, Trevor Strohman, Sepand Mavandadi, Weiran Wang
E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR
W. Ronny Huang, Shuo-yiin Chang, David Rybach, Rohit Prabhavalkar, Tara N. Sainath, Cyril Allauzen, Cal Peyser, Zhiyun Lu
CycleGAN-Based Unpaired Speech Dereverberation
Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan van Esch, Vera Axelrod, Simran Khanuja, Jonathan Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, Melvin Johnson
TRILLsson: Distilled Universal Paralinguistic Speech Representations (see blog post)
Joel Shor, Subhashini Venugopalan
Learning Neural Audio Features Without Supervision
Sarthak Yadav, Neil Zeghidour
SpeechPainter: Text-Conditioned Speech Inpainting
Zalan Borsos, Matthew Sharifi, Marco Tagliasacchi
SpecGrad: Diffusion Probabilistic Model-Based Neural Vocoder with Adaptive Noise Spectral Shaping
Yuma Koizumi, Heiga Zen, Kohei Yatabe, Nanxin Chen, Michiel Bacchiani
Distance-Based Sound Separation
Katharine Patterson, Kevin Wilson, Scott Wisdom, John R. Hershey
Analysis of Self-Attention Head Diversity for Conformer-Based Automatic Speech Recognition
Kartik Audhkhasi, Yinghui Huang, Bhuvana Ramabhadran, Pedro J. Moreno
Improving Rare Word Recognition with LM-Aware MWER Training
Weiran Wang, Tongzhou Chen, Tara Sainath, Ehsan Variani, Rohit Prabhavalkar, W. Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser, Trevor Strohman, Yanzhang He, David Rybach
MAESTRO: Matched Speech Text Representations Through Modality Matching
Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro J. Moreno, Ankur Bapna, Heiga Zen
Pseudo Label is Better Than Human Label
Dongseong Hwang, Khe Chai Sim, Zhouyuan Huo, Trevor Strohman
On the Optimal Interpolation Weights for Hybrid Autoregressive Transducer Model
Ehsan Variani, Michael Riley, David Rybach, Cyril Allauzen, Tongzhou Chen, Bhuvana Ramabhadran
Streaming Align-Refine for Non-autoregressive Deliberation
Weiran Wang, Ke Hu, Tara Sainath
Federated Pruning: Improving Neural Network Efficiency with Federated Learning
Rongmei Lin*, Yonghui Xiao, Tien-Ju Yang, Ding Zhao, Li Xiong, Giovanni Motta, Françoise Beaufays
A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes
Shaojin Ding, Weiran Wang, Ding Zhao, Tara N Sainath, Yanzhang He, Robert David, Rami Botros, Xin Wang, Rina Panigrahy, Qiao Liang, Dongseong Hwang, Ian McGraw, Rohit Prabhavalkar, Trevor Strohman
4-Bit Conformer with Native Quantization Aware Training for Speech Recognition
Shaojin Ding, Phoenix Meadowlark, Yanzhang He, Lukasz Lew, Shivani Agrawal, Oleg Rybakov
Visually-Aware Acoustic Event Detection Using Heterogeneous Graphs
Amir Shirian, Krishna Somandepalli, Victor Sanchez, Tanaya Guha
A Conformer-Based Waveform-Domain Neural Acoustic Echo Canceller Optimized for ASR Accuracy
Sankaran Panchapagesan, Arun Narayanan, Turaj Zakizadeh Shabestary, Shuai Shao, Nathan Howard, Alex Park, James Walker, Alexander Gruenstein
Reducing Domain Mismatch in Self-Supervised Speech Pre-training
Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang, Nicolás Serrano
On-the-Fly ASR Corrections with Audio Exemplars
Golan Pundak, Tsendsuren Munkhdalai, Khe Chai Sim
A Language Agnostic Multilingual Streaming On-Device ASR System
Bo Li, Tara Sainath, Ruoming Pang*, Shuo-Yiin Chang, Qiumin Xu, Trevor Strohman, Vince Chen, Qiao Liang, Heguang Liu, Yanzhang He, Parisa Haghani, Sameer Bidichandani
XTREME-S: Evaluating Cross-Lingual Speech Representations
Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan van Esch, Vera Axelrod, Simran Khanuja, Jonathan Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, Melvin Johnson
Towards Disentangled Speech Representations
Cal Peyser, Ronny Huang, Andrew Rosenberg, Tara Sainath, Michael Picheny, Kyunghyun Cho
Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition
Shaojin Ding, Rajeev Rikhye, Qiao Liang, Yanzhang He, Quan Wang, Arun Narayanan, Tom O’Malley, Ian McGraw
A Universally-Deployable ASR Frontend for Joint Acoustic Echo Cancellation, Speech Enhancement, and Voice Separation
Tom O’Malley, Arun Narayanan, Quan Wang
Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks
Lev Finkelstein, Heiga Zen, Norman Casagrande, Chun-an Chan, Ye Jia, Tom Kenter, Alex Petelin, Jonathan Shen*, Vincent Wan, Yu Zhang, Yonghui Wu, Robert Clark
A Scalable Model Specialization Framework for Training and Inference Using Submodels and Its Application to Speech Model Personalization
Fadi Biadsy, Youzheng Chen, Xia Zhang, Oleg Rybakov, Andrew Rosenberg, Pedro Moreno
Text-Driven Separation of Arbitrary Sounds
Kevin Kilgour, Beat Gfeller, Qingqing Huang, Aren Jansen, Scott Wisdom, Marco Tagliasacchi
Workshops, Tutorials & Special Sessions
The VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22)
Organizers include: Arsha Nagrani
Self-Supervised Representation Learning for Speech Processing
Organizers include: Tara Sainath
Learning from Weak Labels
Organizers include: Ankit Shah
RNN Transducers for Named Entity Recognition with Constraints on Alignment for Understanding Medical Conversations
Authors: Hagen Soltau, Izhak Shafran, Mingqiu Wang, Laurent El Shafey
Listening with Googlears: Low-Latency Neural Multiframe Beamforming and Equalization for Hearing Aids
Authors: Samuel Yang, Scott Wisdom, Chet Gnegy, Richard F. Lyon, Sagar Savla
Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset
Authors: Michael Chinen, Jan Skoglund, Chandan K. A. Reddy, Alessandro Ragano, Andrew Hines
Incremental Layer-Wise Self-Supervised Learning for Efficient Unsupervised Speech Domain Adaptation On Device
Authors: Zhouyuan Huo, Dongseong Hwang, Khe Chai Sim, Shefali Garg, Ananya Misra, Nikhil Siddhartha, Trevor Strohman, Françoise Beaufays
Trustworthy Speech Processing
Organizers include: Shrikanth Narayanan
*Work done while at Google.
Robust Online Allocation with Dual Mirror Descent
The emergence of digital technologies has transformed decision making across commercial sectors such as airlines, online retailing, and internet advertising. Today, real-time decisions need to be repeatedly made in highly uncertain and rapidly changing environments. Moreover, organizations usually have limited resources, which need to be efficiently allocated across decisions. Such problems are referred to as online allocation problems with resource constraints, and applications abound. Some examples include:
- Bidding with Budget Constraints: Advertisers increasingly purchase ad slots using auction-based marketplaces such as search engines and ad exchanges. A typical advertiser can participate in a large number of auctions in a given month. Because the supply in these marketplaces is uncertain, advertisers set budgets to control their total spend. Therefore, advertisers need to determine how to optimally place bids while limiting total spend and maximizing conversions.
- Dynamic Ad Allocation: Publishers can monetize their websites by signing deals with advertisers guaranteeing a number of impressions or by auctioning off slots in the open market. To make this choice, publishers need to trade off, in real-time, the short-term revenue from selling slots in the open market and the long-term benefits of delivering good quality spots to reservation ads.
- Airline Revenue Management: Planes have a limited number of seats that need to be filled up as much as possible before a flight’s departure. But demand for flights changes over time and airlines would like to sell airline tickets to the customers who are willing to pay the most. Thus, airlines have increasingly adopted sophisticated automated systems to manage the pricing and availability of airline tickets.
- Personalized Retailing with Limited Inventories: Online retailers can use real-time data to personalize their offerings to customers who visit their store. Because product inventory is limited and cannot be easily replenished, retailers need to dynamically decide which products to offer and at what price to maximize their revenue while satisfying their inventory constraints.
The common feature of these problems is the presence of resource constraints (budgets, contractual obligations, seats, or inventory, respectively in the examples above) and the need to make dynamic decisions in environments with uncertainty. Resource constraints are challenging because they link decisions across time — e.g., in the bidding problem, bidding too high early can leave advertisers with no budget, and thus missed opportunities later. Conversely, bidding too conservatively can result in a low number of conversions or clicks.
Two central resource allocation problems faced by advertisers and publishers in internet advertising markets.
In this post, we discuss state-of-the-art algorithms that can help maximize goals in dynamic, resource-constrained environments. In particular, we have recently developed a new class of algorithms for online allocation problems, called dual mirror descent, that are simple, robust, and flexible. Our papers have appeared in Operations Research, ICML’20, and ICML’21, and we have ongoing work to continue progress in this space. Compared to existing approaches, dual mirror descent is faster as it does not require solving auxiliary optimization problems, is more flexible because it can handle many applications across different sectors with minimal modifications, and is more robust as it enjoys remarkable performance under different environments.
Online Allocation Problems
In an online allocation problem, a decision maker has a limited amount of total resources (B) and receives a certain number of requests over time (T). At any point in time (t), the decision maker receives a reward function (f_t) and resource consumption function (b_t), and takes an action (x_t). The reward and resource consumption functions change over time and the objective is to maximize the total reward within the resource constraints. If all the requests were known in advance, then an optimal allocation could be obtained by solving an offline optimization problem for how to maximize the reward function over time within the resource constraints.¹
The optimal offline allocation cannot be implemented in practice because it requires knowing future requests. However, this is still useful for framing the goal of online allocation problems: to design an algorithm whose performance is as close to optimal as possible without knowing future requests.
Achieving the Best of Many Worlds with Dual Mirror Descent
A simple, yet powerful idea to handle resource constraints is introducing “prices” for the resources, which enables accounting for the opportunity cost of consuming resources when making decisions. For example, selling a seat on a plane today means it can’t be sold tomorrow. These prices are useful as an internal accounting system of the algorithm. They serve the purpose of coordinating decisions at different moments in time and allow decomposing a complex problem with resource constraints into simpler subproblems: one per time period with no resource constraints. For example, in a bidding problem, the prices capture an advertiser’s opportunity cost of consuming one unit of budget and allow the advertiser to handle each auction as an independent bidding problem.
This reframes the online allocation problem as a problem of pricing resources to enable optimal decision making. The key innovation of our algorithm is using machine learning to predict optimal prices in an online fashion: we choose prices dynamically using mirror descent, a popular optimization algorithm for training machine learning predictive models. Because prices for resources are referred to as “dual variables” in the field of optimization, we call the resulting algorithm dual mirror descent.
The algorithm works sequentially by assuming uniform resource consumption over time is optimal and updating the dual variables after each action. It starts at a moment in time (t) by taking an action (x_t) that maximizes the reward minus the opportunity cost of consuming resources (shown in the top gray box below). The action (e.g., how much to bid or which ad to show) is implemented if there are enough resources available. Then, the algorithm computes the error in the resource consumption (g_t), which is the difference between uniform consumption over time and the actual resource consumption (below in the third gray box). A new dual variable for the next time period is computed using mirror descent based on the error, which then informs the next action. Mirror descent seeks to make the error as close as possible to zero, improving the accuracy of its estimate of the dual variable, so that resources are consumed uniformly over time. While the assumption of uniform resource consumption may be surprising, it helps avoid missing good opportunities and often aligns with commercial goals, so it is effective in practice. Mirror descent also allows a variety of update rules; more details are in the paper.
An overview of the dual mirror descent algorithm.
By design, dual mirror descent has a self-correcting feature that prevents depleting resources too early or waiting too long to consume resources and missing good opportunities. When a request consumes more or less resources than the target, the corresponding dual variable is increased or decreased. When resources are then priced higher or lower, future actions are chosen to consume resources more conservatively or aggressively.
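The loop described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes a single resource, a finite list of candidate actions per request, an always-available "do nothing" action, and a Euclidean mirror map (so the update reduces to a projected subgradient step). The names `dual_mirror_descent`, `Option`, and `step_size` are hypothetical.

```python
from typing import List, Sequence, Tuple

# One candidate action for a request: (reward, resource consumption).
Option = Tuple[float, float]


def dual_mirror_descent(
    requests: Sequence[Sequence[Option]],
    budget: float,
    step_size: float = 0.1,
) -> Tuple[float, float, List[float]]:
    """Single-resource sketch for a non-empty request stream.

    Returns (total reward, total consumption, price used at each step).
    """
    target_rate = budget / len(requests)  # uniform consumption per period
    price = 0.0                           # dual variable (price of the resource)
    remaining = budget
    total_reward = 0.0
    prices: List[float] = []

    for options in requests:
        prices.append(price)
        # Pick the action with the best reward net of the priced resource cost.
        reward, cost = max(options, key=lambda o: o[0] - price * o[1])
        # Assumption: "do nothing" (zero reward, zero cost) is always allowed,
        # and it is used when the best action is unprofitable or unaffordable.
        if reward - price * cost < 0 or cost > remaining:
            reward, cost = 0.0, 0.0
        total_reward += reward
        remaining -= cost
        # Error between uniform consumption and actual consumption.
        error = target_rate - cost
        # Euclidean mirror descent step, projected so the price stays non-negative.
        price = max(0.0, price - step_size * error)

    return total_reward, budget - remaining, prices
```

With ten identical requests of reward 1 and cost 1 and a budget of 5, the price rises while the algorithm overspends relative to the target rate and falls again once it stops spending, which is the self-correcting behavior described above.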
This algorithm is easy to implement, fast, and enjoys remarkable performance under different environments. These are some salient features of our algorithm:
- Existing methods require periodically solving large auxiliary optimization problems using past data. In contrast, this algorithm does not need to solve any auxiliary optimization problem and has a very simple rule to update the dual variables, which, in many cases, can be run in linear time complexity. Thus, it is appealing for many real-time applications that require fast decisions.
- There are minimal requirements on the structure of the problem. Such flexibility allows dual mirror descent to handle many applications across different sectors with minimal modifications. Moreover, our algorithms are flexible since they accommodate different objectives, constraints, or regularizers. By incorporating regularizers, decision makers can include important objectives beyond economic efficiency, such as fairness.
- Existing algorithms for online allocation problems are tailored for either adversarial or stochastic input data. Algorithms for adversarial inputs are robust as they make almost no assumptions on the structure of the data but, in turn, obtain performance guarantees that are too pessimistic in practice. On the other hand, algorithms for stochastic inputs enjoy better performance guarantees by exploiting statistical patterns in the data but can perform poorly when the model is misspecified. Dual mirror descent, however, attains performance close to optimal in both stochastic and adversarial input models while being oblivious to the structure of the input model. Compared to existing work on simultaneous approximation algorithms, our method is more general, applies to a wide range of problems, and requires no forecasts. Below is a comparison of our algorithm to other state-of-the-art methods. Results are based on synthetic data for an ad allocation problem.
Performance of dual mirror descent, a training-based method, and an adversarial method relative to the optimal offline solution. Lower values indicate performance closer to the optimal offline allocation. Results are generated using synthetic experiments based on public data for an ad allocation problem.
In this post, we introduced dual mirror descent, an algorithm for online allocation problems that is simple, robust, and flexible. It is particularly notable that after a long line of work in online allocation algorithms, dual mirror descent provides a way to analyze a wider range of algorithms with superior robustness properties compared to previous techniques. Dual mirror descent has a wide range of applications across several commercial sectors and has been used over time at Google to help advertisers capture more value through better algorithmic decision making. We are also exploring further work related to mirror descent and its connections to PI controllers.
We would like to thank our co-authors Haihao Lu and Balu Sivan, and Kshipra Bhawalkar for their exceptional support and contributions. We would also like to thank our collaborators in the ad quality team and market algorithm research.
¹ Formalized in the equation below: ↩
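A plausible reconstruction of the offline problem, written from the definitions in the Online Allocation Problems section, is the following. The symbols B, T, f_t, b_t, and x_t come from the post; the feasible action set \mathcal{X} is an assumed symbol that the post does not name.

```latex
\max_{x_1, \dots, x_T \in \mathcal{X}} \; \sum_{t=1}^{T} f_t(x_t)
\quad \text{subject to} \quad \sum_{t=1}^{T} b_t(x_t) \le B
```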
Discover insights from Zendesk with Amazon Kendra intelligent search
Customer relationship management (CRM) is a critical tool that organizations maintain to manage customer interactions and build business relationships. Zendesk is a CRM tool that makes it easy for customers and businesses to keep in sync. Zendesk captures a wealth of customer data, such as support tickets created and updated by customers and service agents, community discussions, and helpful guides. With such a wealth of complex data, simple keyword searches don’t suffice when it comes to discovering meaningful, accurate customer information.
Now you can use the Amazon Kendra Zendesk connector to index your Zendesk service tickets, help guides, and community posts, and perform intelligent search powered by machine learning (ML). Amazon Kendra smartly and efficiently answers natural language-based queries using advanced natural language processing (NLP) techniques. It can learn effectively from your Zendesk data, extracting meaning and context.
This post shows how to configure the Amazon Kendra Zendesk connector to index your Zendesk domain and take advantage of Amazon Kendra intelligent search. We use an example of an illustrative Zendesk domain to discuss technical topics related to AWS services.
Overview of solution
Amazon Kendra was built for intelligent search using NLP. You can use Amazon Kendra to ask factoid questions, descriptive questions, and perform keyword searches. You can use the Amazon Kendra connector for Zendesk to crawl your Zendesk domain and index service tickets, guides, and community posts to discover answers for your questions faster.
In this post, we show how to use the Amazon Kendra connector for Zendesk to index data from your Zendesk domain for intelligent search.
For this walkthrough, you should have the following prerequisites:
- An AWS account
- Administrator level access to your Zendesk domain
- Privileges to create an Amazon Kendra index, AWS resources, and AWS Identity and Access Management (IAM) roles and policies
- Basic knowledge of AWS services and working knowledge of Zendesk
Configure your Zendesk domain
Your Zendesk domain has a domain owner or administrator, service group administrators, and a customer. Sample service tickets, community posts, and guides have been created for the purpose of this walkthrough. A Zendesk API client with the unique identifier amazon_kendra is registered to create an OAuth token for accessing your Zendesk domain from Amazon Kendra for crawling and indexing. The following screenshot shows the details of the OAuth configuration for the Zendesk API client.
Configure the data source using the Amazon Kendra connector for Zendesk
You can add the Zendesk connector data source to an existing Amazon Kendra index or create a new index. Then complete the following steps to configure the Zendesk connector:
- On the Amazon Kendra console, open the index and choose Data sources in the navigation pane.
- Under Zendesk, choose Add connector.
- Choose Add connector.
- In the Specify data source details section, enter the name and description of your data source and choose Next.
- In the Define access and security section, for Zendesk URL, enter the URL to your Zendesk domain.
- Under Authentication, you can either choose Create to add a new secret using the user OAuth token created for the Zendesk API client, or use an existing AWS Secrets Manager secret that has the user OAuth token for the Zendesk domain that you want the connector to access.
- Optionally, configure a new AWS secret for Zendesk API access.
- For IAM role, you can choose Create a new role or choose an existing IAM role configured with appropriate IAM policies to access the Secrets Manager secret, Amazon Kendra index, and data source.
- Choose Next.
- In the Configure sync settings section, provide information regarding your sync scope and run schedule.
- Choose Next.
- In the Set field mappings section, you can optionally configure the field mappings, or how the Zendesk field names are mapped to Amazon Kendra attributes or facets.
- Choose Next.
- Review your settings and confirm to add the data source.
- After the data source is created, select the data source and choose Sync Now.
- Choose Facet definition in the navigation pane.
- Select the check box in the column for the facet_category.
Run queries with the Amazon Kendra search console
Now that the data is synced, we can run a few search queries on the Amazon Kendra search console by navigating to the Search indexed content page.
For the first query, we ask Amazon Kendra a general question related to AWS service durability. The following screenshot shows the response. The suggested answer provides the correct answer to the query by applying natural language comprehension.
For our second query, let’s query Amazon Kendra to search for product issues from Zendesk service tickets. The following screenshot shows the response for the search, along with facets showing various categories of documents included in the result.
Notice the search result includes the URL to the source document as well. Choosing the URL takes us directly to the Zendesk service ticket page, as shown in the following screenshot.
Clean up
To avoid incurring future charges, clean up any resources created as part of this solution. Delete the Zendesk connector data source so any data indexed from the source is removed from the index. If you created a new Amazon Kendra index, delete the index as well.
In this post, we discussed how to configure the Amazon Kendra connector for Zendesk to crawl and index service tickets, community posts, and help guides. We showed how Amazon Kendra ML-based search enables your business leaders and agents to discover insights from your Zendesk content quicker and respond to customer needs faster.
To learn more about the Amazon Kendra connector for Zendesk, refer to the Amazon Kendra Developer Guide.
About the author
Rajesh Kumar Ravi is a Senior AI Services Solution Architect at Amazon Web Services specializing in intelligent document search with Amazon Kendra. He is a builder and problem solver, and contributes to the development of new ideas. He enjoys walking and loves to go on short hiking trips outside of work.
Amazon SageMaker Automatic Model Tuning now provides up to three times faster hyperparameter tuning with Hyperband
Amazon SageMaker Automatic Model Tuning introduces Hyperband, a multi-fidelity technique to tune hyperparameters as a faster and more efficient way to find an optimal model. In this post, we show how automatic model tuning with Hyperband can provide faster hyperparameter tuning—up to three times as fast.
The benefits of Hyperband
Hyperband presents two advantages over existing black-box tuning strategies: efficient resource utilization and a better time-to-convergence.
Machine learning (ML) models are increasingly training-intensive, involve complex models and large datasets, and require a lot of effort and resources to find the optimal hyperparameters. Traditional black-box search strategies, such as Bayesian, random search, or grid search, tend to scale linearly with the complexity of the ML problem at hand, requiring longer training time.
To speed up hyperparameter tuning and optimize training cost, Hyperband uses the Asynchronous Successive Halving Algorithm (ASHA), a strategy that massively parallelizes hyperparameter tuning and automatically stops training jobs early by using previously evaluated configurations to predict whether a specific candidate is promising or not.
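To make the early-stopping idea concrete, here is a small sketch of plain (synchronous) successive halving, the building block that ASHA runs asynchronously. It is not the SageMaker implementation: the function names, the reduction factor of 3, and the toy scoring function are all assumptions chosen for illustration.

```python
from typing import Callable, Dict, List

Config = Dict[str, float]


def successive_halving(
    configs: List[Config],
    evaluate: Callable[[Config, int], float],
    min_resource: int = 1,
    max_resource: int = 27,
    reduction_factor: int = 3,
) -> Config:
    """Return the best surviving configuration (higher score is better)."""
    survivors = list(configs)
    resource = min_resource
    while len(survivors) > 1 and resource <= max_resource:
        # Train every surviving configuration with the current budget.
        ranked = sorted(survivors, key=lambda c: evaluate(c, resource), reverse=True)
        # Keep the top fraction and stop the remaining candidates early.
        keep = max(1, len(ranked) // reduction_factor)
        survivors = ranked[:keep]
        # Survivors earn a larger budget in the next round.
        resource *= reduction_factor
    return survivors[0]


def toy_score(config: Config, epochs: int) -> float:
    """Hypothetical stand-in for a training run: best near lr = 0.01."""
    quality = 1.0 - abs(config["lr"] - 0.01)
    return quality * (1.0 - 1.0 / (epochs + 1))
```

Most of the compute goes to the few candidates that survive the cheap early rounds, which is why a multi-fidelity strategy can reach a target metric sooner than strategies that train every candidate to completion.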
As we demonstrate in this post, Hyperband converges to the optimal objective metric faster than most black-box strategies and therefore saves training time. This allows you to tune larger-scale models where evaluating each hyperparameter configuration requires running an expensive training loop, such as in computer vision and natural language processing (NLP) applications. If you’re interested in finding the most accurate models, Hyperband also allows you to run your tuning jobs with more resources and converge to a better solution.
Hyperband with SageMaker
The new Hyperband approach for hyperparameter tuning introduces a few new data elements that you set through AWS API calls. Implementation via the AWS Management Console is not available at this time. Let’s look at some data elements introduced for Hyperband:
- Strategy – This defines the hyperparameter approach you want to choose. A new value for Hyperband is introduced with this change. Valid values are Bayesian, random, and Hyperband.
- MinResource – Defines the minimum number of epochs or iterations to be used for a training job before a decision is made to stop the training.
- MaxResource – Specifies the maximum number of epochs or iterations to be used for a training job to achieve the objective. This parameter is not required if the number of training epochs is defined as a hyperparameter in the tuning job.
The following sample code shows the tuning job config and training job definition:
The preceding code defines strategy as Hyperband and also defines the lower and upper bound resource limits inside the strategy configuration using HyperbandStrategyConfig, which serves as a lever to control the training runtime. For more details on how to configure and run automatic model tuning, refer to Specify the Hyperparameter Tuning Job Settings.
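As a rough sketch of how these elements fit together, the helper below builds a tuning job config dictionary of the kind passed to the CreateHyperParameterTuningJob API. The field names reflect my understanding of that API and should be checked against the current reference. The helper name, the metric name, and the resource limits are hypothetical, and the parameter ranges and training job definition are deliberately left out.

```python
from typing import Any, Dict


def hyperband_tuning_config(
    metric_name: str,
    min_resource: int,
    max_resource: int,
    max_jobs: int,
    max_parallel_jobs: int,
) -> Dict[str, Any]:
    """Build a partial tuning job config that selects the Hyperband strategy."""
    # Local sanity check only; the service applies its own validation.
    if not 1 <= min_resource <= max_resource:
        raise ValueError("expected 1 <= min_resource <= max_resource")
    return {
        "Strategy": "Hyperband",
        "StrategyConfig": {
            "HyperbandStrategyConfig": {
                # Epochs or iterations before a job may be stopped early.
                "MinResource": min_resource,
                # Upper bound on epochs or iterations for any one job.
                "MaxResource": max_resource,
            }
        },
        "HyperParameterTuningJobObjective": {
            "Type": "Maximize",
            "MetricName": metric_name,
        },
        "ResourceLimits": {
            "MaxNumberOfTrainingJobs": max_jobs,
            "MaxParallelTrainingJobs": max_parallel_jobs,
        },
    }


# Hypothetical values for illustration.
config = hyperband_tuning_config("validation:accuracy", 1, 10, 20, 5)
```

Keeping the strategy settings in one small function makes it easy to compare runs that differ only in MinResource and MaxResource.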
Hyperband compared to black-box search
In this section, we perform two experiments to compare Hyperband to a black-box search.
First experiment
In the first experiment, given a binary classification task, we aim to optimize a three-layer fully-connected network on a synthetic dataset, which contains 5000 data points. The hyperparameters for the network include the number of units per layer, the learning rate of the Adam optimizer, the L2 regularization parameter, and the batch size. The ranges of these parameters are:
- Number of units per layer – 10 to 1e3
- Learning rate – 1e-4 to 0.1
- L2 regularization parameter – 1e-6 to 2
- Batch size – 10 to 200
The following graph shows our findings.
Comparing Hyperband with other strategies, given a target accuracy of 96%, Hyperband can achieve this target in 357 seconds, while Bayesian needs 560 seconds and random search needs 614 seconds. Hyperband consistently finds a more optimal solution at any given wall-clock time budget. This shows a clear advantage of a multi-fidelity optimization algorithm.
Second experiment
In this second experiment, we consider the CIFAR-10 dataset and train ResNet-20, a popular network architecture for computer vision tasks. We run all experiments on ml.g4dn.xlarge instances and optimize the neural network through SGD. We also apply standard image augmentation including random flip and random crop. The search space contains the following parameters:
- Mini-batch size: an integer from 4 to 512
- Learning rate of SGD: a float from 1e-6 to 1e-1
- Momentum of SGD: a float from 0.01 to 0.99
- Weight decay: a float from 1e-5 to 1
The following graph illustrates our findings.
Given a target validation accuracy of 0.87, Hyperband can reach it in fewer than 2000 seconds, while Random and Bayesian require 10000 and almost 9000 seconds, respectively. This amounts to a speed-up factor of ~5x and ~4.5x for Hyperband on this task compared to the Random and Bayesian strategies. This shows a clear advantage for the multi-fidelity optimization algorithm, which significantly reduces the wall-clock time it takes to tune your deep learning models.
SageMaker Automatic Model Tuning allows you to reduce the time to tune a model by automatically searching for the best hyperparameter configuration within the ranges that you specify. You can find the best version of your model by running training jobs on your dataset with several search strategies, such as black-box or multi-fidelity strategies.
In this post, we discussed how you can now use a multi-fidelity strategy called Hyperband in SageMaker to find the best model. The support for Hyperband makes it possible for SageMaker Automatic Model Tuning to tune larger-scale models where evaluating each hyperparameter configuration requires running an expensive training loop, such as in computer vision and NLP applications.
Finally, we saw how Hyperband further optimizes runtime compared to black-box strategies with early stopping by using previously evaluated configurations to predict whether a specific candidate is promising and, if not, stop the evaluation to reduce the overall time and compute cost. Using Hyperband in SageMaker also allows you to specify the minimum and maximum resource in the HyperbandStrategyConfig parameter for further runtime controls.
To learn more, visit Perform Automatic Model Tuning with SageMaker.
About the authors
Doug Mbaya is a Senior Partner Solution architect with a focus in data and analytics. Doug works closely with AWS partners, helping them integrate data and analytics solutions in the cloud.
Gopi Mudiyala is a Senior Technical Account Manager at AWS. He helps customers in the Financial Services industry with their operations in AWS. As a machine learning specialist, Gopi works to help customers succeed in their ML journey.
Xingchen Ma is an Applied Scientist at AWS. He works in the team owning the service for SageMaker Automatic Model Tuning.
Read webpages and highlight content using Amazon Polly
In this post, we demonstrate how to use Amazon Polly—a leading cloud service that converts text into lifelike speech—to read the content of a webpage and highlight the content as it’s being read. Adding audio playback to a webpage improves the accessibility and visitor experience of the page. Audio-enhanced content is more impactful and memorable, draws more traffic to the page, and taps into the spending power of visitors. It also improves the brand of the company or organization that publishes the page. Text-to-speech technology makes these business benefits attainable. We accelerate that journey by demonstrating how to achieve this goal using Amazon Polly.
This capability improves accessibility for visitors with disabilities, and could be adopted as part of your organization’s accessibility strategy. Just as importantly, it enhances the page experience for visitors without disabilities. Both groups have significant spending power and spend more freely on pages that use audio enhancement to grab their attention.
Overview of solution
(PRTP)—as we refer to the solution—allows a webpage publisher to drop an audio control onto their webpage. When the visitor chooses Play on the control, the control reads the page and highlights the content. PRTP uses the general capability of Amazon Polly to synthesize speech from text. It invokes Amazon Polly to generate two artifacts for each page:
- The audio content in a format playable by the browser: MP3
- A speech marks file that indicates for each sentence of text:
- The time during playback that the sentence is read
- The location on the page the sentence appears
When the visitor chooses Play, the browser plays the MP3 file. As the audio is read, the browser checks the time, finds in the marks file which sentence to read at that time, locates it on the page, and highlights it.
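The lookup step can be illustrated with a short sketch. The solution itself does this in the browser with its own JavaScript library; the Python below only shows the idea. It assumes the marks file is JSON lines with `time` (milliseconds), `type`, and `value` fields, which is how I understand Amazon Polly speech marks to be formatted, and the function names are hypothetical.

```python
import bisect
import json
from typing import List, Optional, Tuple


def load_marks(marks_text: str, mark_type: str = "ssml") -> List[Tuple[int, str]]:
    """Parse a JSON-lines marks file into sorted (time_ms, value) pairs."""
    marks = []
    for line in marks_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # Keep only the mark type that carries the page location.
        if record.get("type") == mark_type:
            marks.append((int(record["time"]), record["value"]))
    marks.sort(key=lambda mark: mark[0])
    return marks


def mark_at(marks: List[Tuple[int, str]], playback_ms: int) -> Optional[str]:
    """Return the value of the latest mark at or before the playback time."""
    times = [time_ms for time_ms, _ in marks]
    index = bisect.bisect_right(times, playback_ms) - 1
    return marks[index][1] if index >= 0 else None
```

A player would call `mark_at` on each time update of the audio element and highlight the element that the returned value points to.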
PRTP allows the visitor to read in different voices and languages. Each voice requires its own pair of files. PRTP uses neural voices. For a list of supported neural voices and languages, see Neural Voices. For a full list of standard and neural voices in Amazon Polly, see Voices in Amazon Polly.
We consider two types of webpages: static and dynamic pages. In a static page, the content is contained within the page and changes only when a new version of the page is published. The company might update the page daily or weekly as part of its web build process. For this type of page, it’s possible to pre-generate the audio files at build time and place them on the web server for playback. As the following figure shows, the script PRTP Pre-Gen invokes Amazon Polly to generate the audio. It takes as input the HTML page itself and, optionally, a configuration file that specifies which text from the page to extract (Text Extract Config). If the extract config is omitted, the pre-gen script makes a sensible choice of text to extract from the body of the page. Amazon Polly outputs the files in an Amazon Simple Storage Service (Amazon S3) bucket; the script copies them to your web server. When the visitor plays the audio, the browser downloads the MP3 directly from the web server. For highlights, a drop-in library, PRTP.js, uses the marks file to highlight the text being read.
The content of a dynamic page changes in response to the visitor interaction, so audio can’t be pre-generated but must be synthesized dynamically. As the following figure shows, when the visitor plays the audio, the page uses PRTP.js to generate the audio in Amazon Polly, and it highlights the synthesized audio using the same approach as with static pages. To access AWS services from the browser, the visitor requires an AWS identity. We show how to use an Amazon Cognito identity pool to allow the visitor just enough access to Amazon Polly and the S3 bucket to render the audio.
Generating both MP3 audio and speech marks requires the Amazon Polly service to synthesize the same input twice. Refer to the Amazon Polly Pricing Page to understand cost implications. Pre-generation saves costs because synthesis is performed at build time rather than on demand for each visitor interaction.
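The two passes can be pictured as two request payloads that share the same input and differ only in output format. The sketch below builds the keyword arguments that, to my understanding, the boto3 Polly client accepts for `start_speech_synthesis_task`. It does not call AWS, and the helper name, key prefixes, and choice of speech mark types are assumptions.

```python
from typing import Any, Dict, Tuple


def synthesis_requests(
    ssml: str, voice_id: str, bucket: str, key_prefix: str
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (audio request, speech marks request) for the same SSML input."""
    common = {
        "Engine": "neural",
        "Text": ssml,
        "TextType": "ssml",
        "VoiceId": voice_id,
        "OutputS3BucketName": bucket,
    }
    # Pass 1: audio the browser can play.
    audio = dict(common, OutputFormat="mp3", OutputS3KeyPrefix=f"{key_prefix}.audio")
    # Pass 2: speech marks (timing metadata) for the same text.
    marks = dict(
        common,
        OutputFormat="json",
        SpeechMarkTypes=["sentence", "ssml"],
        OutputS3KeyPrefix=f"{key_prefix}.marks",
    )
    return audio, marks


# Hypothetical values for illustration.
audio_request, marks_request = synthesis_requests(
    "<speak>Hello</speak>", "Joanna", "example-bucket", "page1"
)
```

Because both requests carry the same text, each page and voice is billed for two syntheses, which is the cost consideration noted above.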
The code accompanying this post is available as an open-source repository on GitHub.
To explore the solution, we follow these steps:
- Set up the resources, including the pre-gen build server, S3 bucket, web server, and Amazon Cognito identity.
- Run the static pre-gen build and test static pages.
- Test dynamic pages.
To run this example, you need an AWS account with permission to use Amazon Polly, Amazon S3, Amazon Cognito, and (for demo purposes) AWS Cloud9.
Provision resources
We share an AWS CloudFormation template to create in your account a self-contained demo environment to help you follow along with the post. If you prefer to set up PRTP in your own environment, refer to instructions in README.md.
To provision the demo environment using CloudFormation, first download a copy of the CloudFormation template. Then complete the following steps:
- On the AWS CloudFormation console, choose Create stack.
- Choose With new resources (standard).
- Select Upload a template file.
- Choose Choose file to upload the local copy of the template that you downloaded.
- Choose Next.
- Enter a stack name of your choosing. Later you enter this again as a replacement for <StackName>.
- You may keep default values in the Parameters section.
- Choose Next.
- Continue through the remaining sections.
- Read and select the check boxes in the Capabilities section.
- Choose Create stack.
- When the stack is complete, find the value for
in the stack outputs.
We encourage you to review the stack with your security team prior to using it in a production environment.
Set up the web server and pre-gen server in an AWS Cloud9 IDE
Next, on the AWS Cloud9 console, locate the environment PRTPDemoCloud9 created by the CloudFormation stack. Choose Open IDE to open the AWS Cloud9 environment. Open a terminal window and run the following commands, which clone the PRTP code, set up pre-gen dependencies, and start a web server to test with:
For <StackName>, use the name you gave the CloudFormation stack. For <IngressCIDR>, specify a range of IP addresses allowed to access the web server. To restrict access to the browser on your local machine, find your IP address using https://whatismyipaddress.com/ and append /32 to specify the range. The server listens on port 8080. The public IP address on which the server listens is given in the output.
Test static pages
In your browser, navigate to PRTPStaticDefault.html. (If you’re using the demo, the URL is http://<cloud9host>:8080/web/PRTPStaticDefault.html, where <cloud9host> is the public IP address that you discovered in setting up the IDE.) Choose Play on the audio control at the top. Listen to the audio and watch the highlights. Explore the control by changing speeds, changing voices, pausing, fast-forwarding, and rewinding. The following screenshot shows the page; the text “Skips hidden paragraph” is highlighted because it is currently being read.
Try the same for PRTPStaticConfig.html and PRTPStaticCustom.html. The results are similar. For example, all three read the alt text for the photo of the cat (“Random picture of a cat”). All three read NE, NW, SE, and SW as full words (“northeast,” “northwest,” “southeast,” “southwest”), taking advantage of Amazon Polly lexicons.
Notice the main differences in audio:
- reads all the text in the body of the page, including the wrapup portion at the bottom with “Your thoughts in one word,” “Submit Query,” “Last updated April 1, 2020,” and “Questions for the dev team.” PRTPStaticConfig.html don’t read these because they explicitly exclude the wrapup from speech synthesis.
- reads the QB Best Sellers table differently from the others. It reads the first three rows only, and reads the row number for each row. It repeats the columns for each row. PRTPStaticCustom.html uses a custom transformation to tailor the readout of the table. The other pages use default table rendering.
- reads “Tom Brady” at a louder volume than the rest of the text. It uses the speech synthesis markup language (SSML) prosody tag to tailor the reading of Tom Brady. The other pages don’t tailor in this way.
- , thanks to a custom transformation, reads the main tiles in NW, SW, NE, SE order; that is, it reads “Today’s Articles,” “Quote of the Day,” “Photo of the Day,” “Jokes of the Day.” The other pages read the tiles in the natural NW, NE, SW, SE order in which they appear in the HTML: “Today’s Articles,” “Photo of the Day,” “Quote of the Day,” “Jokes of the Day.”
Let’s dig deeper into how the audio is generated, and how the page highlights the text.
Static pre-generator
Our GitHub repo includes pre-generated audio files for the PRTPStatic pages. If you want to generate them yourself, run the generation scripts (gen_default.sh, gen_config.sh, and gen_custom.sh) from the bash shell in the AWS Cloud9 IDE.
Now let’s look at how those scripts work.
Default case
We begin with gen_default.sh. The script begins by running the Python program FixHTML.py to make the source HTML file PRTPStaticDefault.html well-formed. It writes the well-formed version of the file to example/tmp_wff.html. This step is crucial for two reasons:
- Most source HTML is not well formed. This step repairs it. For example, many HTML pages leave some elements unclosed; this step closes them.
- We keep track of where in the HTML page we find text. We need to track locations using the same document object model (DOM) structure that the browser uses. For example, the browser automatically adds some elements that the source HTML omits. The Python program follows the same well-formed repairs as the browser.
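To give a feel for what such a repair step involves, here is a minimal Python sketch that uses only the standard library. It is not the code in FixHTML.py, and the names WellFormer and make_well_formed are hypothetical. It handles just three cases (unclosed elements, a new paragraph that implicitly ends the open one, and void elements such as br), so it covers far less than a browser's repair rules.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Elements that never have a closing tag in HTML (partial list, for illustration).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class WellFormer(HTMLParser):
    """Hypothetical helper: re-serializes loose HTML as well-formed markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # serialized output fragments
        self.stack = []  # currently open elements

    def handle_starttag(self, tag, attrs):
        # A new <p> implicitly ends a <p> that is still open.
        if tag == "p" and self.stack and self.stack[-1] == "p":
            self._close("p")
        # Valueless attributes such as `hidden` get an explicit value.
        attr_text = "".join(
            f" {name}={quoteattr(value if value is not None else name)}"
            for name, value in attrs
        )
        if tag in VOID:
            self.out.append(f"<{tag}{attr_text}/>")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Ignore stray end tags; otherwise close everything down to `tag`.
        if tag in self.stack:
            self._close(tag)

    def handle_data(self, data):
        self.out.append(escape(data))

    def _close(self, tag):
        while self.stack:
            top = self.stack.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break


def make_well_formed(html):
    parser = WellFormer()
    parser.feed(html)
    parser.close()
    # Close anything the source left open at the end of the document.
    while parser.stack:
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)
```

The important property is that every repair is deterministic, so the positions of elements in the repaired file can later be matched against the browser's DOM.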
The next step takes the well-formed HTML as input, applies an Extensible Stylesheet Language Transformations (XSLT) transform to it, and outputs an SSML file. (SSML is the language in Amazon Polly to control how audio is rendered from text.) In the current example, the input is example/tmp_wff.html. The output is example/tmp.ssml. The transform’s job is to decide what text to extract from the HTML and feed to Amazon Polly. generic.xslt is a sensible default XSLT transform for most webpages. It excludes the audio control, the HTML header, and HTML elements like script and form. It also excludes elements with the hidden attribute. It includes elements that typically contain text, such as P, H1, and SPAN. For these, it renders both a mark, including the full XPath expression of the element, and the value of the element.
The rendered SSML is fed as input to Amazon Polly. In it, for example, the text “Skips hidden paragraph” is to be read in the audio, and it is associated with a mark, which tells us that this text occurs at the location on the page given by the XPath expression /html/body[1]/div[2]/ul[1]/li[1].
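To make the mark-plus-text idea concrete, here is a minimal Python sketch of the same kind of extraction. It is a stand-in for illustration, not the XSLT in generic.xslt. The function name build_ssml and the exact tag sets are assumptions, and for brevity it reads only the text that appears before an element's first child.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Assumed tag sets for this sketch; the real transform may differ.
SKIP_TAGS = {"head", "script", "form", "audio"}  # whole subtrees are skipped
TEXT_TAGS = {"p", "h1", "span", "li", "td"}      # elements whose text is read


def build_ssml(wellformed_html):
    """Return SSML in which each text fragment is preceded by a mark
    whose name is the positional XPath of the source element."""
    root = ET.fromstring(wellformed_html)
    parts = ["<speak>"]

    def walk(elem, xpath):
        if elem.tag in SKIP_TAGS or "hidden" in elem.attrib:
            return
        text = (elem.text or "").strip()
        if elem.tag in TEXT_TAGS and text:
            parts.append(f"<mark name={quoteattr(xpath)}/>{escape(text)}")
        # Number each child among its same-named siblings, as XPath does.
        counts = {}
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{xpath}/{child.tag}[{counts[child.tag]}]")

    walk(root, "/" + root.tag)
    parts.append("</speak>")
    return "\n".join(parts)
```

Because the mark name is the element's XPath, the browser can later resolve each mark back to the exact element to highlight.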
To generate audio in Amazon Polly, we call the script run_polly.sh. It runs the AWS Command Line Interface (AWS CLI) command aws polly start-speech-synthesis-task twice: once to generate MP3 audio, and again to generate the marks file. Because the generation is asynchronous, the script polls until it finds the output in the specified S3 bucket. When it finds the output, it downloads it to the build server and copies the files to the web/polly folder. The following is a listing of the web folder:
- PRTPStaticDefault.html
- PRTPStaticConfig.html
- PRTPStaticCustom.html
- PRTP.js
- polly/PRTPStaticDefault/Joanna.mp3, Joanna.marks, Matthew.mp3, Matthew.marks
- polly/PRTPStaticConfig/Joanna.mp3, Joanna.marks, Matthew.mp3, Matthew.marks
- polly/PRTPStaticCustom/Joanna.mp3, Joanna.marks, Matthew.mp3, Matthew.marks
Each page has its own set of voice-specific MP3 and marks files. These files are the pre-generated files. The page doesn’t need to invoke Amazon Polly at runtime; the files are part of the web build.
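The polling step described above follows a common pattern: check, sleep, and give up after a deadline. The following Python sketch shows that pattern in isolation. It is not the logic of run_polly.sh; the function wait_for_output and its parameters are hypothetical, and the output_exists callback stands in for whatever check confirms that the object has arrived in the S3 bucket.

```python
import time


def wait_for_output(output_exists, interval_s=5.0, timeout_s=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until output_exists() returns True or the timeout expires.

    output_exists is a caller-supplied function (hypothetical here) that
    reports whether the synthesized file is available yet.
    Returns True if the output appeared, False if we timed out.
    """
    deadline = clock() + timeout_s
    while True:
        if output_exists():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Passing sleep and clock as parameters keeps the loop easy to test without real waiting.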
Config-driven case
Next, consider gen_config.sh. The script is similar to the script in the default case. The main difference is that our approach is config-driven. We tailor the content to be extracted from the page by specifying what to extract through configuration, not code. In particular, we use the JSON file transform_config.json, which specifies that the content to be included is the elements with IDs title, main, maintable, and qbtable. The element with ID wrapup should be excluded.
We run the Python program ModGenericXSLT.py to modify generic.xslt, used in the default case, to use the inclusions and exclusions that we specify in transform_config.json. The program writes the results to a temp file (example/tmp.xslt), which it passes to gen_ssml.sh as its XSLT transform.
This is a low-code option. The web publisher doesn’t need to know how to write XSLT. But they do need to understand the structure of the HTML page and the IDs used in its main organizing elements.
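As an illustration of configuration-driven extraction, the following Python sketch selects text by element ID from a well-formed page. It works directly on the parsed page rather than rewriting an XSLT file, so it is not what ModGenericXSLT.py does. The function name extract_by_config is hypothetical, and the JSON keys inclusions and exclusions are an assumption about the layout of the configuration.

```python
import json
import xml.etree.ElementTree as ET


def extract_by_config(wellformed_html, config_json):
    """Collect text from the included sections, in the configured order,
    skipping any subtree whose id is listed under exclusions."""
    config = json.loads(config_json)
    excluded = set(config["exclusions"])
    root = ET.fromstring(wellformed_html)
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}

    def collect(el, out):
        if el.get("id") in excluded:
            return
        if el.text and el.text.strip():
            out.append(el.text.strip())
        for child in el:
            collect(child, out)
            # Text after a child element still belongs to the parent.
            if child.tail and child.tail.strip():
                out.append(child.tail.strip())

    texts = []
    for section_id in config["inclusions"]:
        if section_id in by_id:
            collect(by_id[section_id], texts)
    return texts
```

The order of the inclusions list, not the order of the HTML, determines the reading order, which is what makes the configuration useful to a publisher who does not write code.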
Customization case
Finally, consider gen_custom.sh. This script is nearly identical to the default script, except it uses its own XSLT (example/custom.xslt) rather than the generic XSLT.
If you want to study the code in detail, refer to the scripts and programs in the GitHub repo.
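As one example of the tailoring a custom transform can do, the table readout described earlier (first three rows only, with the row number and the column headers spoken for each row) could be sketched in Python as follows. This is not the XSLT from the repo; the function read_table and the sentence wording are assumptions.

```python
def read_table(headers, rows, max_rows=3):
    """Return one spoken sentence per row, limited to the first max_rows rows.

    Each sentence announces the row number and pairs every cell with its
    column header, so the listener never loses track of the columns.
    """
    sentences = []
    for number, row in enumerate(rows[:max_rows], start=1):
        cells = ", ".join(f"{header}: {value}" for header, value in zip(headers, row))
        sentences.append(f"Row {number}. {cells}.")
    return sentences
```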
Browser setup and highlights
The static pages include an HTML5 audio control, which takes as its audio source the MP3 file generated by Amazon Polly and residing on the web server.
At load time, the page also loads the Amazon Polly-generated marks file. This occurs in the PRTP.js file, which the HTML page includes.
During audio playback, there is an audio timer event handler in PRTP.js that checks the audio’s current time, finds the text to highlight, finds its location on the page, and highlights it. The text to be highlighted is an entry of type sentence in the marks file. The location is the XPath expression in the name attribute of the entry of type SSML that precedes the sentence. For example, if the time is 18400 (times in the marks file are in milliseconds), according to the marks file for PRTPStaticDefault, the sentence to be highlighted is “Skips hidden paragraph,” which starts at 18334. The location is the SSML entry at time 17667: /html/body[1]/div[2]/ul[1]/li[1].
Test dynamic pages
The page PRTPDynamic.html demonstrates dynamic audio readback using default, configuration-driven, and custom audio extraction approaches.
Default case
In your browser, navigate to PRTPDynamic.html. The page has one query parameter, dynOption, which accepts the values default, config, and custom. It defaults to default, so you may omit it in this case. The page has two sections with dynamic content:
- Latest Articles – Changes frequently throughout the day
- Greek Philosophers Search By Date – Allows the visitor to search for Greek philosophers by date and shows the results in a table
Create some content in the Greek Philosopher section by entering a date range of -800 to 0, as shown in the example. Then choose Find.
Now play the audio by choosing Play in the audio control.
Behind the scenes, the page runs code to render and play the audio. First, it calls the function buildSSMLFromDefault in PRTP.js to extract most of the text from the HTML page body. That function walks the DOM tree, looking for text in common elements such as p, h1, pre, span, and td. It ignores text in elements that usually don’t contain text to be read aloud, such as audio, option, and script. It builds SSML markup to be input to Amazon Polly.
The chooseRenderAudio function in PRTP.js begins by initializing the AWS SDK for Amazon Cognito, Amazon S3, and Amazon Polly. This initialization occurs only once. If chooseRenderAudio is invoked again because the content of the page has changed, the initialization is skipped.
It then generates MP3 audio from Amazon Polly. The generation is synchronous for small SSML inputs and asynchronous (with output sent to the S3 bucket) for large SSML inputs (greater than 6,000 characters). In the synchronous case, we ask Amazon Polly to provide the MP3 file using a presigned URL. When the synthesized output is ready, we set the src attribute of the audio control to that URL and load the control. We then request the marks file and load it the same way as in the static case.
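The control flow described above (initialize once, then pick synchronous or asynchronous synthesis by input size) can be sketched as follows. This Python version is only an outline of the decision logic, not the JavaScript in PRTP.js. The class AudioRenderer and its three callbacks are hypothetical stand-ins for the SDK setup and the two synthesis paths.

```python
SYNC_LIMIT_CHARS = 6000  # size threshold quoted in the text above


class AudioRenderer:
    """Hypothetical wrapper: one-time setup plus size-based dispatch."""

    def __init__(self, init_sdk, synth_sync, synth_async):
        self._init_sdk = init_sdk        # sets up credentials and clients
        self._synth_sync = synth_sync    # returns audio directly
        self._synth_async = synth_async  # writes output to a bucket
        self._initialized = False

    def render(self, ssml):
        # The SDK is initialized on the first call only.
        if not self._initialized:
            self._init_sdk()
            self._initialized = True
        # Large inputs go through the asynchronous path.
        if len(ssml) > SYNC_LIMIT_CHARS:
            return self._synth_async(ssml)
        return self._synth_sync(ssml)
```

Keeping the initialization flag inside the renderer means that callers can invoke render every time the page content changes without worrying about repeated setup.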
Config-driven case
In your browser, navigate to PRTPDynamic.html?dynOption=config. Play the audio. The audio playback is similar to the default case, but there are minor differences. In particular, some content is skipped.
Behind the scenes, when using the config option, the page extracts content differently than in the default case. In the default case, the page uses buildSSMLFromDefault. In the config-driven case, the page specifies the sections it wants to include and exclude.
The buildSSMLFromConfig function, defined in PRTP.js, walks the DOM tree in each of the sections whose ID is provided under inclusions. It extracts content from each and combines them, in the order specified, to form an SSML document. It excludes the sections specified under exclusions. It extracts content from each section in the same way buildSSMLFromDefault extracts content from the page body.
Customization case
In your browser, navigate to PRTPDynamic.html?dynOption=custom. Play the audio. There are three noticeable differences. Let’s note these and consider the custom code that runs behind the scenes:
- It reads the main tiles in NW, SW, NE, SE order. The custom code gets each of these cell blocks and adds them to the SSML in NW, SW, NE, SE order.
- “Tom Brady” is spoken loudly. The custom code puts the “Tom Brady” text inside an SSML prosody tag.
- It reads only the first three rows of the quarterback table. It reads the column headers for each row. Check the code in the GitHub repo to discover how this is implemented.
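The first two customizations above can be illustrated with a short Python sketch. It is not the custom code from PRTP.js. The function build_custom_ssml, the tile keys, and the x-loud volume setting are assumptions chosen for illustration; prosody with a volume attribute is a standard SSML construct.

```python
from xml.sax.saxutils import escape

# Reading order used by the custom page, as described in the text.
TILE_ORDER = ["NW", "SW", "NE", "SE"]


def build_custom_ssml(tiles, loud_phrase):
    """Build SSML that reads the tiles in TILE_ORDER and speaks
    loud_phrase at a raised volume wherever it occurs."""
    loud = escape(loud_phrase)
    parts = ["<speak>"]
    for key in TILE_ORDER:
        text = escape(tiles[key])
        # Wrap the phrase in a prosody tag (volume value is an assumption).
        text = text.replace(loud, f'<prosody volume="x-loud">{loud}</prosody>')
        parts.append(f"<p>{text}</p>")
    parts.append("</speak>")
    return "".join(parts)
```

Because the order comes from TILE_ORDER rather than from the HTML, the same page can be read in a different sequence without touching its layout.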
Clean up
To avoid incurring future charges, delete the CloudFormation stack.
In this post, we demonstrated a technical solution to a high-value business problem: how to use Amazon Polly to read the content of a webpage and highlight the content as it’s being read. We showed this using both static and dynamic pages. To extract content from the page, we used DOM traversal and XSLT. To facilitate highlighting, we used the speech marks capability in Amazon Polly.
Learn more about Amazon Polly by visiting its service page.
Feel free to ask questions in the comments.
About the authors
Mike Havey is a Solutions Architect for AWS with over 25 years of experience building enterprise applications. Mike is the author of two books and numerous articles. Visit his Amazon author page to read more.
Vineet Kachhawaha is a Solutions Architect at AWS with expertise in machine learning. He is responsible for helping customers architect scalable, secure, and cost-effective workloads on AWS.