When a customer asks Alexa to play “Hey Jude”, and Alexa responds, “Playing ‘Hey Jude’ by the Beatles,” that response is generated by a text-to-speech (TTS) system, which converts textual inputs into synthetic-speech outputs…
Using wake word acoustics to filter out background speech improves speech recognition by 15%
One of the ways that we’re always trying to improve Alexa’s performance is by teaching her to ignore speech that isn’t intended for her. At this year’s International Conference on Acoustics, Speech, and Signal Processing, my colleagues and I will present a new technique for doing this, which could complement the techniques that Alexa already uses.
Two new papers discuss how Alexa recognizes sounds
Last year, Amazon announced the beta release of Alexa Guard, a new service that lets customers who are leaving the house instruct their Echo devices to listen for glass breaking or smoke and carbon monoxide alarms going off. At this year’s International Conference on Acoustics, Speech, and Signal Processing, our team is presenting several papers on sound detection. I wrote about one of them a few weeks ago, a new method for doing machine learning with unbalanced data sets.
Signal processor improves Echo’s bass response, loudness, and speech recognition accuracy
Multiband dynamics processing, which separately modifies volume in different frequency bands of an audio signal, is known to improve listeners’ audio experiences. But in the context of voice-controlled systems like the Amazon Echo family of products, it can also improve automatic speech recognition by making echo cancellation easier.
Cross-lingual transfer learning for bootstrapping AI systems reduces new-language data requirements
Transfer learning is the technique of adapting a machine learning model trained on abundant data to a new context in which training data is sparse. On the Alexa team, we’ve explored transfer learning as a way to bootstrap new functions and to add new classification categories to existing machine learning systems.
New speech recognition experiments demonstrate how machine learning can scale
Customer interactions with Alexa are constantly growing more complex, and on the Alexa science team, we strive to stay ahead of the curve by continuously improving Alexa’s speech recognition system. Increasingly, keeping pace with Alexa’s expanding capabilities will require automating the learning process, through techniques such as semi-supervised learning, which leverages a small amount of annotated data to extract information from a much larger store of unannotated data.
Joint training on speech signal isolation and speech recognition improves performance
The idea of using arrays of microphones to improve automatic speech recognition (ASR) is decades old. The acoustic signal generated by a sound source reaches multiple microphones with different time delays. This information can be used to create virtual directivity, emphasizing a sound arriving from a direction of interest and diminishing signals coming from other directions. In voice recognition, one of the more popular methods for doing this is known as “beamforming”.
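To make the time-delay idea concrete, here is a minimal sketch of delay-and-sum beamforming, the simplest textbook form of the technique. It is an illustration only, not the method from the post it accompanies: the function name `delay_and_sum`, the whole-sample delays, and the toy signal are all assumptions chosen for brevity.

```python
def delay_and_sum(channels, delays):
    """Align each microphone channel by its arrival delay, then average.

    channels: list of equal-length sample lists, one per microphone.
    delays:   arrival delay of the target source at each microphone, in
              whole samples (hypothetical values; a real system would derive
              them from the array geometry and the direction of interest).
    """
    if len(channels) != len(delays):
        raise ValueError("need exactly one delay per channel")
    n = len(channels[0])
    out = [0.0] * n
    for signal, d in zip(channels, delays):
        for t in range(n):
            src = t + d  # read ahead by d samples to undo the arrival delay
            if 0 <= src < n:
                out[t] += signal[src]
    # Averaging keeps the aligned source at its original amplitude.
    return [v / len(channels) for v in out]


# Toy example: the same source reaches the second microphone 2 samples late.
source = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0, 0.0, 0.0, 0.0]
mic0 = source[:]
mic1 = [0.0, 0.0] + source[:-2]

steered = delay_and_sum([mic0, mic1], [0, 2])    # delays match the source
unsteered = delay_and_sum([mic0, mic1], [0, 0])  # delays do not match
```

When the delays match the source's direction, the channels add coherently and the source is recovered; when they do not, the copies partly cancel. That difference in gain is the "virtual directivity" described above.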
Audio watermarking algorithm is first to solve “second-screen problem” in real time
Audio watermarking is the process of adding a distinctive sound pattern — undetectable to the human ear — to an audio signal to make it identifiable to a computer. It’s one of the ways that video sites recognize copyrighted recordings that have been posted illegally. To identify a watermark, a computer usually converts a digital file into an audio signal, which it processes internally.
Adversarial training produces synthetic data for machine learning
Sentiment analysis is the attempt, computationally, to determine from someone’s words how he or she feels about something. It has a host of applications, in market research, media analysis, customer service, and product recommendation, among other things. Sentiment classifiers are typically machine learning systems, and any given application of sentiment analysis may suffer from a lack of annotated data for training purposes.
Machine-labeled data + artificial noise = better speech recognition
Although deep neural networks have enabled accurate large-vocabulary speech recognition, training them requires thousands of hours of transcribed data, which is time-consuming and expensive to collect. So Amazon scientists have been investigating techniques that will let Alexa learn with minimal human involvement, techniques that fall in the categories of unsupervised and semi-supervised learning.