Flexible Keyword Spotting based on Homogeneous Audio-Text Embedding

Spotting user-defined flexible keywords in real time is challenging because
the keyword is represented in text. In this work, we propose a novel architecture
to efficiently detect flexible keywords based on the following ideas. We construct the representative acoustic embedding of a keyword using grapheme-to-phone conversion. The phone-to-embedding conversion is done by looking up an embedding dictionary, which is built by averaging the corresponding embeddings (from the audio encoder) of each phone during training. The key benefit of our approach is that both text embedding and audio…

Apple Machine Learning Research
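
The dictionary lookup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy lexicon standing in for a grapheme-to-phone model, and the two-dimensional vectors are all assumptions made for illustration. It shows only the averaging step (one mean embedding per phone) and the lookup that turns a keyword's phone sequence into a sequence of embeddings.

```python
from collections import defaultdict


def build_phone_table(labeled_embeddings):
    """Average the audio-encoder embeddings observed for each phone.

    `labeled_embeddings` is an iterable of (phone, vector) pairs; how the
    phone labels are aligned to encoder outputs is assumed to be given.
    """
    sums = {}
    counts = defaultdict(int)
    for phone, vec in labeled_embeddings:
        if phone not in sums:
            sums[phone] = [0.0] * len(vec)
        sums[phone] = [s + v for s, v in zip(sums[phone], vec)]
        counts[phone] += 1
    return {p: [s / counts[p] for s in sums[p]] for p in sums}


# Hypothetical toy lexicon standing in for a real grapheme-to-phone model.
TOY_G2P = {"hi": ["HH", "AY"]}


def keyword_to_embeddings(keyword, g2p, table):
    """Convert a text keyword to phones, then look up each phone's embedding."""
    phones = g2p[keyword.lower()]
    return [table[p] for p in phones]


# Made-up training observations: (phone label, audio-encoder embedding).
training = [("HH", [1.0, 0.0]), ("HH", [3.0, 2.0]), ("AY", [0.0, 4.0])]
table = build_phone_table(training)
keyword_embedding = keyword_to_embeddings("hi", TOY_G2P, table)
```

Because the table entries are averages of audio-encoder outputs, the keyword's embedding sequence lies in the same space as the embeddings of incoming audio, which is what makes a direct comparison between the two possible.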