We present an architecture for device-directed speech detection that treats the task as a text-generation problem. We use a multi-modal fusion approach that combines acoustic information from the recorded audio waveform with text and confidence information obtained from an automatic speech recognition (ASR) system. The audio waveform is represented as a sequence of continuous embeddings by an audio encoder and presented as a prefix token to a pretrained large language model (LLM). We demonstrate that using multi-modal information within LLMs yields equal error rate improvements over text-only and audio-only models.
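
As a minimal sketch of the fusion idea described above, and not the paper's actual implementation, the audio encoder's continuous embeddings can be projected into the LLM's embedding space and prepended as a prefix to the token embeddings of the ASR hypothesis (with confidence features rendered alongside the text); the LLM then emits its device-directedness decision as generated text. The class name, dimensions, and prefix length below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioPrefixFusion(nn.Module):
    """Illustrative sketch: project audio-encoder embeddings into the LLM
    embedding space and prepend them to the ASR-text token embeddings."""

    def __init__(self, audio_dim: int, llm_embed_dim: int, prefix_len: int = 8):
        super().__init__()
        # Hypothetical linear projection from audio-encoder space to LLM space.
        self.projector = nn.Linear(audio_dim, llm_embed_dim)
        self.prefix_len = prefix_len

    def forward(self, audio_embeds: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # audio_embeds: (batch, T_audio, audio_dim) from the audio encoder
        # text_embeds:  (batch, T_text, llm_embed_dim) from the LLM's input embedding table
        prefix = self.projector(audio_embeds[:, : self.prefix_len, :])
        # Concatenate the audio prefix with the text embeddings along the sequence
        # axis; the combined sequence is fed to the LLM, which generates the
        # device-directed / not-device-directed decision as text.
        return torch.cat([prefix, text_embeds], dim=1)
```

In this framing the detector needs no separate classification head: the same generation interface that produces ordinary text also produces the decision token, which is what allows a pretrained LLM to be reused with only the lightweight projector trained for the task.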