Visual and audio events tend to occur together: a musician plucking guitar strings and the resulting melody; a wine glass shattering and the accompanying crash; the roar of a motorcycle as it accelerates. These visual and audio stimuli are concurrent because they share a common cause. Understanding the relationship between visual events and their associated sounds is a fundamental way that we make sense of the world around us.

In Look, Listen, and Learn and Objects that Sound (to appear at ECCV 2018), we explore this observation by asking: what can be learnt by looking at and listening to a large number of unlabelled videos? By constructing an audio-visual correspondence learning task that enables visual and audio networks to be jointly trained from scratch, we demonstrate that:

- the networks are able to learn useful semantic concepts;
- the two modalities can be used to search one another (e.g. to answer the question, "Which sound fits well with this image?"); and
- the object making the sound can be localised.

Limitations of previous cross-modal learning approaches

Learning from multiple modalities is not new; historically, researchers have largely focused on image-text or audio-vision pairings.
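The audio-visual correspondence task mentioned above can be sketched at the data level: sample an image frame and a short audio clip, label the pair positive when both come from the same video and negative when the audio comes from a different video, then train the visual and audio networks jointly on this binary classification. The toy data layout and function name below are illustrative assumptions, not the papers' exact pipeline.

```python
import random

def make_avc_pairs(videos, n_pairs, seed=0):
    """Build (frame, audio, label) pairs for audio-visual correspondence.

    `videos` maps a video id to (frames, audio_clips); this layout is an
    assumption made for illustration.
    """
    rng = random.Random(seed)
    ids = list(videos)
    pairs = []
    for _ in range(n_pairs):
        vid = rng.choice(ids)
        frames, clips = videos[vid]
        frame = rng.choice(frames)
        if rng.random() < 0.5:
            # Positive pair: audio drawn from the same video,
            # so the frame and the sound genuinely co-occur.
            audio, label = rng.choice(clips), 1
        else:
            # Negative pair: audio drawn from a different video.
            other = rng.choice([v for v in ids if v != vid])
            audio, label = rng.choice(videos[other][1]), 0
        pairs.append((frame, audio, label))
    return pairs

# Toy example: two "videos" with symbolic frames and audio clips.
videos = {
    "guitar": (["frame_g1", "frame_g2"], ["audio_g1", "audio_g2"]),
    "motorbike": (["frame_m1"], ["audio_m1"]),
}
pairs = make_avc_pairs(videos, n_pairs=4)
```

Because no labels are required beyond "same video or not", this supervision signal comes for free from unlabelled footage, which is what lets both networks be trained from scratch.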