Abstract
A variety of computational models can learn meanings of words and sentences from exposure to word sequences coupled with the perceptual context in which they occur. More recently, neural network models have been applied to more naturalistic and more challenging versions of this problem: for example, phoneme sequences, or raw speech audio accompanied by correlated visual features. In this work we introduce a multi-layer recurrent neural network model which is trained to project spoken sentences and their corresponding visual scene features into a shared semantic space. We then investigate to what extent representations of linguistic structures such as discrete words emerge in this model, and where within the network architecture they are localized. Our ultimate goal is to trace how auditory signals are progressively refined into meaning representations, and how this process is learned from grounded speech data.
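The abstract does not give implementation details, so the following is only a minimal sketch of the kind of architecture it describes: a multi-layer recurrent encoder over speech features and a projection of visual features into a shared space, trained with a within-batch margin-based ranking loss. All layer sizes, the choice of GRU, and the loss are assumptions for illustration, not the authors' actual model.

```python
import torch
import torch.nn as nn


class SpeechEncoder(nn.Module):
    """Multi-layer GRU over acoustic feature frames (e.g. MFCCs),
    projected into a shared semantic space. Dimensions are illustrative."""
    def __init__(self, feat_dim=13, hidden_dim=512, num_layers=4, embed_dim=512):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, embed_dim)

    def forward(self, x):
        # x: (batch, time, feat_dim)
        _, h = self.rnn(x)          # h: (num_layers, batch, hidden_dim)
        return self.proj(h[-1])     # last hidden state of the top layer


class ImageEncoder(nn.Module):
    """Linear map from pre-computed visual scene features into the same space."""
    def __init__(self, img_dim=4096, embed_dim=512):
        super().__init__()
        self.proj = nn.Linear(img_dim, embed_dim)

    def forward(self, v):
        return self.proj(v)


def ranking_loss(speech_emb, image_emb, margin=0.2):
    """Margin-based loss: matched speech/image pairs should score higher
    (by cosine similarity) than mismatched pairs in the same batch."""
    s = nn.functional.normalize(speech_emb, dim=1)
    i = nn.functional.normalize(image_emb, dim=1)
    scores = s @ i.t()                                   # (batch, batch) similarities
    diag = scores.diag().view(-1, 1)
    cost_s = (margin + scores - diag).clamp(min=0)       # image as anchor
    cost_i = (margin + scores - diag.t()).clamp(min=0)   # speech as anchor
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_s = cost_s.masked_fill(mask, 0)
    cost_i = cost_i.masked_fill(mask, 0)
    return cost_s.mean() + cost_i.mean()
```

Probing where word-like representations emerge would then amount to training diagnostic classifiers on the hidden states of individual GRU layers; that analysis is not shown here.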
| Original language | English |
|---|---|
| Publication status | Published - 2017 |
| Event | Computational Linguistics in the Netherlands 27, KU Leuven, Leuven, Belgium. Duration: 10 Feb 2017 → … http://www.ccl.kuleuven.be/CLIN27/ |
Conference

| Conference | Computational Linguistics in the Netherlands 27 |
|---|---|
| Country/Territory | Belgium |
| City | Leuven |
| Period | 10/02/17 → … |
| Internet address | http://www.ccl.kuleuven.be/CLIN27/ |
Keywords
- speech
- language and vision
- cross-situational learning
- grounding
- neural networks
- representation learning