A variety of computational models can learn meanings of words and sentences from exposure to word sequences coupled with the perceptual context in which they occur. More recently, neural network models have been applied to more naturalistic and more challenging versions of this problem: for example, phoneme sequences, or raw speech audio signal accompanied by correlated visual features. In this work we introduce a multi-layer recurrent neural network model trained to project spoken sentences and their corresponding visual scene features into a shared semantic space. We then investigate to what extent representations of linguistic structures such as discrete words emerge in this model, and where within the network architecture they are localized. Our ultimate goal is to trace how auditory signals are progressively refined into meaning representations, and how this process is learned from grounded speech data.
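The core idea of the model, projecting a spoken utterance and its visual scene into a shared semantic space, can be sketched roughly as follows. This is a minimal illustration, not the actual architecture: the layer sizes, the single Elman-style recurrent layer, mean-pooling, and the margin-based ranking loss are all assumptions for exposition (the paper's model is a deeper recurrent network with its own training objective).

```python
# Minimal numpy sketch (hypothetical sizes and names): encode a sequence
# of acoustic frames with a simple recurrent layer, pool over time,
# project into a shared space, and score against visual features with a
# margin-based ranking loss.
import numpy as np

rng = np.random.default_rng(0)

D_AUDIO, D_HID, D_SHARED, D_IMG = 13, 32, 16, 64  # assumed dimensions

# randomly initialized weights (no training loop shown here)
W_in = rng.normal(scale=0.1, size=(D_AUDIO, D_HID))
W_rec = rng.normal(scale=0.1, size=(D_HID, D_HID))
W_proj_a = rng.normal(scale=0.1, size=(D_HID, D_SHARED))
W_proj_v = rng.normal(scale=0.1, size=(D_IMG, D_SHARED))

def encode_speech(frames):
    """Run a simple RNN over frames (T x D_AUDIO), mean-pool the hidden
    states, and project the result into the shared semantic space."""
    h = np.zeros(D_HID)
    states = []
    for x in frames:
        h = np.tanh(x @ W_in + h @ W_rec)
        states.append(h)
    pooled = np.mean(states, axis=0)
    z = pooled @ W_proj_a
    return z / np.linalg.norm(z)  # unit norm, so dot product = cosine

def encode_image(feats):
    """Project precomputed visual features into the shared space."""
    z = feats @ W_proj_v
    return z / np.linalg.norm(z)

def margin_loss(a, v_pos, v_neg, margin=0.2):
    """The matching utterance/scene pair should score higher, by at
    least `margin`, than a mismatched pair under cosine similarity."""
    return max(0.0, margin - a @ v_pos + a @ v_neg)

# toy usage: one utterance, its matching scene, and a distractor scene
utterance = rng.normal(size=(50, D_AUDIO))  # 50 acoustic frames
scene = rng.normal(size=D_IMG)
distractor = rng.normal(size=D_IMG)

a = encode_speech(utterance)
loss = margin_loss(a, encode_image(scene), encode_image(distractor))
```

Under this kind of objective, gradient updates push matching speech/scene pairs together in the shared space; the question the paper then asks is what intermediate linguistic structure the recurrent layers develop along the way.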
Publication status: Published - 2017
Event: Computational Linguistics in the Netherlands 27, KU Leuven, Leuven, Belgium
Duration: 10 Feb 2017 → …
- language and vision
- cross-situational learning
- neural networks
- representation learning
Chrupala, G., Alishahi, A., & Gelderloos, L. (2017). Emergence of language structures from exposure to visually grounded speech signal. Abstract from Computational Linguistics in the Netherlands 27, Leuven, Belgium.