Emergence of language structures from exposure to visually grounded speech signal

Grzegorz Chrupala, Afra Alishahi, Lieke Gelderloos

    Research output: Contribution to conference › Abstract › Other research output


    A variety of computational models can learn the meanings of words and sentences from exposure to word sequences coupled with the perceptual context in which they occur. More recently, neural network models have been applied to more naturalistic and more challenging versions of this problem: for example, phoneme sequences, or the raw speech audio signal accompanied by correlated visual features. In this work we introduce a multi-layer recurrent neural network model which is trained to project spoken sentences and their corresponding visual scene features into a shared semantic space. We then investigate to what extent representations of linguistic structures such as discrete words emerge in this model, and where within the network architecture they are localized. Our ultimate goal is to trace how auditory signals are progressively refined into meaning representations, and how this process is learned from grounded speech data.
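    The shared-space training described above is typically driven by a margin-based ranking objective that pulls matched speech/image pairs together and pushes mismatched pairs apart. The sketch below is illustrative only: the mean-pooling encoder stands in for the paper's multi-layer recurrent encoder, and all names, dimensions, and the margin value are assumptions, not the authors' implementation.

    ```python
    import numpy as np

    def l2_normalize(x, axis=-1):
        # Unit-normalize so dot products are cosine similarities.
        return x / np.linalg.norm(x, axis=axis, keepdims=True)

    def encode_speech(frames, W):
        # Stand-in for the recurrent speech encoder: mean-pool the
        # acoustic frames, then project into the shared semantic space.
        return l2_normalize(frames.mean(axis=0) @ W)

    def encode_image(visual_feats, V):
        # Linear projection of precomputed visual features (e.g. from a
        # CNN) into the same shared space.
        return l2_normalize(visual_feats @ V)

    def contrastive_loss(S, I, margin=0.2):
        # S, I: (batch, dim) embeddings where row i of S matches row i of I.
        sims = S @ I.T                  # pairwise cosine similarities
        pos = np.diag(sims)             # similarities of matched pairs
        # Hinge penalty in both retrieval directions (speech->image
        # and image->speech), ignoring the matched diagonal.
        cost_s = np.maximum(0.0, margin + sims - pos[:, None])
        cost_i = np.maximum(0.0, margin + sims - pos[None, :])
        mask = 1.0 - np.eye(len(pos))
        return ((cost_s + cost_i) * mask).sum() / len(pos)
    ```

    With perfectly aligned, well-separated embeddings the hinge terms all fall below the margin and the loss goes to zero; during training, gradients of this objective shape both encoders so that the speech signal and the visual scene land near each other in the shared space.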
    Original language: English
    Publication status: Published - 2017
    Event: Computational Linguistics in the Netherlands 27 - KU Leuven, Leuven, Belgium
    Duration: 10 Feb 2017 → …


    Conference: Computational Linguistics in the Netherlands 27
    Period: 10/02/17 → …


    • speech
    • language and vision
    • cross-situational learning
    • grounding
    • neural networks
    • representation learning


