Emergence of language structures from exposure to visually grounded speech signal

    Research output: Contribution to conference › Abstract › Other research output

    Abstract

    A variety of computational models can learn the meanings of words and sentences from exposure to word sequences coupled with the perceptual context in which they occur. More recently, neural network models have been applied to more naturalistic and more challenging versions of this problem: for example, phoneme sequences, or the raw speech audio signal accompanied by correlated visual features. In this work we introduce a multi-layer recurrent neural network model which is trained to project spoken sentences and their corresponding visual scene features into a shared semantic space. We then investigate to what extent representations of linguistic structures such as discrete words emerge in this model, and where within the network architecture they are localized. Our ultimate goal is to trace how auditory signals are progressively refined into meaning representations, and how this process is learned from grounded speech data.
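
    The abstract does not give implementation details, so the following is a minimal sketch, in PyTorch, of the kind of architecture and objective it describes: a multi-layer recurrent encoder maps acoustic frames to an utterance embedding, a linear projection maps precomputed visual scene features into the same space, and a margin-based ranking loss pulls matching speech/image pairs together. All names, dimensions, and hyperparameters below are illustrative assumptions, not the authors' actual configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpeechEncoder(nn.Module):
        """Multi-layer recurrent network over acoustic frames (e.g. MFCCs)."""
        def __init__(self, n_features=13, hidden=512, layers=4, embed=512):
            super().__init__()
            self.rnn = nn.GRU(n_features, hidden, num_layers=layers,
                              batch_first=True)
            self.proj = nn.Linear(hidden, embed)

        def forward(self, frames):               # frames: (batch, time, n_features)
            states, _ = self.rnn(frames)         # assumes equal-length inputs
            utterance = states[:, -1]            # final state summarizes the utterance
            return F.normalize(self.proj(utterance), dim=-1)

    class ImageEncoder(nn.Module):
        """Projects precomputed CNN scene features into the shared space."""
        def __init__(self, feat_dim=4096, embed=512):
            super().__init__()
            self.proj = nn.Linear(feat_dim, embed)

        def forward(self, feats):                # feats: (batch, feat_dim)
            return F.normalize(self.proj(feats), dim=-1)

    def ranking_loss(speech_emb, image_emb, margin=0.2):
        """Matching speech/image pairs should score higher, by at least
        `margin`, than mismatched pairs formed within the batch."""
        scores = speech_emb @ image_emb.t()      # cosine similarities
        positives = scores.diag().unsqueeze(1)
        cost = (margin + scores - positives).clamp(min=0)
        cost.fill_diagonal_(0)                   # ignore the matching pairs
        return cost.mean()

    Probing where representations of discrete words emerge, as the abstract proposes, would mean recording activations layer by layer; with a stacked encoder like this, one would typically build it from single-layer GRUs so that each layer's states are individually accessible.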
    Original language: English
    Publication status: Published - 2017
    Event: Computational Linguistics in the Netherlands 27 - KU Leuven, Leuven, Belgium
    Duration: 10 Feb 2017 → …
    http://www.ccl.kuleuven.be/CLIN27/

    Conference

    Conference: Computational Linguistics in the Netherlands 27
    Country: Belgium
    City: Leuven
    Period: 10/02/17 → …
    Internet address: http://www.ccl.kuleuven.be/CLIN27/


    Keywords

    • speech
    • language and vision
    • cross-situational learning
    • grounding
    • neural networks
    • representation learning

    Cite this

    Chrupala, G., Alishahi, A., & Gelderloos, L. (2017). Emergence of language structures from exposure to visually grounded speech signal. Abstract from Computational Linguistics in the Netherlands 27, Leuven, Belgium.
    @conference{c633428f7e7148b5a2a417b54506860e,
    title = "Emergence of language structures from exposure to visually grounded speech signal",
    abstract = "A variety of computational models can learn meanings of words and sentences from exposure to word sequences coupled with the perceptual context in which they occur. More recently, neural network models have been applied to more naturalistic and more challenging versions of this problem: for example phoneme sequences, or raw speech audio signal accompanied by correlated visual features. In this work we introduce a multi-layer recurrent neural network model which is trained to project spoken sentences and their corresponding visual scene features into a shared semantic space. We then investigate to what extent representations of linguistic structures such as discrete words emerge in this model, and where within the network architecture they are localized. Our ultimate goal it to trace how auditory signals are progressively refined into meaning representations, and how this processes is learned from grounded speech data.",
    keywords = "speech, language and vision, cross-situational learning, grounding, neural networks, representation learning",
    author = "Grzegorz Chrupala and Afra Alishahi and Lieke Gelderloos",
    year = "2017",
    language = "English",
    note = "Computational Linguistics in the Netherlands 27 ; Conference date: 10-02-2017",
    url = "http://www.ccl.kuleuven.be/CLIN27/",

    }
