Towards learning domain-general representations for language from multi-modal data

    Research output: Contribution to conference › Abstract › Other research output

    Abstract

    Recurrent neural networks (RNNs) have gained a reputation for producing state-of-the-art results on many NLP tasks and for producing representations of words, phrases and larger linguistic units that encode complex syntactic and semantic structures. Recently, these models have also been used extensively with multi-modal data, e.g. to solve problems such as caption generation or automatic video description. The contribution of our present work is two-fold: a) we propose a novel multi-modal recurrent neural network architecture for domain-general representation learning, and b) we introduce a number of methods that “open up the black box” and shed light on what kind of linguistic knowledge the network learns. We propose IMAGINET, an RNN architecture that learns visually grounded representations of language from coupled textual and visual input. The model consists of two Gated Recurrent Unit (GRU) networks with a shared word embedding matrix. It uses a multi-task objective: given a textual description of a scene, it concurrently predicts the scene's visual representation (extracted from images using a CNN trained on ImageNet) and the next word in the sentence. Moreover, we perform two exploratory analyses: a) we show that the model learns to effectively use sequential structure in semantic interpretation, and b) we propose two methods to explore the importance of grammatical categories with respect to the model and the task. We observe that the model pays most attention to head-words, noun subjects and adjectival modifiers, and least to determiners and prepositions.
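
    The following is a minimal sketch, in PyTorch, of how such a two-pathway multi-task model could be implemented. It is illustrative only: the layer sizes, the 4096-dimensional image-feature target, the mean-squared-error visual loss and the mixing weight alpha are assumptions made for the example, not details taken from the abstract.

    # Minimal IMAGINET-style sketch (assumptions: PyTorch; dimensions, losses
    # and the mixing weight are illustrative, not taken from the abstract).
    import torch
    import torch.nn as nn

    class Imaginet(nn.Module):
        def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, img_dim=4096):
            super().__init__()
            # Shared word embedding matrix used by both pathways.
            self.embed = nn.Embedding(vocab_size, embed_dim)
            # Visual pathway: a GRU whose final state predicts the image features.
            self.visual_gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
            self.to_image = nn.Linear(hidden_dim, img_dim)
            # Textual pathway: a GRU used as a next-word language model.
            self.textual_gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
            self.to_vocab = nn.Linear(hidden_dim, vocab_size)

        def forward(self, tokens):
            e = self.embed(tokens)                    # (batch, seq, embed_dim)
            v_out, _ = self.visual_gru(e)
            img_pred = self.to_image(v_out[:, -1])    # predict image features from the last state
            t_out, _ = self.textual_gru(e)
            word_logits = self.to_vocab(t_out)        # predict the next word at each position
            return img_pred, word_logits

    def multitask_loss(img_pred, img_feats, word_logits, next_tokens, alpha=0.5):
        # Weighted sum of the two objectives; alpha is an illustrative mixing weight.
        visual_loss = nn.functional.mse_loss(img_pred, img_feats)
        textual_loss = nn.functional.cross_entropy(
            word_logits.reshape(-1, word_logits.size(-1)), next_tokens.reshape(-1))
        return alpha * visual_loss + (1 - alpha) * textual_loss

    In such a setup, each caption would be paired with the CNN features of its image: the visual pathway's final hidden state is regressed onto those features, while the textual pathway is trained as a standard next-word language model, and the shared embedding matrix ties the two tasks together.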
    Original language: English
    Publication status: Published - 2015
    Event: The 26th Meeting of Computational Linguistics in the Netherlands (CLIN26) - Amsterdam, Netherlands
    Duration: 18 Dec 2015 → …

    Conference

    Conference: The 26th Meeting of Computational Linguistics in the Netherlands (CLIN26)
    Country: Netherlands
    City: Amsterdam
    Period: 18/12/15 → …

    Cite this

    Kadar, A., Chrupala, G., & Alishahi, A. (2015). Towards learning domain-general representations for language from multi-modal data. Abstract from The 26th Meeting of Computational Linguistics in the Netherlands (CLIN26), Amsterdam, Netherlands.