Towards learning domain-general representations for language from multi-modal data

    Research output: Contribution to conference › Abstract › Other research output


    Recurrent neural networks (RNNs) have gained a reputation for producing state-of-the-art results on many NLP tasks and for learning representations of words, phrases and larger linguistic units that encode complex syntactic and semantic structures. Recently, these models have also been used extensively with multi-modal data, e.g. for caption generation and automatic video description. The contribution of our present work is two-fold: a) we propose a novel multi-modal recurrent neural network architecture for domain-general representation learning, and b) we present a number of methods that “open up the black box” and shed light on what kind of linguistic knowledge the network learns. We propose IMAGINET, an RNN architecture that learns visually grounded representations of language from coupled textual and visual input. The model consists of two Gated Recurrent Unit networks with a shared word embedding matrix. It uses a multi-task objective: it receives a textual description of a scene and concurrently predicts its visual representation (extracted from images using a CNN trained on ImageNet) and the next word in the sentence. Moreover, we perform two exploratory analyses: a) we show that the model learns to effectively use sequential structure in semantic interpretation, and b) we propose two methods to explore the importance of grammatical categories with respect to the model and the task. We observe that the model pays most attention to head words, noun subjects and adjectival modifiers, and least to determiners and prepositions.
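    The described architecture (two GRUs over a shared embedding matrix, one head predicting a CNN image vector and one predicting the next word) could be sketched roughly as follows. This is an illustrative assumption, not the authors' actual implementation; all names and dimensions are hypothetical.

```python
# Hypothetical sketch of an IMAGINET-style multi-task model; names and
# dimensions are illustrative, not taken from the original work.
import torch
import torch.nn as nn

class ImaginetSketch(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128, img_dim=4096):
        super().__init__()
        # Shared word embedding matrix feeding both pathways.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Two Gated Recurrent Unit networks: visual pathway and language model.
        self.gru_visual = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.gru_textual = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # Visual head: predict the CNN image vector from the final hidden state.
        self.to_image = nn.Linear(hidden_dim, img_dim)
        # Textual head: predict the next word at every position.
        self.to_word = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        e = self.embed(tokens)                      # (batch, seq, embed_dim)
        _, h_vis = self.gru_visual(e)               # final state: (1, batch, hidden)
        out_txt, _ = self.gru_textual(e)            # all states: (batch, seq, hidden)
        img_pred = self.to_image(h_vis.squeeze(0))  # (batch, img_dim)
        word_logits = self.to_word(out_txt)         # (batch, seq, vocab)
        return img_pred, word_logits

model = ImaginetSketch()
tokens = torch.randint(0, 1000, (2, 7))  # toy batch: 2 sentences of 7 tokens
img_pred, word_logits = model(tokens)
```

    In training, the two heads would receive separate losses (e.g. a regression loss against the CNN image vector and a cross-entropy loss for next-word prediction), combined into the multi-task objective; the shared embedding matrix is what couples the two tasks.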
    Original language: English
    Publication status: Published - 2015
    Event: The 26th Meeting of Computational Linguistics in the Netherlands (CLIN26) - Amsterdam, Netherlands
    Duration: 18 Dec 2015 → …


    Conference: The 26th Meeting of Computational Linguistics in the Netherlands (CLIN26)
    Period: 18/12/15 → …


