Modeling relations in a referential game

    Research output: Contribution to conference › Abstract › Other research output


    Grounding language in the physical world enables humans to use words and sentences in context and to link them to actions. Several recent computer vision studies have addressed the task of expression grounding: learning to select the part of an image that depicts the referent of a multi-word expression. The task is approached by jointly processing the language expression, the visual information of individual candidate referents, and in some cases the general visual context, using neural models that combine recurrent and convolutional components (Rohrbach et al., 2016; Hu et al., 2016b,a). However, the phrasing of a referring expression is determined by more than the intended referent alone. When referring to an element of a scene, speakers take into account its relations with, and contrasts to, other elements in order to produce an expression that uniquely identifies the intended referent. Inspired by recent work on visual question answering using Relation Networks (Santoro et al., 2017), we build and evaluate models of expression grounding that take into account interactions between elements of the visual scene. We provide an analysis of the performance and of the relational representations learned in this setting.
    Original language: English
    Publication status: Published - 2017
    Event: Computational Linguistics in the Netherlands - Nijmegen, Netherlands
    Duration: 26 Jan 2018 → …




