Grounding language in the physical world enables humans to use words and sentences in context and to link them to actions. Several recent computer vision studies have addressed the task of expression grounding: learning to select the part of an image that depicts the referent of a multi-word expression. The task is approached by jointly processing the language expression, the visual information of individual candidate referents, and in some cases the general visual context, using neural models that combine recurrent and convolutional components (Rohrbach et al., 2016; Hu et al., 2016b,a). However, more than the intended referent alone determines how a referring expression is phrased. When referring to an element of a scene, speakers take its relations with and contrasts to other elements into account in order to produce an expression that uniquely identifies the intended referent. Inspired by recent work on visual question answering using Relation Networks (Santoro et al., 2017), we build and evaluate models of expression grounding that take into account interactions between elements of the visual scene. We provide an analysis of the performance and of the relational representations learned in this setting.
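To make the idea of relation-aware grounding concrete, the following is a minimal sketch and not the architecture evaluated in this work. It assumes each candidate region and the expression are already encoded as fixed-length vectors, and it scores a candidate by summing a pairwise function g over all other regions before applying a readout f, loosely following the Relation Network formulation. All names (`relational_grounding`, `g_layer`, `f_layer`), the dimensions, and the random weights are hypothetical placeholders for trained components.

```python
import math
import random

def linear(weights, bias, x):
    """Affine map: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_layer(rng, n_out, n_in):
    """Random weights stand in for trained parameters (illustration only)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relational_grounding(regions, expression, g_layer, f_layer):
    """Score candidate i by pooling g(region_i, region_j, expression) over j != i.

    Assumption: anchoring the pairwise sum on one candidate is an adaptation
    chosen for this sketch, not a detail taken from the source.
    """
    scores = []
    for i, cand in enumerate(regions):
        pooled = [0.0] * len(g_layer[1])
        for j, other in enumerate(regions):
            if i == j:
                continue
            # Pairwise relation between the candidate and one context region.
            rel = relu(linear(*g_layer, cand + other + expression))
            pooled = [p + r for p, r in zip(pooled, rel)]
        # Readout f maps the pooled relations to a single score.
        scores.append(linear(*f_layer, pooled)[0])
    return softmax(scores)

# Hypothetical toy inputs: 5 candidate regions and one expression embedding.
rng = random.Random(0)
region_dim, expr_dim, hidden = 4, 3, 8
regions = [[rng.random() for _ in range(region_dim)] for _ in range(5)]
expression = [rng.random() for _ in range(expr_dim)]
g_layer = init_layer(rng, hidden, 2 * region_dim + expr_dim)
f_layer = init_layer(rng, 1, hidden)

probs = relational_grounding(regions, expression, g_layer, f_layer)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Here `probs` is a distribution over the candidate regions and `predicted` is the index of the highest-scoring one. In a real system the region and expression vectors would presumably come from convolutional and recurrent encoders, and g and f would be trained end to end.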
|Publication status||Published - 2017|
|Event||Computational Linguistics in the Netherlands - Nijmegen, Netherlands (Duration: 26 Jan 2018 → …)|
|Conference||Computational Linguistics in the Netherlands|
|Period||26/01/18 → …|