Abstract
We propose Imaginet, a model that learns visually grounded representations of language from coupled textual and visual input. The model consists of two Gated Recurrent Unit networks with shared word embeddings and is trained with a multi-task objective: given a textual description of a scene, it concurrently predicts the scene's visual representation and the next word in the sentence. Like humans, it acquires meaning representations for individual words from descriptions of visual scenes. Moreover, it learns to exploit sequential structure effectively in the semantic interpretation of multi-word phrases.
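The architecture described in the abstract can be sketched as follows. This is a minimal, hypothetical numpy forward pass, not the authors' implementation: all names, layer sizes, and the output heads are illustrative assumptions. Two GRU pathways read the same shared embeddings; one emits next-word logits at each step, the other maps its final state to a predicted image-feature vector. The multi-task training loss (e.g. cross-entropy for words plus a regression loss on the visual vector) is omitted.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def make_gru(rng, emb, hid):
    # Parameters for one GRU pathway (illustrative initialization).
    s = lambda *sh: rng.normal(0.0, 0.1, sh)
    return dict(Wz=s(emb, hid), Uz=s(hid, hid),
                Wr=s(emb, hid), Ur=s(hid, hid),
                Wh=s(emb, hid), Uh=s(hid, hid))

def gru_step(x, h, p):
    # Standard GRU update: update gate z, reset gate r, candidate state.
    z = sigmoid(x @ p["Wz"] + h @ p["Uz"])
    r = sigmoid(x @ p["Wr"] + h @ p["Ur"])
    h_tilde = np.tanh(x @ p["Wh"] + (r * h) @ p["Uh"])
    return (1 - z) * h + z * h_tilde

class ImaginetSketch:
    """Hypothetical sketch: two GRUs over shared word embeddings."""

    def __init__(self, vocab=100, emb=32, hid=64, img_dim=128, seed=0):
        rng = np.random.default_rng(seed)
        self.E = rng.normal(0.0, 0.1, (vocab, emb))  # shared embeddings
        self.txt = make_gru(rng, emb, hid)           # textual pathway
        self.vis = make_gru(rng, emb, hid)           # visual pathway
        self.W_word = rng.normal(0.0, 0.1, (hid, vocab))   # next-word head
        self.W_img = rng.normal(0.0, 0.1, (hid, img_dim))  # image head
        self.hid = hid

    def forward(self, token_ids):
        h_t = np.zeros(self.hid)
        h_v = np.zeros(self.hid)
        word_logits = []
        for t in token_ids:
            x = self.E[t]  # the same embedding feeds both pathways
            h_t = gru_step(x, h_t, self.txt)
            h_v = gru_step(x, h_v, self.vis)
            word_logits.append(h_t @ self.W_word)  # next-word prediction
        img_pred = h_v @ self.W_img  # visual vector from final hidden state
        return np.stack(word_logits), img_pred

model = ImaginetSketch()
logits, img = model.forward([3, 17, 42, 5])
```

Sharing the embedding matrix `E` between the two pathways is what lets supervision from the visual task shape the word representations, which is the grounding effect the abstract describes.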
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) |
| Editors | Chengqing Zong, Michael Strube |
| Place of Publication | Beijing, China |
| Publisher | Association for Computational Linguistics |
| Pages | 112-118 |
| Number of pages | 6 |
| ISBN (Electronic) | 9781941643730 |
| Publication status | Published - 2015 |
| Event | 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China |
| Duration | 26 Jul 2015 → 31 Jul 2015 |
Conference
| Conference | 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing |
|---|---|
| Country/Territory | China |
| City | Beijing |
| Period | 26/07/15 → 31/07/15 |