Measuring the Diversity of Automatic Image Descriptions

Emiel van Miltenburg, Desmond Elliott, Piek Vossen

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > Scientific > peer-review

Abstract

Automatic image description systems typically produce generic sentences that only make use of a small subset of the vocabulary available to them. In this paper, we consider the production of generic descriptions as a lack of diversity in the output, which we quantify using established metrics and two new metrics that frame image description as a word recall task. This framing allows us to evaluate system performance on the head of the vocabulary, as well as on the long tail, where system performance degrades. We use these metrics to examine the diversity of the sentences generated by nine state-of-the-art systems on the MS COCO data set. We find that the systems trained with maximum likelihood objectives produce less diverse output than those trained with additional adversarial objectives. However, the adversarially-trained models only produce more types from the head of the vocabulary and not the tail. Besides vocabulary-based methods, we also look at the compositional capacity of the systems, specifically their ability to create compound nouns and prepositional phrases of different lengths. We conclude that there is still much room for improvement, and offer a toolkit to measure progress towards the goal of generating more diverse image descriptions.
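The word recall framing mentioned in the abstract can be illustrated with a small sketch. It is not the paper's definition of its two metrics: it assumes recall is the fraction of reference word types that also occur somewhere in the system output, and it assumes the head of the vocabulary is simply the `head_size` most frequent reference types. The names `tokenize`, `word_recall`, and `head_size` are hypothetical and introduced only for this example.

```python
from collections import Counter


def tokenize(sentence):
    # Simplifying assumption: lowercase and split on whitespace.
    return sentence.lower().split()


def word_recall(reference_sentences, generated_sentences, head_size):
    """Fraction of reference word types that the system output also uses.

    Assumption (not taken from the paper): the head is the `head_size`
    most frequent reference types and the tail is everything else.
    """
    ref_counts = Counter(
        tok for sent in reference_sentences for tok in tokenize(sent)
    )
    generated_types = {
        tok for sent in generated_sentences for tok in tokenize(sent)
    }

    # Rank types by descending frequency, breaking ties alphabetically
    # so the head/tail split is deterministic.
    ranked = [
        word
        for word, _ in sorted(ref_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    ]
    head, tail = ranked[:head_size], ranked[head_size:]

    def recall(types):
        if not types:
            return 0.0
        return sum(word in generated_types for word in types) / len(types)

    return {"overall": recall(ranked), "head": recall(head), "tail": recall(tail)}


# Toy data, invented for illustration only.
references = ["a man rides a horse", "a man rides a brown stallion"]
generated = ["a man rides a horse"]
scores = word_recall(references, generated, head_size=3)
print(scores)
```

On this toy data the system recovers every head type but only one of the three tail types, which is the kind of head versus tail gap the abstract describes.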
Original language: English
Title of host publication: Proceedings of the 27th International Conference on Computational Linguistics
Publisher: Association for Computational Linguistics
Pages: 1730-1741
Number of pages: 12
Publication status: Published - Aug 2018
Event: International Conference on Computational Linguistics 2018 - Santa Fe Community Convention Center, Santa Fe, United States
Duration: 20 Aug 2018 to 26 Aug 2018
Conference number: 27
http://coling2018.org/

Conference

Conference: International Conference on Computational Linguistics 2018
Abbreviated title: COLING 2018
Country: United States
City: Santa Fe
Period: 20/08/18 to 26/08/18

Cite this

van Miltenburg, E., Elliott, D., & Vossen, P. (2018). Measuring the Diversity of Automatic Image Descriptions. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 1730-1741). Association for Computational Linguistics.
@inproceedings{9266717f64c748ecb68e9beb1ac6b7e0,
title = "Measuring the Diversity of Automatic Image Descriptions",
author = "{van Miltenburg}, Emiel and Desmond Elliott and Piek Vossen",
year = "2018",
month = "8",
language = "English",
pages = "1730--1741",
booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
publisher = "Association for Computational Linguistics",

}
