Symbolic Inductive Bias for Visually Grounded Learning of Spoken Language

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

Abstract

A widespread approach to processing spoken language is to first automatically transcribe it into text. An alternative is to use an end-to-end approach: recent works have proposed to learn semantic embeddings of spoken language from images with spoken captions, without an intermediate transcription step. We propose to use multitask learning to exploit existing transcribed speech within the end-to-end setting. We describe a three-task architecture which combines the objectives of matching spoken captions with corresponding images, speech with text, and text with images. We show that the addition of the speech/text task leads to substantial performance improvements on image retrieval when compared to training the speech/image task in isolation. We conjecture that this is due to the strong inductive bias that transcribed speech provides to the model, and offer supporting evidence for this.
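
The abstract describes a three-task matching architecture trained with multitask learning. As a rough illustration only, the sketch below shows one way such an objective could be combined in Python (PyTorch); the margin-based contrastive ranking loss, the mean-pooled GRU encoders over speech features and characters, the precomputed image features, and all dimensions are assumptions for illustration and are not specified on this page or taken from the authors' code.

# Minimal sketch (not the authors' released implementation) of a three-task
# matching objective: speech/image, speech/text, and text/image, each trained
# with a margin-based contrastive ranking loss and summed for multitask learning.
import torch
import torch.nn as nn


def contrastive_loss(a, b, margin=0.2):
    """Margin ranking loss over a batch of paired embeddings a[i] <-> b[i]."""
    a = nn.functional.normalize(a, dim=1)
    b = nn.functional.normalize(b, dim=1)
    scores = a @ b.t()                                    # cosine similarities, batch x batch
    pos = scores.diag().unsqueeze(1)                      # matching pairs lie on the diagonal
    cost_a = (margin + scores - pos).clamp(min=0)         # rank b-items given each a-item
    cost_b = (margin + scores - pos.t()).clamp(min=0)     # rank a-items given each b-item
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    return cost_a.masked_fill(mask, 0).mean() + cost_b.masked_fill(mask, 0).mean()


class ThreeTaskModel(nn.Module):
    """Hypothetical encoders: a GRU over speech features, a GRU over characters,
    and a linear projection of precomputed image features into a shared space."""
    def __init__(self, speech_dim=39, char_vocab=100, img_dim=2048, emb=512):
        super().__init__()
        self.speech_rnn = nn.GRU(speech_dim, emb, batch_first=True)
        self.char_emb = nn.Embedding(char_vocab, 128)
        self.text_rnn = nn.GRU(128, emb, batch_first=True)
        self.img_proj = nn.Linear(img_dim, emb)

    def forward(self, speech, text, image):
        s = self.speech_rnn(speech)[0].mean(dim=1)        # pooled speech embedding
        t = self.text_rnn(self.char_emb(text))[0].mean(dim=1)  # pooled text embedding
        i = self.img_proj(image)                          # projected image embedding
        # Multitask objective: sum of the three pairwise matching losses.
        return (contrastive_loss(s, i)     # speech/image
                + contrastive_loss(s, t)   # speech/text
                + contrastive_loss(t, i))  # text/image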
Original language: English
Title of host publication: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Place of publication: Florence, Italy
Publisher: Association for Computational Linguistics
Pages: 6452-6462
Number of pages: 11
Publication status: Published - 1 Jul 2019

Fingerprint

Image retrieval
Transcription
Semantics
Processing

Cite this

Chrupala, G. (2019). Symbolic Inductive Bias for Visually Grounded Learning of Spoken Language. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 6452-6462). Florence, Italy: Association for Computational Linguistics.