Abstract
We present a model of visually-grounded language learning based on stacked gated recurrent neural networks which learns to predict visual features given an image description in the form of a sequence of phonemes. The learning task resembles that faced by human language learners who need to discover both structure and meaning from noisy and ambiguous data across modalities. We show that our model indeed learns to predict features of the visual context given phonetically transcribed image descriptions, and show that it represents linguistic information in a hierarchy of levels: lower layers in the stack are comparatively more sensitive to form, whereas higher layers are more sensitive to meaning.
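The architecture the abstract describes can be pictured as a phoneme-level recurrent encoder whose final state is regressed onto the visual features of the paired image. The following is a minimal, illustrative sketch of such a model, not the authors' implementation: the use of PyTorch, the layer sizes, the number of GRU layers, and the mean-squared-error objective are all assumptions made for the example.

```python
# Illustrative sketch (not the authors' code) of a stacked-GRU model that
# reads a phoneme sequence and predicts an image feature vector.
# All dimensions, names, and the loss choice are hypothetical.

import torch
import torch.nn as nn


class PhonemeToImageModel(nn.Module):
    def __init__(self, num_phonemes=50, embed_dim=64,
                 hidden_dim=512, num_layers=3, visual_dim=4096):
        super().__init__()
        # Embed each phoneme symbol into a continuous vector.
        self.embed = nn.Embedding(num_phonemes, embed_dim)
        # Stacked gated recurrent layers; the abstract reports that lower
        # layers are comparatively more sensitive to form, higher layers
        # to meaning.
        self.gru = nn.GRU(embed_dim, hidden_dim,
                          num_layers=num_layers, batch_first=True)
        # Map the top layer's final state into the visual feature space
        # (e.g., CNN features of the described image).
        self.to_visual = nn.Linear(hidden_dim, visual_dim)

    def forward(self, phoneme_ids):
        # phoneme_ids: (batch, seq_len) integer-coded phonetic transcription.
        embedded = self.embed(phoneme_ids)
        _, hidden = self.gru(embedded)
        # hidden[-1]: final hidden state of the top GRU layer, per sequence.
        return self.to_visual(hidden[-1])


if __name__ == "__main__":
    model = PhonemeToImageModel()
    # Toy batch: 2 "utterances" of 12 phoneme ids each, random image targets.
    phonemes = torch.randint(0, 50, (2, 12))
    image_features = torch.randn(2, 4096)
    prediction = model(phonemes)
    # Plain MSE regression onto the image features is one possible objective;
    # the paper's actual training criterion may differ.
    loss = nn.functional.mse_loss(prediction, image_features)
    loss.backward()
    print(prediction.shape, loss.item())
```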
| | |
|---|---|
| Original language | English |
| Title of host publication | Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers |
| Publisher | International Committee on Computational Linguistics |
| Pages | 1309-1319 |
| Number of pages | 10 |
| ISBN (Electronic) | 978-4-87974-702-0 |
| Publication status | Published - 2016 |
Cite this
From phonemes to images: levels of representation in a recurrent neural model of visually-grounded language learning. / Gelderloos, L.J.; Chrupala, Grzegorz.
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. International Committee on Computational Linguistics, 2016. p. 1309-1319.
Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review
TY - GEN
T1 - From phonemes to images
T2 - levels of representation in a recurrent neural model of visually-grounded language learning
AU - Gelderloos, L.J.
AU - Chrupala, Grzegorz
PY - 2016
Y1 - 2016
N2 - We present a model of visually-grounded language learning based on stacked gated recurrent neural networks which learns to predict visual features given an image description in the form of a sequence of phonemes. The learning task resembles that faced by human language learners who need to discover both structure and meaning from noisy and ambiguous data across modalities. We show that our model indeed learns to predict features of the visual context given phonetically transcribed image descriptions, and show that it represents linguistic information in a hierarchy of levels: lower layers in the stack are comparatively more sensitive to form, whereas higher layers are more sensitive to meaning.
AB - We present a model of visually-grounded language learning based on stacked gated recurrent neural networks which learns to predict visual features given an image description in the form of a sequence of phonemes. The learning task resembles that faced by human language learners who need to discover both structure and meaning from noisy and ambiguous data across modalities. We show that our model indeed learns to predict features of the visual context given phonetically transcribed image descriptions, and show that it represents linguistic information in a hierarchy of levels: lower layers in the stack are comparatively more sensitive to form, whereas higher layers are more sensitive to meaning.
M3 - Conference contribution
SP - 1309
EP - 1319
BT - Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
PB - International Committee on Computational Linguistics
ER -