Abstract
We study the representation and encoding of phonemes in a recurrent
neural network model of grounded speech. We use a model which
processes images and their spoken descriptions, and projects the
visual and auditory representations into the same semantic space. We
perform a number of analyses on how information about individual
phonemes is encoded in the MFCC features extracted from the speech
signal, and in the activations of the layers of the model. Via
experiments with phoneme decoding and phoneme discrimination, we show
that phoneme representations are most salient in the lower layers of
the model, where low-level signals are processed at a fine-grained
level, although a large amount of phonological information is retained at
the top recurrent layer. We further find that the
attention mechanism following the top recurrent layer significantly
attenuates encoding of phonology and makes the utterance embeddings
much more invariant to synonymy. Moreover, a hierarchical clustering
of phoneme representations learned by the network shows an
organizational structure of phonemes similar to that proposed in
linguistics.
Original language | English |
---|---|
Title of host publication | Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017) |
Editors | Roger Levy, Lucia Specia |
Place of Publication | Vancouver, Canada |
Publisher | Association for Computational Linguistics |
Pages | 368-378 |
Number of pages | 11 |
ISBN (Electronic) | 9781945626548 |
Publication status | Published - 2017 |
Event | Conference on Computational Natural Language Learning: CoNLL 2017, Vancouver, Canada; Duration: 3 Aug 2017 → 4 Aug 2017; Conference number: 21 |
Conference
Conference | Conference on Computational Natural Language Learning |
---|---|
Country/Territory | Canada |
City | Vancouver |
Period | 3/08/17 → 4/08/17 |
Fingerprint
Dive into the research topics of 'Encoding of phonology in a recurrent neural model of grounded speech'. Together they form a unique fingerprint.
Prizes
- CoNLL 2017 Best Paper Award
Alishahi, A. (Recipient), Barking, M. (Recipient) & Chrupala, G. (Recipient), 2017
Prize