Learning character-wise text representations with Elman nets

Research output: Contribution to conference › Abstract › Other research output

    Abstract

Simple recurrent networks (SRNs) were introduced by Elman (1990) to model temporal structure in general, and sequential structure in language in particular. More recently, SRN-based language models have become practical to train on large datasets and have been shown to outperform n-gram language models for speech recognition (Mikolov et al., 2010). In a parallel development, word embeddings induced using feedforward neural networks have been shown to provide expressive and informative features for many language processing tasks (Collobert et al., 2011; Socher et al., 2012).
The majority of text representations used in computational linguistics are based on words as the smallest units. Words are not always the most appropriate atomic unit: this is the case for languages where orthographic words correspond to whole English phrases or sentences. It is equally the case when the text analysis task needs to be performed at the character level, for example when segmenting text into tokens or when normalizing corrupted text into its canonical form.
In this work we propose a mechanism to learn character-level representations of text. Our representations are low-dimensional real-valued embeddings which form an abstraction over the character string preceding each position in a stream of characters. They correspond to the activations of the hidden layer in a simple recurrent neural network. The network is trained as a language model: it is presented sequentially with each character in a string (encoded as a one-hot vector) and learns to predict the next character in the sequence. The representation of the history is stored in a limited number of hidden units (we use 400), which forces the network to create a compressed and abstract representation rather than memorize verbatim strings. After training the network on large amounts of unlabeled text, it can be run on unseen character sequences, and the activations of its hidden-layer units can be recorded at each position and used as features in a supervised learning model.
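
As a concrete illustration, here is a minimal sketch (not the authors' implementation) of such an Elman-style character-level language model in plain numpy. The toy corpus, the single-step truncation of backpropagation, and all hyperparameters except the 400 hidden units are illustrative assumptions.

import numpy as np

# Minimal Elman-style SRN character language model (illustrative sketch).
rng = np.random.default_rng(0)
text = "hello world, hello benelearn "   # toy stand-in for a large unlabeled corpus
chars = sorted(set(text))
idx = {c: i for i, c in enumerate(chars)}
V, H = len(chars), 400                   # vocabulary size; 400 hidden units as in the abstract

# Parameters: input-to-hidden, recurrent hidden-to-hidden, hidden-to-output.
Wxh = rng.normal(0, 0.01, (H, V))
Whh = rng.normal(0, 0.01, (H, H))
Why = rng.normal(0, 0.01, (V, H))
bh, by = np.zeros(H), np.zeros(V)

def one_hot(i):
    x = np.zeros(V)
    x[i] = 1.0
    return x

def step(h_prev, i):
    """One Elman step: new hidden state and distribution over the next character."""
    h = np.tanh(Wxh @ one_hot(i) + Whh @ h_prev + bh)
    logits = Why @ h + by
    p = np.exp(logits - logits.max())
    return h, p / p.sum()

# Train as a language model: predict the next character at each position.
# Backpropagation is truncated to a single time step here, a simplification.
lr = 0.05
for epoch in range(200):
    h = np.zeros(H)
    for t in range(len(text) - 1):
        i, j = idx[text[t]], idx[text[t + 1]]
        h_prev = h
        h, p = step(h_prev, i)
        dy = p.copy(); dy[j] -= 1.0               # cross-entropy gradient at the output
        dh = (Why.T @ dy) * (1.0 - h * h)         # backprop through tanh
        Why -= lr * np.outer(dy, h); by -= lr * dy
        Wxh -= lr * np.outer(dh, one_hot(i)); bh -= lr * dh
        Whh -= lr * np.outer(dh, h_prev)

def embed(s):
    """Run the trained network over unseen text and record the hidden
    activations at each position; these are the character-level features."""
    h, feats = np.zeros(H), []
    for c in s:
        h, _ = step(h, idx.get(c, 0))   # unknown characters fall back to index 0 (illustrative)
        feats.append(h.copy())
    return np.stack(feats)              # shape: (len(s), H)

The embed function captures the key point of the abstract: after training, the per-position hidden activations are recorded and reused as feature vectors in downstream supervised models.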

We use these representations as input features (in addition to character n-grams) for text analysis tasks: learning to detect and label programming-language code samples embedded in natural language text (Chrupala, 2013), learning to segment text into words and sentences (Evang et al., 2013), and learning to translate non-canonical user-generated content into a normalized form (Chrupala, 2014). For all tasks and languages we obtain consistent performance boosts compared with using only character n-gram features, with relative error reductions ranging from around 12% for English tweet normalization to around 85% for Dutch word and sentence segmentation.
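
To illustrate the downstream use, the sketch below (again an assumption-laden toy, not the setup of the cited papers) concatenates the recorded activations from embed above with hashed character n-gram indicator features and trains a per-position scikit-learn classifier on toy word-boundary labels.

import numpy as np
from sklearn.linear_model import LogisticRegression

def ngram_features(s, n=3, dim=64):
    """Hashed character n-gram indicator features for each position (illustrative)."""
    X = np.zeros((len(s), dim))
    for t in range(len(s)):
        for k in range(1, n + 1):        # n-grams ending at position t
            g = s[max(0, t - k + 1):t + 1]
            X[t, hash(g) % dim] = 1.0
    return X

s = "hello world"
y = [1 if c == " " else 0 for c in s]    # toy labels: token boundary or not
X = np.hstack([embed(s), ngram_features(s)])   # recurrent embeddings + n-gram features
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X))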

References

Chrupala, G. (2013). Text segmentation with character-level text embeddings. ICML Workshop on Deep Learning for Audio, Speech and Language Processing.
Chrupala, G. (2014). Normalizing tweets with edit scripts and recurrent neural embeddings. ACL.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12, 2493–2537.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211.
Evang, K., Basile, V., Chrupala, G., & Bos, J. (2013). Elephant: Sequence labeling for word and sentence segmentation. EMNLP.
Mikolov, T., Karafiát, M., Burget, L., Černocký, J., & Khudanpur, S. (2010). Recurrent neural network based language model. INTERSPEECH.
Socher, R., Huval, B., Manning, C. D., & Ng, A. Y. (2012). Semantic compositionality through recursive matrix-vector spaces. EMNLP-CoNLL.
Original language: English
Publication status: Published - 2014
Event: 23rd Annual Belgian-Dutch Conference on Machine Learning (BENELEARN 2014), Brussels, Belgium
Duration: 6 Jun 2014 → …
