Language-agnostic processing of microblog data with text embeddings

    Research output: Contribution to conference; Poster; Other research output


    A raw stream of posts from a microblogging platform such as Twitter contains text written in a large variety of languages and writing systems, in registers ranging from formal to internet slang. A significant amount of effort has been expended in recent years to adapt standard NLP pipelines to deal with such content. In this paper we suggest a less labor-intensive approach to processing multilingual user-generated content.

    We induce low-dimensional distributed representations of text by training a recurrent neural network on the raw bytestream of a microblog feed. Such representations have recently been shown to be effective as learned features for sequence labeling tasks such as word, sentence and text segmentation (Chrupała 2013, Evang et al. 2013).
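
    To make the idea of byte-level representations concrete, the following is a minimal sketch, assuming a simple Elman-style recurrent network; the abstract does not specify the architecture, so this choice is an assumption. The names `init_rnn` and `byte_embeddings` are hypothetical, the weights are random rather than trained, and the training step (for example, next-byte prediction) is omitted. The sketch only shows how one hidden-state vector per UTF-8 byte could be read off the network, regardless of language or script.

```python
import math
import random


def init_rnn(hidden_size, seed=0):
    """Randomly initialise an Elman-style RNN over the 256 possible byte values.

    Assumption: these weights stand in for trained ones; no training is done here.
    """
    rng = random.Random(seed)
    w_in = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
            for _ in range(256)]
    w_rec = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
             for _ in range(hidden_size)]
    return w_in, w_rec


def byte_embeddings(text, w_in, w_rec):
    """Return one hidden-state vector per UTF-8 byte of `text`."""
    hidden_size = len(w_rec)
    h = [0.0] * hidden_size
    states = []
    for b in text.encode("utf-8"):
        # With a one-hot input, the input projection is just row b of w_in.
        x = w_in[b]
        h = [
            math.tanh(x[j] + sum(h[i] * w_rec[i][j] for i in range(hidden_size)))
            for j in range(hidden_size)
        ]
        states.append(h)
    return states
```

    Because the input is the byte stream rather than tokens, no tokenizer or language identification is needed: a non-ASCII character such as "é" simply contributes two bytes and therefore two vectors.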

    In the current work we propose two new scenarios for using such representations. First, we employ them in a sequence transduction setting for tweet normalization. Second, we propose a simple way to build a distributed bag-of-words analog using byte-level text embeddings, and apply it in a hashtag recommendation model.
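
    The abstract does not say how the bag-of-words analog is constructed. One plausible reading, shown here purely as an assumption, is to mean-pool the per-byte embedding vectors of a post into a single fixed-size vector and then rank candidate hashtags by cosine similarity to it. The functions `mean_pool`, `cosine` and `rank_hashtags`, and the idea of one vector per hashtag, are hypothetical illustrations rather than the model used in the poster.

```python
import math


def mean_pool(vectors):
    """Average a sequence of equal-length vectors into one fixed-size vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_hashtags(post_vector, hashtag_vectors):
    """Sort hashtags by cosine similarity to the pooled post vector, best first.

    Assumption: `hashtag_vectors` maps each hashtag to a vector in the same
    space, e.g. the mean of pooled vectors of posts that used that hashtag.
    """
    return sorted(
        hashtag_vectors,
        key=lambda tag: cosine(post_vector, hashtag_vectors[tag]),
        reverse=True,
    )
```

    Like a bag of words, the pooled vector discards the order of its elements, but it is dense and its size does not depend on the vocabulary or the length of the post.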


    Kilian Evang, Valerio Basile, Grzegorz Chrupała, Johan Bos. 2013. Elephant: Sequence Labeling for Word and Sentence Segmentation. EMNLP.

    Grzegorz Chrupała. 2013. Text segmentation with character-level text embeddings. ICML Workshop on Deep Learning for Audio, Speech and Language Processing.

    Original language: English
    Publication status: Published - 2014
    Event: 24th Meeting of Computational Linguistics in The Netherlands (CLIN 2014) - Leiden, Netherlands
    Duration: 17 Jan 2014 → …



