Language-agnostic processing of microblog data with text embeddings

    Research output: Contribution to conference; Poster; Other research output

    Abstract

    A raw stream of posts from a microblogging platform such as Twitter contains text written in a large variety of languages and writing systems, in registers ranging from formal to internet slang. A significant amount of effort has been expended in recent years to adapt standard NLP pipelines to deal with such content. In this paper we suggest a less labor-intensive approach to processing multilingual user-generated content.

    We induce low-dimensional distributed representations of text by training a recurrent neural network on the raw bytestream of a microblog feed. Such representations have recently been shown to be effective when used as learned features for sequence labeling tasks such as word, sentence and text segmentation (Chrupała 2013, Evang et al. 2013).
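
    As a rough illustration of what a byte-level recurrent model operates on, the sketch below encodes a post as UTF-8 bytes and runs a toy Elman-style recurrent network over them, producing one hidden-state vector per byte. The function names (`init_rnn`, `byte_embeddings`), the hidden size and the random, untrained weights are all assumptions made for illustration; the abstract does not specify the architecture or the training procedure.

```python
import math
import random


def init_rnn(hidden_size, seed=0):
    """Random, untrained weights for a toy Elman RNN over byte inputs."""
    rng = random.Random(seed)
    scale = 0.1
    # One input weight row per possible byte value (equivalent to a one-hot input).
    w_in = [[rng.uniform(-scale, scale) for _ in range(hidden_size)]
            for _ in range(256)]
    # Recurrent weights: hidden_size x hidden_size.
    w_rec = [[rng.uniform(-scale, scale) for _ in range(hidden_size)]
             for _ in range(hidden_size)]
    return w_in, w_rec


def byte_embeddings(text, w_in, w_rec):
    """Return one hidden-state vector per UTF-8 byte of `text`."""
    hidden_size = len(w_rec)
    h = [0.0] * hidden_size
    states = []
    for b in text.encode("utf-8"):
        # h_t = tanh(W_in[b] + W_rec^T h_{t-1})
        h = [
            math.tanh(w_in[b][j]
                      + sum(w_rec[i][j] * h[i] for i in range(hidden_size)))
            for j in range(hidden_size)
        ]
        states.append(h)
    return states


w_in, w_rec = init_rnn(hidden_size=8)
states = byte_embeddings("héllo", w_in, w_rec)
# "é" takes two bytes in UTF-8, so five characters yield six vectors.
print(len(states), len(states[0]))  # 6 8
```

    Because the input is bytes rather than characters or tokens, no language-specific tokenizer or script detection is needed, which is the sense in which such a pipeline can be language-agnostic.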

    In the current work we propose two new scenarios for using such representations. First, we employ them in a sequence transduction setting for tweet normalization. Second, we propose a simple way to build a distributed bag-of-words analog using byte-level text embeddings, and apply it in a hashtag recommendation model.
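
    The abstract does not say how the bag-of-words analog is constructed. One plausible reading, shown here purely as an assumption, is to pool per-byte vectors into a single fixed-size vector for the whole text and compare texts by cosine similarity. In the sketch below, `BYTE_TABLE` is a hypothetical stand-in for learned embeddings (a fixed random vector per byte value), and `text_vector`, `cosine` and `recommend` are illustrative names, not part of the described system.

```python
import math
import random

DIM = 16
_rng = random.Random(0)
# Hypothetical stand-in for learned byte-level embeddings: one fixed random
# vector per byte value. A trained model would supply context-dependent
# vectors (for example RNN hidden states) instead.
BYTE_TABLE = [[_rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(256)]


def text_vector(text):
    """Mean-pool per-byte vectors into one fixed-size vector for the text."""
    data = text.encode("utf-8")
    if not data:
        return [0.0] * DIM
    sums = [0.0] * DIM
    for b in data:
        for j, v in enumerate(BYTE_TABLE[b]):
            sums[j] += v
    return [s / len(data) for s in sums]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(tweet, tagged_examples, k=1):
    """Rank hashtags by similarity between the tweet and each tag's example text."""
    query = text_vector(tweet)
    scored = [(cosine(query, text_vector(example)), tag)
              for tag, example in tagged_examples.items()]
    return [tag for _, tag in sorted(scored, reverse=True)[:k]]


examples = {"#nlp": "parsing noisy tweets", "#football": "late goal wins the match"}
print(recommend("late goal wins the match", examples))
```

    With the context-free lookup table used here, pooling discards byte order entirely, much as a bag of words discards word order. With context-dependent vectors from a recurrent model, each vector would already encode some of the preceding bytes, so the pooled representation would not be strictly order-invariant.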

    References

    Kilian Evang, Valerio Basile, Grzegorz Chrupała, Johan Bos. 2013. Elephant: Sequence Labeling for Word and Sentence Segmentation. EMNLP.

    Grzegorz Chrupała. 2013. Text segmentation with character-level text embeddings. ICML Workshop on Deep Learning for Audio, Speech and Language Processing.

    Original language: English
    Publication status: Published - 2014
    Event: 24th Meeting of Computational Linguistics in The Netherlands (CLIN 2014) - Leiden, Netherlands
    Duration: 17 Jan 2014 → …



    Cite this

    Chrupala, G. (2014). Language-agnostic processing of microblog data with text embeddings. Poster session presented at 24th Meeting of Computational Linguistics in The Netherlands (CLIN 2014), Leiden, Netherlands.