Normalizing tweets with edit scripts and recurrent neural embeddings

    Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; Scientific; peer-review

    Abstract

    Tweets often contain a large proportion of abbreviations, alternative spellings, novel words and other non-canonical language. These features are problematic for standard language analysis tools and it can be desirable to convert them to canonical form. We propose a novel text normalization model based on learning edit operations from labeled data while incorporating features induced from unlabeled data via character-level neural text embeddings. The text embeddings are generated using a Simple Recurrent Network. We find that enriching the feature set with text embeddings substantially lowers word error rates on an English tweet normalization dataset. Our model improves on the state of the art with little training data and without any lexical resources.
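
    To make the idea of character-level edit operations concrete, here is a minimal sketch that derives one edit label per input character from a (noisy, canonical) pair and then replays those labels. It uses Python's standard difflib for the alignment. The label names (KEEP, DEL, END) and the functions edit_script and apply_script are assumptions made for this illustration; they are not the paper's actual label set, alignment procedure, or model.

```python
from difflib import SequenceMatcher


def edit_script(noisy, canonical):
    """Return one (action, inserted_text) label per character of `noisy`,
    plus one trailing slot for text inserted at the very end.

    Hypothetical labelling scheme for illustration only:
      action is "KEEP" or "DEL" for real characters and "END" for the last slot;
      inserted_text is emitted *before* the character at that position.
    """
    script = [["KEEP", ""] for _ in range(len(noisy) + 1)]
    script[-1][0] = "END"
    matcher = SequenceMatcher(None, noisy, canonical, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            # These input characters do not survive in the canonical form.
            for i in range(i1, i2):
                script[i][0] = "DEL"
        if tag in ("insert", "replace"):
            # Canonical characters with no counterpart are attached to position i1.
            script[i1][1] += canonical[j1:j2]
    return [tuple(label) for label in script]


def apply_script(noisy, script):
    """Replay an edit script on `noisy` to rebuild the canonical string."""
    out = []
    # "\0" is a sentinel paired with the trailing END slot; it is never emitted.
    for ch, (action, inserted) in zip(noisy + "\0", script):
        out.append(inserted)
        if action == "KEEP":
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    for noisy, canonical in [("2moro", "tomorrow"), ("u", "you"), ("cuz", "because")]:
        labels = edit_script(noisy, canonical)
        print(noisy, "->", apply_script(noisy, labels), labels)
```

    Framing normalization this way turns it into a sequence labelling problem: a classifier would predict one label per input character, and features such as character embeddings could be attached to each position. Replaying the labels always reproduces the canonical string, whatever alignment difflib picks.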
    Original language: English
    Title of host publication: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics
    Editors: Kristina Toutanova, Hua Wu
    Place of Publication: Baltimore, Maryland
    Publisher: Association for Computational Linguistics (ACL)
    Pages: 680-686
    Volume: 2
    Edition: Short Papers
    ISBN (Electronic): 978-1-937284-73-2
    Publication status: Published - 2014
    Event: The 52nd Annual Meeting of the Association for Computational Linguistics - Baltimore, United States
    Duration: 22 Jun 2014 to 27 Jun 2014

    Conference

    Conference: The 52nd Annual Meeting of the Association for Computational Linguistics
    Country: United States
    City: Baltimore
    Period: 22/06/14 to 27/06/14

    Cite this

    Chrupala, G. (2014). Normalizing tweets with edit scripts and recurrent neural embeddings. In K. Toutanova, & H. Wu (Eds.), Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Short Papers ed., Vol. 2, pp. 680-686). Baltimore, Maryland: Association for Computational Linguistics (ACL).
    Chrupala, Grzegorz. / Normalizing tweets with edit scripts and recurrent neural embeddings. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. editor / Kristina Toutanova ; Hua Wu. Vol. 2 Short Papers. ed. Baltimore, Maryland : Association for Computational Linguistics (ACL), 2014. pp. 680-686
    @inproceedings{7b71d1303ab34a8581678740e13a5aec,
    title = "Normalizing tweets with edit scripts and recurrent neural embeddings",
    abstract = "Tweets often contain a large proportion of abbreviations, alternative spellings, novel words and other non-canonical language. These features are problematic for standard language analysis tools and it can be desirable to convert them to canonical form. We propose a novel text normalization model based on learning edit operations from labeled data while incorporating features induced from unlabeled data via character-level neural text embeddings. The text embeddings are generated using a Simple Recurrent Network. We find that enriching the feature set with text embeddings substantially lowers word error rates on an English tweet normalization dataset. Our model improves on the state of the art with little training data and without any lexical resources.",
    author = "Grzegorz Chrupala",
    year = "2014",
    language = "English",
    volume = "2",
    pages = "680--686",
    editor = "Kristina Toutanova and Hua Wu",
    booktitle = "Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics",
    publisher = "Association for Computational Linguistics (ACL)",
    edition = "Short Papers",

    }

    Chrupala, G 2014, Normalizing tweets with edit scripts and recurrent neural embeddings. in K Toutanova & H Wu (eds), Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Short Papers edn, vol. 2, Association for Computational Linguistics (ACL), Baltimore, Maryland, pp. 680-686, The 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, United States, 22/06/14.

    Normalizing tweets with edit scripts and recurrent neural embeddings. / Chrupala, Grzegorz.

    Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. ed. / Kristina Toutanova; Hua Wu. Vol. 2 Short Papers. ed. Baltimore, Maryland : Association for Computational Linguistics (ACL), 2014. p. 680-686.

    TY - GEN

    T1 - Normalizing tweets with edit scripts and recurrent neural embeddings

    AU - Chrupala, Grzegorz

    PY - 2014

    Y1 - 2014

    AB - Tweets often contain a large proportion of abbreviations, alternative spellings, novel words and other non-canonical language. These features are problematic for standard language analysis tools and it can be desirable to convert them to canonical form. We propose a novel text normalization model based on learning edit operations from labeled data while incorporating features induced from unlabeled data via character-level neural text embeddings. The text embeddings are generated using a Simple Recurrent Network. We find that enriching the feature set with text embeddings substantially lowers word error rates on an English tweet normalization dataset. Our model improves on the state of the art with little training data and without any lexical resources.

    M3 - Conference contribution

    VL - 2

    SP - 680

    EP - 686

    BT - Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics

    A2 - Toutanova, Kristina

    A2 - Wu, Hua

    PB - Association for Computational Linguistics (ACL)

    CY - Baltimore, Maryland

    ER -

    Chrupala G. Normalizing tweets with edit scripts and recurrent neural embeddings. In Toutanova K, Wu H, editors, Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Short Papers ed. Vol. 2. Baltimore, Maryland: Association for Computational Linguistics (ACL). 2014. p. 680-686