Elephant: Sequence Labeling for Word and Sentence Segmentation

Kilian Evang, Valerio Basile, Grzegorz Chrupala, Johan Bos

    Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; Scientific; peer-review

    Abstract

    Tokenization is widely regarded as a solved problem due to the high accuracy that rule-based tokenizers achieve. But rule-based tokenizers are hard to maintain and their rules are language-specific. We show that high-accuracy word and sentence segmentation can be achieved by using supervised sequence labeling on the character level combined with unsupervised feature learning. We evaluated our method on three languages and obtained error rates of 0.27 ‰ (English), 0.35 ‰ (Dutch) and 0.76 ‰ (Italian) for our best models.
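
To make the idea of character-level sequence labeling concrete, the following is a minimal sketch of the decoding side only: given one label per character, it recovers sentences and tokens. The four-letter tag set (S, T, I, O), the function names `decode` and `per_mille_error`, and the assumption that the error rate is counted per labeled character are illustrative choices made here, not details stated in the abstract. The supervised labeler and the unsupervised features are not shown.

```python
from typing import List

# Hypothetical tag set, assumed for illustration only:
#   S = first character of a token that also starts a sentence
#   T = first character of any other token
#   I = character inside a token
#   O = character outside any token (for example whitespace)


def decode(text: str, labels: str) -> List[List[str]]:
    """Turn one label per character into sentences, each a list of tokens."""
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")
    sentences: List[List[str]] = []
    current = ""  # token currently being built
    for ch, lab in zip(text, labels):
        if lab not in "STIO":
            raise ValueError(f"unknown label: {lab!r}")
        # An I with no open token is treated as a token start.
        starts_token = lab in "ST" or (lab == "I" and not current)
        if lab == "O" or starts_token:
            if current:  # close the token that was open
                sentences[-1].append(current)
                current = ""
        if lab == "O":
            continue
        if lab == "S" or not sentences:
            sentences.append([])  # open a new sentence
        current += ch
    if current:
        sentences[-1].append(current)
    return sentences


def per_mille_error(gold: str, predicted: str) -> float:
    """Wrongly labeled characters per thousand (assumed error definition)."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("label sequences must be non-empty and equally long")
    wrong = sum(g != p for g, p in zip(gold, predicted))
    return 1000.0 * wrong / len(gold)


text = "It didn't matter. Fine."
labels = "SI" + "O" + "TII" + "TII" + "O" + "TIIIII" + "T" + "O" + "SIII" + "T"
print(decode(text, labels))
```

Because token boundaries are carried by labels rather than by whitespace, a contraction such as "didn't" can be split into "did" and "n't" without a language-specific rule; under this sketch, changing the behavior means changing the labels in the training data.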
    Original language: English
    Title of host publication: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing
    Place of Publication: Seattle, Washington, USA
    Publisher: Association for Computational Linguistics (ACL)
    Pages: 1422-1426
    ISBN (Electronic): 978-1-937284-97-8
    Publication status: Published - 2013
    Event: EMNLP 2013: Conference on Empirical Methods in Natural Language Processing - Seattle, United States
    Duration: 18 Oct 2013 to 21 Oct 2013

    Conference

    Conference: EMNLP 2013: Conference on Empirical Methods in Natural Language Processing
    Country: United States
    City: Seattle
    Period: 18/10/13 to 21/10/13


    Cite this

    Evang, K., Basile, V., Chrupala, G., & Bos, J. (2013). Elephant: Sequence Labeling for Word and Sentence Segmentation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 1422-1426). Seattle, Washington, USA: Association for Computational Linguistics (ACL).
    @inproceedings{20178076b1fe4ed497f245a2ffcc97de,
    title = "Elephant: Sequence Labeling for Word and Sentence Segmentation",
    abstract = "Tokenization is widely regarded as a solved problem due to the high accuracy that rule-based tokenizers achieve. But rule-based tokenizers are hard to maintain and their rules language specific. We show that high-accuracy word and sentence segmentation can be achieved by using supervised sequence labeling on the character level combined with unsupervised feature learning. We evaluated our method on three languages and obtained error rates of 0.27 ‰ (English), 0.35 ‰ (Dutch) and 0.76 ‰ (Italian) for our best models.",
    author = "Kilian Evang and Valerio Basile and Grzegorz Chrupala and Johan Bos",
    year = "2013",
    language = "English",
    pages = "1422--1426",
    booktitle = "Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing",
    publisher = "Association for Computational Linguistics (ACL)",

    }
