Elephant

Sequence labeling for word and sentence segmentation

Kilian Evang, Valerio Basile, Grzegorz Chrupala, Johan Bos

    Research output: Non-textual form › Software › Other research output

    Abstract

    Tokenization is widely regarded as a solved problem due to the high accuracy that rule-based tokenizers achieve. But rule-based tokenizers are hard to maintain, and their rules are language-specific.

    Like an elephant in the living room, it is a problem that is impossible to overlook whenever new raw datasets need to be processed or when tokenization conventions are reconsidered. It is moreover an important problem, because any errors occurring early in the pipeline affect further analysis negatively.

    We believe that regarding tokenization, there is still room for improvement, in particular on the methodological side of the task. We are particularly interested in the following questions: Can we use supervised learning to avoid hand-crafting rules? Can we use unsupervised feature learning to reduce feature engineering effort and boost performance? Can we use the same method across languages? Can we combine word and sentence boundary detection into one task?
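
To make the last question concrete, here is a minimal sketch of how word and sentence boundary detection can be framed as a single per-character labeling task. The label set used here (S, T, I, O) and the function name `decode` are assumptions for illustration; the abstract does not specify them, and a trained model that predicts the labels is left out entirely.

```python
from typing import List


def decode(text: str, labels: str) -> List[List[str]]:
    """Turn one label per character into a list of sentences, each a list of tokens.

    Assumed (hypothetical) label scheme:
      S = first character of a sentence (also starts a token)
      T = first character of any other token
      I = character inside a token
      O = character outside any token (e.g. whitespace)
    """
    if len(text) != len(labels):
        raise ValueError("text and labels must have the same length")

    sentences: List[List[str]] = []
    current: List[str] = []  # tokens of the sentence being built
    token = ""               # characters of the token being built

    for ch, lab in zip(text, labels):
        if lab in ("S", "T", "O"):
            # Any of these labels ends the token in progress.
            if token:
                current.append(token)
                token = ""
            # S additionally ends the sentence in progress.
            if lab == "S" and current:
                sentences.append(current)
                current = []
            # S and T start a new token; O contributes nothing.
            if lab != "O":
                token = ch
        elif lab == "I":
            token += ch
        else:
            raise ValueError(f"unknown label: {lab!r}")

    # Flush whatever is left at the end of the input.
    if token:
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    print(decode("It didn't. Ok.", "SIOTIITIITOSIT"))
```

Because each character carries its own label, token boundaries do not have to coincide with whitespace (here "didn't" is split into "did" and "n't"), and sentence boundaries come out of the same pass as token boundaries.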
    Original language: English
    Media of output: Online
    Publication status: Published - 2013

    Fingerprint

    Supervised learning
    Labeling
    Pipelines

    Cite this

    Evang, K. (Author), Basile, V. (Author), Chrupala, G. (Author), & Bos, J. (Author). (2013). Elephant: Sequence labeling for word and sentence segmentation. Software, Retrieved from http://gmb.let.rug.nl/elephant/about.php