Elephant: Sequence labeling for word and sentence segmentation

Kilian Evang, Valerio Basile, Grzegorz Chrupala, Johan Bos

    Research output: Online publication or Non-textual form > Software > Other research output


    Tokenization is widely regarded as a solved problem due to the high accuracy that rule-based tokenizers achieve. But rule-based tokenizers are hard to maintain, and their rules are language-specific.

    Like an elephant in the living room, tokenization is a problem that is impossible to overlook whenever new raw datasets need to be processed or tokenization conventions are reconsidered. It is, moreover, an important problem, because any errors occurring early in the pipeline negatively affect further analysis.

    We believe that there is still room for improvement in tokenization, in particular on the methodological side of the task. We are particularly interested in the following questions: Can we use supervised learning to avoid hand-crafting rules? Can we use unsupervised feature learning to reduce feature engineering effort and boost performance? Can we use the same method across languages? Can we combine word and sentence boundary detection into one task?
    Original language: English
    Media of output: Online
    Publication status: Published - 2013

