Abstract
Tokenization is widely regarded as a solved problem due to the high accuracy that rule-based tokenizers achieve. But rule-based tokenizers are hard to maintain, and their rules are language-specific. We show that high-accuracy word and sentence segmentation can be achieved by using supervised sequence labeling at the character level combined with unsupervised feature learning. We evaluated our method on three languages and obtained error rates of 0.27 ‰ (English), 0.35 ‰ (Dutch) and 0.76 ‰ (Italian) for our best models.
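To illustrate what segmentation as character-level sequence labeling can look like, the sketch below decodes one label per character into sentences and tokens. The abstract does not spell out the label set, so the four labels used here (`S` starts a sentence, `T` starts a token, `I` continues a token, `O` is outside any token) and the function name `decode` are assumptions for illustration. The labels are written by hand, where a trained sequence labeler would predict them.

```python
def decode(text, labels):
    """Turn one label per character into a list of sentences,
    each a list of token strings.

    Assumed (hypothetical) label set:
      S = first character of a sentence (and of its first token)
      T = first character of any other token
      I = character inside the current token
      O = character outside any token (e.g. whitespace)
    """
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")

    sentences = []  # completed and in-progress sentences
    token = []      # characters of the token being built

    def close_token():
        # Flush the token under construction into the current sentence.
        if token:
            sentences[-1].append("".join(token))
            token.clear()

    for ch, label in zip(text, labels):
        if label in ("S", "T"):
            close_token()
            # Open a new sentence on S, or if none has been opened yet.
            if label == "S" or not sentences:
                sentences.append([])
            token.append(ch)
        elif label == "I":
            if not sentences:
                sentences.append([])
            token.append(ch)
        elif label == "O":
            close_token()
        else:
            raise ValueError(f"unknown label: {label!r}")

    close_token()
    return sentences


text = "Hi there. Bye!"
labels = "SIOTIIIITOSIIT"  # hand-written, one label per character
print(decode(text, labels))
```

Because token and sentence boundaries are both read off the same label sequence, a single labeling pass yields both segmentations, and no language-specific splitting rules appear in the decoder itself.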
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing |
| Place of Publication | Seattle, Washington, USA |
| Publisher | Association for Computational Linguistics (ACL) |
| Pages | 1422-1426 |
| ISBN (Electronic) | 978-1-937284-97-8 |
| Publication status | Published - 2013 |
| Event | EMNLP 2013: Conference on Empirical Methods in Natural Language Processing - Seattle, United States. Duration: 18 Oct 2013 → 21 Oct 2013 |
Conference
| Conference | EMNLP 2013: Conference on Empirical Methods in Natural Language Processing |
|---|---|
| Country/Territory | United States |
| City | Seattle |
| Period | 18/10/13 → 21/10/13 |
Elephant: Sequence labeling for word and sentence segmentation
Evang, K., Basile, V., Chrupala, G. & Bos, J., 2013. Research output: Online publication or Non-textual form › Software