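Abstract
Tweets often contain a large proportion of abbreviations, alternative spellings, novel words and other non-canonical language. These features are problematic for standard language analysis tools and it can be desirable to convert them to canonical form. We propose a novel text normalization model based on learning edit operations from labeled data while incorporating features induced from unlabeled data via character-level neural text embeddings. The text embeddings are generated using a Simple Recurrent Network. We find that enriching the feature set with text embeddings substantially lowers word error rates on an English tweet normalization dataset. Our model improves on state-of-the-art with little training data and without any lexical resources.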
Original language | English |
---|---|
Title of host publication | Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics |
Editors | Kristina Toutanova, Hua Wu |
Place of Publication | Baltimore, Maryland |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 680-686 |
Volume | 2 |
Edition | Short Papers |
ISBN (Electronic) | 978-1-937284-73-2 |
Publication status | Published - 2014 |
Event | The 52nd Annual Meeting of the Association for Computational Linguistics - Baltimore, United States. Duration: 22 Jun 2014 → 27 Jun 2014 |
Conference
Conference | The 52nd Annual Meeting of the Association for Computational Linguistics |
---|---|
Country | United States |
City | Baltimore |
Period | 22/06/14 → 27/06/14 |
Cite this
Normalizing tweets with edit scripts and recurrent neural embeddings. / Chrupala, Grzegorz.
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. ed. / Kristina Toutanova; Hua Wu. Vol. 2 Short Papers. ed. Baltimore, Maryland : Association for Computational Linguistics (ACL), 2014. p. 680-686.
Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review
TY - GEN
T1 - Normalizing tweets with edit scripts and recurrent neural embeddings
AU - Chrupala, Grzegorz
PY - 2014
Y1 - 2014
N2 - Tweets often contain a large proportion of abbreviations, alternative spellings, novel words and other non-canonical language. These features are problematic for standard language analysis tools and it can be desirable to convert them to canonical form. We propose a novel text normalization model based on learning edit operations from labeled data while incorporating features induced from unlabeled data via character-level neural text embeddings. The text embeddings are generated using a Simple Recurrent Network. We find that enriching the feature set with text embeddings substantially lowers word error rates on an English tweet normalization dataset. Our model improves on state-of-the-art with little training data and without any lexical resources.
M3 - Conference contribution
VL - 2
SP - 680
EP - 686
BT - Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics
A2 - Toutanova, Kristina
A2 - Wu, Hua
PB - Association for Computational Linguistics (ACL)
CY - Baltimore, Maryland
ER -
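
The abstract describes normalization as learning edit operations from labeled data. As a rough illustration of what a character-level edit script could look like, the sketch below derives one from a (noisy, canonical) word pair and then replays it. This is an assumption-laden sketch, not the paper's method: the record above does not specify the alignment procedure or the inventory of edit operations, so Python's `difflib.SequenceMatcher` is swapped in as a stand-in aligner, and the function names `edit_script` and `apply_script` and the `(insert_before, delete)` label format are hypothetical.

```python
from difflib import SequenceMatcher


def edit_script(source: str, target: str) -> list[tuple[str, bool]]:
    """Derive one edit label per source position (hypothetical label format).

    Each label is (insert_before, delete): a string to insert before the
    character at that position, and whether that character is deleted.
    One extra trailing label holds insertions at the end of the word.
    """
    ops = [["", False] for _ in range(len(source) + 1)]
    # difflib is used here only as a convenient aligner (an assumption).
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            ops[i1][0] += target[j1:j2]
        if tag in ("replace", "delete"):
            for i in range(i1, i2):
                ops[i][1] = True
    return [(ins, delete) for ins, delete in ops]


def apply_script(source: str, script: list[tuple[str, bool]]) -> str:
    """Replay an edit script on the source string to produce the target."""
    out = []
    for i, (ins, delete) in enumerate(script):
        out.append(ins)
        if i < len(source) and not delete:
            out.append(source[i])
    return "".join(out)


if __name__ == "__main__":
    noisy, canonical = "2moro", "tomorrow"
    script = edit_script(noisy, canonical)
    for ch, label in zip(noisy + "$", script):  # "$" marks the end slot
        print(repr(ch), label)
    print(apply_script(noisy, script))
```

In a setup like the one the abstract outlines, labels of this kind would be the prediction targets learned from labeled pairs, with features such as the character-level embeddings feeding the predictor; that learning step is not shown here.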