BigNLI: Native Language Identification with Big Bird Embeddings

Sergey Kramp, Giovanni Cassani, Chris Emmery*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contributionScientificpeer-review

3 Downloads (Pure)

Abstract

Native Language Identification (NLI) intends to classify an author's native language based on their writing in another language. Historically, the task has heavily relied on time-consuming linguistic feature engineering, and NLI transformer models have thus far failed to offer effective, practical alternatives. The current work shows input size is a limiting factor, and that classifiers trained using Big Bird embeddings outperform linguistic feature engineering models (for which we reproduce previous work) by a large margin on the Reddit-L2 dataset. Additionally, we provide further insight into input length dependencies, show consistent out-of-sample (Europe subreddit) and out-of-domain (TOEFL-11) performance, and qualitatively analyze the embedding space. Given the effectiveness and computational efficiency of this method, we believe it offers a promising avenue for future NLI work.
Original languageEnglish
Title of host publicationProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Subtitle of host publicationLREC-COLING 2024
EditorsNicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Pages2375–2382
Number of pages8
Publication statusPublished - 20 May 2024
EventLREC-COLING 2024: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation - Torino, Italy
Duration: 20 May 202425 May 2024
https://lrec-coling-2024.org/

Conference

ConferenceLREC-COLING 2024
Country/TerritoryItaly
CityTorino
Period20/05/2425/05/24
Internet address

Keywords

  • native language identification
  • transformer embeddings
  • stylometry
  • text classification
  • natural language processing
  • computational linguistics

Fingerprint

Dive into the research topics of 'BigNLI: Native Language Identification with Big Bird Embeddings'. Together they form a unique fingerprint.

Cite this