BigNLI: Native Language Identification with Big Bird Embeddings

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

    Abstract

    Native Language Identification (NLI) intends to classify an author's native language based on their writing in another language. Historically, the task has heavily relied on time-consuming linguistic feature engineering, and NLI transformer models have thus far failed to offer effective, practical alternatives. The current work shows input size is a limiting factor, and that classifiers trained using Big Bird embeddings outperform linguistic feature engineering models (for which we reproduce previous work) by a large margin on the Reddit-L2 dataset. Additionally, we provide further insight into input length dependencies, show consistent out-of-sample (Europe subreddit) and out-of-domain (TOEFL-11) performance, and qualitatively analyze the embedding space. Given the effectiveness and computational efficiency of this method, we believe it offers a promising avenue for future NLI work.
    Original language: English
    Title of host publication: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation
    Subtitle of host publication: LREC-COLING 2024
    Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
    Pages: 2375–2382
    Number of pages: 8
    Publication status: Published - 20 May 2024
    Event: LREC-COLING 2024: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation - Torino, Italy
    Duration: 20 May 2024 – 25 May 2024
    https://lrec-coling-2024.org/

    Conference

    Conference: LREC-COLING 2024
    Country/Territory: Italy
    City: Torino
    Period: 20/05/24 – 25/05/24

    Keywords

    • native language identification
    • transformer embeddings
    • stylometry
    • text classification
    • natural language processing
    • computational linguistics
