A Comparison of Different NMT Approaches to Low-Resource Dutch-Albanian Machine Translation

Arbnor Rama, Eva Vanmassenhove

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

    Abstract

    Low-resource languages can be understood as languages that are scarcer, less studied, less privileged, less commonly taught, and for which fewer resources are available (Singh, 2008; Cieri et al., 2016; Magueresse et al., 2020). Natural Language Processing (NLP) research and technology mainly focus on languages for which large data sets are available. To illustrate the differences in data availability: there are 6 million Wikipedia articles for English, 2 million for Dutch, and merely 82 thousand for Albanian. The data scarcity issue becomes increasingly apparent when large parallel data sets are required for applications such as Neural Machine Translation (NMT). In this work, we investigate to what extent translation between Albanian (SQ) and Dutch (NL) is possible by comparing a direct one-to-one (SQ↔NL) model, a low-resource pivot-based approach (with English (EN) as pivot), and a zero-shot translation (ZST) (Johnson et al., 2016; Mattoni et al., 2017) system. Our experiments show that the EN-pivot model outperforms both the direct one-to-one model and the ZST model. Since small amounts of parallel data are often available for low-resource languages or settings, experiments were also conducted using small sets of parallel NL↔SQ data. The ZST system was the worst-performing model. Even when the available parallel data (NL↔SQ) was added, i.e. in a few-shot setting (FST), it remained the worst-performing system according to both the automatic (BLEU and TER) and the human evaluation.
    Original language: English
    Title of host publication: Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021)
    Subtitle of host publication: LoResMT2021
    Place of Publication: virtual
    Pages: 68-77
    Number of pages: 10
    Edition: 4
    Publication status: Published - Aug 2021
    Event: Workshop on Technologies for MT of Low Resource Languages - online
    Duration: 16 Aug 2021 → 16 Aug 2021
    Conference number: 4
    https://sites.google.com/view/loresmt/

    Workshop

    Workshop: Workshop on Technologies for MT of Low Resource Languages
    Abbreviated title: LoResMT2021
    Period: 16/08/21 → 16/08/21
