A Comparison of Different NMT Approaches to Low-Resource Dutch-Albanian Machine Translation

Arbnor Rama, Eva Vanmassenhove

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

Abstract

Low-resource languages can be understood as languages that are more scarce, less studied, less privileged, less commonly taught and for which fewer resources are available (Singh, 2008; Cieri et al., 2016; Magueresse et al., 2020). Natural Language Processing (NLP) research and technology mainly focus on languages for which large data sets are available. To illustrate the differences in data availability: there are 6 million Wikipedia articles available for English, 2 million for Dutch, and merely 82 thousand for Albanian. The data scarcity issue becomes increasingly apparent when large parallel data sets are required for applications such as Neural Machine Translation (NMT). In this work, we investigate to what extent translation between Albanian (SQ) and Dutch (NL) is possible by comparing a one-to-one (SQ↔NL) model, a low-resource pivot-based approach (with English (EN) as pivot) and a zero-shot translation (ZST) (Johnson et al., 2016; Mattoni et al., 2017) system. Our experiments show that the EN-pivot model outperforms both the direct one-to-one model and the ZST model. Since small amounts of parallel data are often available for low-resource languages or settings, experiments were conducted using small sets of parallel NL↔SQ data. The ZST model appeared to be the worst-performing model. Even when the available parallel data (NL↔SQ) was added, i.e. in a few-shot setting (FST), it remained the worst-performing system according to both the automatic (BLEU and TER) and human evaluation.
Original language: English
Title of host publication: Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021)
Subtitle of host publication: LoResMT2021
Place of Publication: virtual
Pages: 68-77
Number of pages: 10
Edition: 4
Publication status: Published - Aug 2021
Event: Workshop on Technologies for MT of Low Resource Languages - online
Duration: 16 Aug 2021 → 16 Aug 2021
Conference number: 4
https://sites.google.com/view/loresmt/

Workshop

Workshop: Workshop on Technologies for MT of Low Resource Languages
Abbreviated title: LoResMT2021
Period: 16/08/21 → 16/08/21
