TY - GEN
T1 - Investigating backtranslation in neural machine translation
AU - Poncelas, Alberto
AU - Shterionov, Dimitar
AU - Way, Andy
AU - Maillette de Buy Wenniger, Gideon
AU - Passban, Peyman
PY - 2018
Y1 - 2018
N2 - A prerequisite for training corpus-based machine translation (MT) systems – either Statistical MT (SMT) or Neural MT (NMT) – is the availability of high-quality parallel data. This is arguably more important today than ever before, as NMT has been shown in many studies to outperform SMT, but mostly when large parallel corpora are available; in cases where data is limited, SMT can still outperform NMT. Recently, researchers have shown that back-translating monolingual data can be used to create synthetic parallel corpora, which in turn can be used in combination with authentic parallel data to train a high-quality NMT system. Given that large collections of new parallel text become available only quite rarely, back-translation has become the norm when building state-of-the-art NMT systems, especially in resource-poor scenarios. However, we assert that there are many unknown factors regarding the actual effects of back-translated data on the translation capabilities of an NMT model. Accordingly, in this work we investigate how using back-translated data as a training corpus – both as a separate standalone dataset as well as combined with human-generated parallel data – affects the performance of an NMT model. We use incrementally larger amounts of back-translated data to train a range of NMT systems for German-to-English, and analyse the resulting translation performance.
UR - http://www.scopus.com/inward/record.url?scp=85072870719&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85072870719
T3 - EAMT 2018 - Proceedings of the 21st Annual Conference of the European Association for Machine Translation
SP - 249
EP - 258
BT - EAMT 2018 - Proceedings of the 21st Annual Conference of the European Association for Machine Translation
A2 - Pérez-Ortiz, Juan Antonio
A2 - Sánchez-Martínez, Felipe
A2 - Esplà-Gomis, Miquel
A2 - Popović, Maja
A2 - Rico, Celia
A2 - Martins, André
A2 - Van den Bogaert, Joachim
A2 - Forcada, Mikel L.
PB - European Association for Machine Translation
T2 - 21st Annual Conference of the European Association for Machine Translation, EAMT 2018
Y2 - 28 May 2018 through 30 May 2018
ER -