Human versus automatic quality evaluation of NMT and PBSMT

Dimitar Shterionov*, Riccardo Superbo, Pat Nagle, Laura Casanellas, Tony O’Dowd, Andy Way

*Corresponding author for this work

Research output: Contribution to journal › Article › Scientific › peer-review

43 Citations (Scopus)


Neural machine translation (NMT) has recently gained substantial popularity not only in academia, but also in industry. For its acceptance in industry, it is important to investigate how NMT performs in comparison to the phrase-based statistical MT (PBSMT) model, which until recently was the dominant MT paradigm. In the present work, we compare the quality of the PBSMT and NMT solutions of KantanMT, a commercial platform for custom MT, that are tailored to accommodate large-scale translation production, where there is a limited amount of time to train an end-to-end system (NMT or PBSMT). In order to satisfy the time requirements of our production line, we restrict the NMT training time to 4 days; training a PBSMT system typically requires no longer than one day with the current training pipeline of KantanMT. To train both NMT and PBSMT engines for each language pair, we strictly use the same parallel corpora and the same pre- and post-processing steps (when applicable). Our results show that, even with time-restricted training of 4 days, NMT quality substantially surpasses that of PBSMT. Furthermore, we challenge the reliability of automatic quality evaluation metrics based on n-gram comparison (in particular F-measure, BLEU and TER) for NMT quality evaluation. We support our hypothesis with both analytical and empirical evidence. We investigate how suitable these metrics are when comparing the two different paradigms.
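To make concrete what "metrics based on n-gram comparison" measure, the following is a minimal sketch of a clipped n-gram precision, recall and F-measure against a single reference. It is an illustration only, not the evaluation code used in the paper: the function names `ngrams` and `ngram_f_measure`, the whitespace tokenisation and the example sentences are all assumptions made here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_f_measure(hypothesis, reference, n=1):
    """Clipped n-gram precision, recall and F1 against one reference.

    Hypothetical helper for illustration; tokenisation is naive
    lowercasing plus whitespace splitting.
    """
    hyp = ngrams(hypothesis.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Multiset intersection counts each n-gram at most as often as it
    # occurs in the reference ("clipping").
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example sentences (not taken from the paper).
reference = "the meeting was postponed until next week"
literal = "the meeting was postponed until next month"       # one wrong word
paraphrase = "they delayed the meeting to the following week"  # reworded

for name, hyp in [("literal", literal), ("paraphrase", paraphrase)]:
    p, r, f = ngram_f_measure(hyp, reference, n=1)
    print(f"{name}: P={p:.2f} R={r:.2f} F1={f:.2f}")
```

In this toy case the reworded sentence scores much lower than the near-copy that gets the date wrong, which is the kind of surface-overlap sensitivity that motivates checking such metrics against human judgement.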

Original language: English
Pages (from-to): 217-235
Number of pages: 19
Journal: Machine Translation
Issue number: 3
Publication status: Published - 1 Sept 2018
Externally published: Yes


  • A/B testing
  • BLEU
  • Evaluation metrics
  • F-measure
  • F-score
  • Human evaluation
  • Neural machine translation
  • NMT
  • Phrase-based statistical machine translation
  • Productivity
  • Quality comparison
  • Quality evaluation
  • Ranking
  • SMT
  • TER

