TY - GEN
T1 - Improving KantanMT training efficiency with fast align
AU - Shterionov, Dimitar
AU - Du, Jinhua
AU - Palminteri, Marc Anthony
AU - Casanellas, Laura
AU - O'Dowd, Tony
AU - Way, Andy
PY - 2016
Y1 - 2016
N2 - In recent years, statistical machine translation (SMT) has been widely deployed in translators' workflows, with significant improvements in productivity. However, before an SMT system can be invoked to translate an unknown text, an SMT engine needs to be built. As such, the building speed of the engine is essential for the translation workflow, i.e., the sooner an engine is built, the sooner it can be exploited. With the increase in the computational capabilities of recent technology, the building time for an SMT engine has decreased substantially. For example, cloud-based SMT providers, such as KantanMT, can build high-quality, ready-to-use, custom SMT engines in less than a couple of days. To speed up this process further, we look into optimizing the word alignment step that takes place while building the SMT engine. Namely, we substitute the word alignment tool used by the KantanMT pipeline, Giza++, with a more efficient one, fast_align. In this work we present the design and implementation of the KantanMT pipeline that uses fast_align in place of Giza++. We also compare the two word alignment tools on industry data and report our findings. To the best of our knowledge, such an extensive empirical evaluation of the two tools has not been done before.
AB - In recent years, statistical machine translation (SMT) has been widely deployed in translators' workflows, with significant improvements in productivity. However, before an SMT system can be invoked to translate an unknown text, an SMT engine needs to be built. As such, the building speed of the engine is essential for the translation workflow, i.e., the sooner an engine is built, the sooner it can be exploited. With the increase in the computational capabilities of recent technology, the building time for an SMT engine has decreased substantially. For example, cloud-based SMT providers, such as KantanMT, can build high-quality, ready-to-use, custom SMT engines in less than a couple of days. To speed up this process further, we look into optimizing the word alignment step that takes place while building the SMT engine. Namely, we substitute the word alignment tool used by the KantanMT pipeline, Giza++, with a more efficient one, fast_align. In this work we present the design and implementation of the KantanMT pipeline that uses fast_align in place of Giza++. We also compare the two word alignment tools on industry data and report our findings. To the best of our knowledge, such an extensive empirical evaluation of the two tools has not been done before.
UR - http://www.scopus.com/inward/record.url?scp=85033708858&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85033708858
T3 - Proceedings - AMTA 2016: 12th Conference of the Association for Machine Translation in the Americas
SP - 222
EP - 231
BT - MT Users' Track
A2 - Beregovaya, Olga
A2 - Doyon, Jennifer
A2 - Langlois, Lucie
A2 - Richardson, Steve
PB - Association for Machine Translation in the Americas
T2 - 12th Conference of the Association for Machine Translation in the Americas, AMTA 2016
Y2 - 28 October 2016 through 1 November 2016
ER -