A Novel Pipeline for Domain Detection and Selecting In-domain Sentences in Machine Translation Systems

Research output: Contribution to conference › Poster › Scientific › peer-review

Abstract

General-domain corpora are becoming increasingly available for Machine Translation (MT) systems. However, high translation quality in domain-specific MT is achieved by using corpora that cover the same or a comparable domain. Domain-specific corpora are often scarce and cannot be used in isolation to train domain-specific MT systems effectively.

This work aims to improve in-domain MT through (i) a novel unsupervised pipeline for identifying the distributions of different domains within a corpus and (ii) a data selection technique that leverages in-domain monolingual or parallel data to select domain-specific sentences from general corpora according to the distribution defined in (i). To do so, either a list of domain-specific keywords or an external lexical resource is fed into the pipeline to identify similar input data within the general domain. Furthermore, the proposed pipeline can determine the target domain of any corpus, so MT practitioners can prepare their training data according to the target domain demanded by customers or industry. The approach is useful not only for identifying the frequent words in a corpus for Domain Adaptation (DA) tasks but also for giving MT practitioners insight into their data (an informative feature).

The main idea of this work is related to Topic Modeling (TM): a sentence is a distribution over hidden topics, and a topic is a distribution over words. Similar sentences are therefore likely to share individual words, so we can select in-domain sentences whose top-n words match the top words of the general corpora. Our pipeline comprises several modules: TM, sentence embedding, dimensionality reduction, clustering, domain detection, post-processing, and a matching function. To test the effectiveness of our approach, the proposed method is applied to an English/French corpus and is fitted and evaluated in a DA setting that aims to address the lack of in-domain data. Our empirical evaluation shows that more training data is not always better and that the best results are attainable through proper domain-relevant data selection.
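
The top-word matching idea can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the TM, sentence embedding, dimensionality reduction, and clustering modules are replaced here by plain word-frequency counts, purely as a stand-in. The names `top_words`, `select_in_domain`, and `min_overlap`, the stopword list, and the toy sentences are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption of this sketch).
STOPWORDS = {"a", "an", "and", "at", "each", "every", "for", "in", "is",
             "of", "on", "the", "to", "was", "were"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def top_words(sentences, n):
    """Return the n most frequent content words of a corpus as a set.

    Stand-in for the topic-derived top words described in the abstract.
    """
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return {word for word, _ in counts.most_common(n)}


def select_in_domain(general_corpus, domain_top, min_overlap=2):
    """Keep general-corpus sentences sharing at least min_overlap top words."""
    selected = []
    for sentence in general_corpus:
        if len(set(tokenize(sentence)) & domain_top) >= min_overlap:
            selected.append(sentence)
    return selected


# Toy in-domain (medical) data and a mixed general corpus, invented for the example.
in_domain = [
    "The patient received a dose of the vaccine.",
    "The vaccine dose was given to each patient.",
    "Side effects of the vaccine were mild in every patient.",
]
general = [
    "The parliament voted on the budget.",
    "Each patient needs a second dose.",
    "The match ended in a draw.",
    "A new vaccine was approved for every adult patient.",
    "The dose makes the poison.",
]

domain_top = top_words(in_domain, 3)
selected = select_in_domain(general, domain_top, min_overlap=2)
print(selected)
```

The `min_overlap` threshold controls how strict the selection is: raising it keeps fewer, more clearly in-domain sentences, which mirrors the abstract's point that more training data is not always better.
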
Original language: English
Publication status: Published - 9 Jul 2021
Event: The 31st Meeting of Computational Linguistics in The Netherlands (CLIN 31) - Ghent University, Ghent, Belgium
Duration: 9 Jul 2021 → 9 Jul 2021
https://www.clin31.ugent.be/

Conference

Conference: The 31st Meeting of Computational Linguistics in The Netherlands (CLIN 31)
Country: Belgium
City: Ghent
Period: 9/07/21 → 9/07/21

Keywords

  • Machine Translation
  • Domain Adaptation
  • Domain Detection Pipeline
  • In-domain Generation
