A Python Tool for Selecting Domain-Specific Data in Machine Translation

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

    Abstract

    As the volume of data for Machine Translation (MT) grows, the need for models that perform well in specific use cases, such as patent and medical translation, becomes increasingly important. Generic models often perform poorly in such cases because they fail to handle domain-specific style and terminology. Only datasets that cover domains similar to the target domain can effectively lead to high translation quality when used to train MT systems for a domain-specific use case (Wang et al., 2017; Pourmostafa Roshan Sharami et al., 2021; Pourmostafa Roshan Sharami et al., 2022). This highlights a limitation of data-driven MT trained on general-domain data, regardless of dataset size.

    To address this challenge, researchers have implemented various strategies to improve domain-specific translation using Domain Adaptation (DA) methods (Saunders, 2022; Sharami et al., 2023). The DA process involves first training a generic model and then fine-tuning it on a domain-specific dataset (Chu and Wang, 2018). One approach to generating a domain-specific dataset is to select data similar to the target domain from generic corpora for a specific language pair, and then use both the general parallel corpora (to train) and the domain-specific parallel corpora (to fine-tune) for MT. In line with this approach, we developed a language-agnostic Python tool implementing the methodology proposed by Sharami et al. (2022). This tool uses monolingual domain-specific corpora to generate a parallel in-domain corpus, facilitating data selection for DA.
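
The selection step described above can be illustrated with a minimal sketch. The abstract does not specify how the tool measures similarity, so this example swaps in a plain bag-of-words cosine similarity as a stand-in. The function names (`bow`, `cosine`, `select_in_domain`), the `top_k` parameter, and the toy English-German sentences are all hypothetical and are not taken from the tool itself.

```python
import math
from collections import Counter


def bow(sentence):
    """Lowercased whitespace tokens as a bag-of-words count vector."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors (0.0 if either is empty)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_in_domain(generic_pairs, in_domain_mono, top_k=1):
    """For each monolingual in-domain sentence, keep the top_k generic
    (source, target) pairs whose source side is most similar to it.

    Assumption: similarity is bag-of-words cosine, chosen only to keep
    the sketch dependency-free; it is not the tool's actual measure.
    """
    generic_vecs = [bow(src) for src, _ in generic_pairs]
    selected, seen = [], set()
    for query in in_domain_mono:
        q = bow(query)
        ranked = sorted(
            range(len(generic_pairs)),
            key=lambda i: cosine(q, generic_vecs[i]),
            reverse=True,
        )
        for i in ranked[:top_k]:
            if i not in seen:  # avoid selecting the same pair twice
                seen.add(i)
                selected.append(generic_pairs[i])
    return selected


# Toy data, invented for illustration only.
generic_pairs = [
    ("the patient was given a low dose of aspirin",
     "der Patient erhielt eine niedrige Dosis Aspirin"),
    ("the football match ended in a draw",
     "das Fußballspiel endete unentschieden"),
    ("the parliament approved the new budget",
     "das Parlament hat den neuen Haushalt verabschiedet"),
]
in_domain_mono = ["the patient received a high dose of ibuprofen"]

selected = select_in_domain(generic_pairs, in_domain_mono, top_k=1)
print(selected)
```

On this toy input, the medical sentence pair is the one selected. The output is a small parallel in-domain corpus built from monolingual in-domain text, which could then serve as fine-tuning data in the DA setup described above.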

    Original language: English
    Title of host publication: Proceedings of the 1st Workshop on Open Community-Driven Machine Translation
    Publisher: Universitat d’Alacant
    Pages: 35-36
    Number of pages: 2
    ISBN (Print): 978-84-1302-228-4
    Publication status: Published - 15 Jun 2023

    Keywords

    • Machine Translation
    • Domain-specific MT
    • Data Selection
    • Python
    • Machine Translation Tool
