Abstract
As the volume of data for Machine Translation (MT) grows, the need for models that perform well in specific use cases, such as patent and medical translation, becomes increasingly important. Generic models often perform poorly in such cases because they fail to handle domain-specific style and terminology. High translation quality for a domain-specific use case is achieved effectively only when MT systems are trained on datasets that cover domains similar to the target domain (Wang et al., 2017; Pourmostafa Roshan Sharami et al., 2021; Pourmostafa Roshan Sharami et al., 2022). This highlights a limitation of data-driven MT trained on general-domain data, regardless of dataset size.
To address this challenge, researchers have implemented various strategies to improve domain-specific translation using Domain Adaptation (DA) methods (Saunders, 2022; Sharami et al., 2023). The DA process involves initially training a generic model, which is then fine-tuned using a domain-specific dataset (Chu and Wang, 2018). One approach to generating a domain-specific dataset is to select similar data from generic corpora for a specific language pair and then utilize both general (to train) and domain-specific (to fine-tune) parallel corpora for MT. In line with this approach, we developed a language-agnostic Python tool implementing the methodology proposed by Sharami et al. (2022). This tool uses monolingual domain-specific corpora to generate a parallel in-domain corpus, facilitating data selection for DA.
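The data selection idea described above can be illustrated with a minimal sketch. It is not the tool's implementation: a simple bag-of-words cosine similarity is swapped in as the similarity measure, and the function names (`bow`, `cosine`, `select_in_domain`), the `top_k` parameter, and the toy sentences are all hypothetical. The sketch ranks generic parallel sentence pairs by how similar their source side is to each monolingual in-domain sentence and keeps the best matches as a pseudo in-domain parallel corpus.

```python
import math
from collections import Counter


def bow(sentence):
    """Lowercased bag-of-words term counts for one sentence."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_in_domain(generic_pairs, in_domain_mono, top_k=1):
    """For each monolingual in-domain sentence, keep the top_k most similar
    generic (source, target) pairs, compared on the source side only."""
    indexed = [(bow(src), (src, tgt)) for src, tgt in generic_pairs]
    selected = []
    for query in in_domain_mono:
        q = bow(query)
        ranked = sorted(indexed, key=lambda item: cosine(q, item[0]), reverse=True)
        for _, pair in ranked[:top_k]:
            if pair not in selected:  # avoid duplicates across queries
                selected.append(pair)
    return selected


# Toy data (assumed for illustration): a generic parallel corpus and
# a monolingual in-domain (medical) corpus.
generic_pairs = [
    ("the patient received a dose of insulin", "le patient a reçu une dose d'insuline"),
    ("the football match ended in a draw", "le match de football s'est terminé par un match nul"),
    ("the dose was reduced after surgery", "la dose a été réduite après la chirurgie"),
]
in_domain_mono = ["insulin dose for the patient"]

pseudo_in_domain = select_in_domain(generic_pairs, in_domain_mono, top_k=1)
```

In a DA setup, the full generic corpus would be used to train the base model and the selected pairs would then serve as the fine-tuning set. A practical system would likely replace the bag-of-words comparison with a stronger similarity measure and an efficient search index.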
Original language | English |
---|---|
Title of host publication | Proceedings of the 1st Workshop on Open Community-Driven Machine Translation |
Publisher | Universitat d’Alacant |
Pages | 35-36 |
Number of pages | 2 |
ISBN (Print) | 978-84-1302-228-4 |
Publication status | Published - 15 Jun 2023 |
Keywords
- Machine Translation
- Domain-specific MT
- Data Selection
- Python
- Machine Translation Tool