Whole is Better: Examination on the Gold Training Data for Automatic Post-Editing 


Vol. 2,  No. 2, pp. 0-0, Oct.  2025
10.23246/AAIRJ.2025.02.02.01


  Abstract

Automatic post-editing (APE) aims to automatically correct errors in machine translation outputs while minimizing human intervention. Doing so requires a triplet dataset composed of the source sentence (src), the machine-translated sentence (mt), and the corrected version of the translated text (pe). Because such data are difficult to generate, numerous data augmentation studies have emerged. However, these studies predominantly use human-curated gold training data without sufficient investigation. In this study, we question this trend and point out that even gold data contain examples that are unnecessary for training. Our motivation stems from the nature of the APE task, which covers both cases where all tokens in the machine translation must be replaced and cases where the machine translation is already perfect and should not be revised. We define these as extreme cases and verify the effects obtained by filtering each of them. We demonstrate that, even with gold data, filtering out these extreme cases leads to considerable performance improvement, exceeding 5 BLEU points in some cases. We conducted experiments using officially released, human-curated training data from WMT20, WMT21, and WMT22 and observed a common phenomenon across all datasets.
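To make the notion of extreme cases concrete, the following is a minimal sketch of how such filtering might look on (src, mt, pe) triplets. It is not the authors' implementation: the abstract does not state the exact criterion, so the whitespace tokenization, the test for "already perfect" (mt and pe token sequences are identical), and the test for "all tokens replaced" (mt and pe share no token) are assumptions made for illustration. The function names `is_extreme_case` and `filter_extreme_cases` are hypothetical.

```python
from typing import Iterable, List, Tuple

Triplet = Tuple[str, str, str]  # (src, mt, pe)


def is_extreme_case(mt: str, pe: str) -> bool:
    """Return True if the pair looks like an extreme case.

    Assumption: whitespace tokenization and exact token matching.
    The paper may use a different measure (for example an edit-rate metric).
    """
    mt_tokens = mt.split()
    pe_tokens = pe.split()

    # Case 1 (assumed definition): mt is already perfect, so pe leaves it unchanged.
    unchanged = mt_tokens == pe_tokens

    # Case 2 (assumed definition): every mt token was replaced, so no token is shared.
    fully_rewritten = not (set(mt_tokens) & set(pe_tokens))

    return unchanged or fully_rewritten


def filter_extreme_cases(triplets: Iterable[Triplet]) -> List[Triplet]:
    """Keep only triplets whose mt was partially edited to obtain pe."""
    return [(src, mt, pe) for src, mt, pe in triplets if not is_extreme_case(mt, pe)]


if __name__ == "__main__":
    # Toy data invented for illustration only.
    data = [
        ("Die Katze sass.", "the cat sat", "the cat sat"),   # unchanged
        ("Die Katze sass.", "the cat sat", "a dog ran"),     # fully rewritten
        ("Die Katze sitzt.", "the cat sat", "the cat sits"), # partially edited
    ]
    for src, mt, pe in filter_extreme_cases(data):
        print(src, "|", mt, "|", pe)
```

Under these assumptions only the partially edited triplet survives. The abstract describes verifying the effect of filtering each extreme case separately, so in practice the two conditions could be exposed as independent switches rather than combined.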



  Cite this article

[IEEE Style]

H. Moon, S. Eo, J. Seo, C. Park, "Whole is Better: Examination on the Gold Training Data for Automatic Post-Editing," AAIRJ, vol. 2, no. 2, pp. 0-0, 2025. DOI: 10.23246/AAIRJ.2025.02.02.01.

[ACM Style]

Hyeonseok Moon, Sugyeong Eo, Jaehyung Seo, and Chanjun Park. 2025. Whole is Better: Examination on the Gold Training Data for Automatic Post-Editing. AAIRJ, 2, 2, (2025), 0-0. DOI: 10.23246/AAIRJ.2025.02.02.01.

[KICS Style]

Hyeonseok Moon, Sugyeong Eo, Jaehyung Seo, Chanjun Park, "Whole is Better: Examination on the Gold Training Data for Automatic Post-Editing," AAIRJ, vol. 2, no. 2, pp. 0-0, 2. 2025. (https://doi.org/10.23246/AAIRJ.2025.02.02.01)