ChavacanoMT: A Corpus and Evaluation of Neural Machine Translation for Philippine Creole Spanish
Abstract
Chavacano, formally referred to as Philippine Creole Spanish, is the only Creole spoken in the Philippines. As with many languages, especially Creoles, computational studies of Chavacano are scarce because few corpora are available. This paper describes the creation of ChavacanoMT, a benchmark corpus for machine translation research on Philippine Creole Spanish. ChavacanoMT consists of 767,053 parallel sentences between Chavacano and related languages: Spanish, Cebuano, Hiligaynon, Tagalog, and English. It is sourced from scraped Bible translations and articles on the Jehovah’s Witnesses website. This paper also presents the performance of a multilingual neural machine translation model trained on ChavacanoMT. We report an overall BLEU score of 17 for a fine-tuned mT5 model, which outperforms an mT5-based model trained from scratch. Our experiments show that ChavacanoMT can produce models on par with a similar system that translates between English and some Philippine languages, despite using fewer training sentences. We also report improved Chavacano translation to and from its related languages, which can serve as benchmark data. In particular, we highlight an improvement of more than 20 BLEU points in translation between Chavacano and English. The study opens avenues for exploring how cross-linguistic interactions between Chavacano and its related languages affect translation, which may benefit other low-resource languages.
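For readers unfamiliar with the metric, the BLEU scores quoted above can be illustrated with a minimal sketch. The abstract does not say which BLEU implementation or tokenization was used, so the following is an assumption-laden illustration rather than the authors' evaluation code: it uses whitespace tokenization, a single reference per sentence, uniform weights over 1- to 4-grams, and no smoothing. The names `ngram_counts` and `corpus_bleu` are hypothetical, and the example strings are toy placeholders, not ChavacanoMT data.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale.

    Assumptions (not taken from the paper): whitespace tokenization,
    one reference per hypothesis, uniform n-gram weights, no smoothing.
    """
    matched = [0] * max_n  # clipped n-gram matches, per order
    total = [0] * max_n    # hypothesis n-grams, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp_tok, n)
            ref_ngrams = ngram_counts(ref_tok, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matched[n - 1] += sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
            total[n - 1] += sum(hyp_ngrams.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matched) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # The brevity penalty discourages hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    # Toy placeholder sentences for illustration only.
    print(corpus_bleu(["a b c d e"], ["a b c d f"]))
```

Published scores are normally computed with a standard tool such as sacreBLEU, whose tokenization and smoothing choices differ from this sketch, so numbers from this function are not directly comparable to the figures reported in the abstract.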