Optimizing T5 for Lightweight Tibetan-English Translation
Abstract
We present the first lightweight Tibetan-English machine translation models optimized for low-resource settings and edge deployment. Our approach combines (1) a custom tokenizer trained on Tibetan script, (2) continued pretraining on Tibetan-English corpora, and (3) supervised fine-tuning on domain-specific translation pairs. Through ablation studies, we quantify each component’s contribution to translation quality; both the custom tokenizer and continued pretraining significantly improve performance, especially at small data scales. This work establishes the first strong baseline for Tibetan-English translation with compact models and offers a practical framework for other underrepresented, non-Latin-script languages.
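For readers who want a concrete picture of how such a pipeline fits together, the sketch below wires the three components using SentencePiece and Hugging Face Transformers. The corpus file, the t5-small checkpoint, and all hyperparameters are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch of the three-stage recipe, assuming SentencePiece and
# Hugging Face Transformers. Corpus path, base model, and hyperparameters
# are hypothetical placeholders.
import sentencepiece as spm
from transformers import T5ForConditionalGeneration, T5Tokenizer

# (1) Custom tokenizer: train a subword vocabulary directly on Tibetan text
# so the script is not shattered into unknown tokens by a Latin-centric vocab.
spm.SentencePieceTrainer.train(
    input="tibetan_english_corpus.txt",  # hypothetical mixed-language corpus
    model_prefix="bo_en_sp",
    vocab_size=32000,
    character_coverage=1.0,                    # retain every Tibetan character
    pad_id=0, eos_id=1, unk_id=2, bos_id=-1,  # match T5's special-token layout
)
tokenizer = T5Tokenizer(vocab_file="bo_en_sp.model")

# (2)/(3) Continued pretraining and supervised fine-tuning both reduce to
# seq2seq training once the embeddings are resized to the new vocabulary.
model = T5ForConditionalGeneration.from_pretrained("t5-small")
model.resize_token_embeddings(len(tokenizer))

# One supervised step on a hypothetical translation pair:
src = tokenizer("translate Tibetan to English: བཀྲ་ཤིས་བདེ་ལེགས།",
                return_tensors="pt")
tgt = tokenizer("Greetings (tashi delek).", return_tensors="pt")
loss = model(input_ids=src.input_ids,
             attention_mask=src.attention_mask,
             labels=tgt.input_ids).loss
loss.backward()  # a real run would loop over batches with an optimizer
```

In this sketch, stages (2) and (3) differ only in the data fed to the training loop: broad Tibetan-English text for continued pretraining, then domain-specific pairs for fine-tuning.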