Optimizing T5 for Lightweight Tibetan-English Translation
Abstract
We present the first lightweight Tibetan-English machine translation models optimized for low-resource settings and edge deployment. Our approach combines (1) a custom tokenizer trained on Tibetan script, (2) continued pretraining on Tibetan-English corpora, and (3) supervised fine-tuning on domain-specific translation pairs. Through ablation studies, we quantify each component’s contribution to translation quality; both the custom tokenizer and continued pretraining significantly improve performance, especially at small data scales. This work establishes the first strong baseline for Tibetan-English translation with compact models and offers a practical framework for other underrepresented, non-Latin-script languages.
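For readers who want a concrete picture of how such a pipeline fits together, the sketch below wires the three components using SentencePiece and Hugging Face Transformers. The corpus file, the t5-small checkpoint, and all hyperparameters are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch of the three-stage recipe, assuming SentencePiece and
# Hugging Face Transformers. Corpus path, base model, and hyperparameters
# are hypothetical placeholders.
import sentencepiece as spm
from transformers import T5ForConditionalGeneration, T5Tokenizer

# (1) Custom tokenizer: train a subword vocabulary directly on Tibetan text
# so the script is not shattered into unknown tokens by a Latin-centric vocab.
spm.SentencePieceTrainer.train(
    input="tibetan_english_corpus.txt",  # hypothetical mixed-language corpus
    model_prefix="bo_en_sp",
    vocab_size=32000,
    character_coverage=1.0,                    # retain every Tibetan character
    pad_id=0, eos_id=1, unk_id=2, bos_id=-1,  # match T5's special-token layout
)
tokenizer = T5Tokenizer(vocab_file="bo_en_sp.model")

# (2)/(3) Continued pretraining and supervised fine-tuning both reduce to
# seq2seq training once the embeddings are resized to the new vocabulary.
model = T5ForConditionalGeneration.from_pretrained("t5-small")
model.resize_token_embeddings(len(tokenizer))

# One supervised step on a hypothetical translation pair:
src = tokenizer("translate Tibetan to English: བཀྲ་ཤིས་བདེ་ལེགས།",
                return_tensors="pt")
tgt = tokenizer("Greetings (tashi delek).", return_tensors="pt")
loss = model(input_ids=src.input_ids,
             attention_mask=src.attention_mask,
             labels=tgt.input_ids).loss
loss.backward()  # a real run would loop over batches with an optimizer
```

In this sketch, stages (2) and (3) differ only in the data fed to the training loop: broad Tibetan-English text for continued pretraining, then domain-specific pairs for fine-tuning.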