Bidirectional Transformer-Based Neural Machine Translation for Amharic and Tigrinya: Bridging Morphological Complexity and Data Scarcity

Abstract

The growing reliance on artificial intelligence and digital platforms has made multilingual communication increasingly possible. However, languages such as Amharic and Tigrinya, widely spoken in Ethiopia and Eritrea, remain underrepresented in machine translation research due to their complex linguistic structures and limited digital resources. Previous efforts have predominantly employed statistical, rule-based, or hybrid models, which often fail to capture the intricate morphological and syntactic patterns of these languages. In this study, we develop a bidirectional neural machine translation system using a transformer architecture, optimized to accommodate the unique linguistic characteristics of both languages. To address data scarcity, we augmented the training corpus through back-translation and employed subword segmentation with SentencePiece to manage morphological richness. We conducted three core experiments: i) word-level tokenization on the small original dataset, ii) subword-level modeling, and iii) subword-level modeling with augmented data. The final setup yielded the highest performance, achieving BLEU scores of 44.32% (Amharic to Tigrinya) and 44.10% (Tigrinya to Amharic) with greedy decoding, while beam search decoding produced comparable results. These findings highlight the effectiveness of subword tokenization and data augmentation for machine translation in low-resource, morphologically complex languages. Future research may explore morpheme-level representations and advanced decoding techniques to further enhance translation quality.
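As a concrete illustration of the subword segmentation step the abstract describes, the sketch below trains a SentencePiece model on a text corpus and segments a sentence into subword units. The corpus file name, vocabulary size, model type, and sample sentence are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of subword segmentation with SentencePiece,
# assuming a hypothetical corpus file and an 8k vocabulary.
import sentencepiece as spm

# Train a subword model on the training text (one sentence per line).
# 'corpus.am-ti.txt' is a hypothetical combined Amharic-Tigrinya file.
spm.SentencePieceTrainer.train(
    input="corpus.am-ti.txt",   # hypothetical training corpus
    model_prefix="amti_sp",     # writes amti_sp.model and amti_sp.vocab
    vocab_size=8000,            # illustrative size for a low-resource setup
    model_type="unigram",       # SentencePiece's default subword algorithm
    character_coverage=1.0,     # retain full coverage of the Ge'ez script
)

# Load the trained model and segment a sentence into subword tokens.
sp = spm.SentencePieceProcessor(model_file="amti_sp.model")
tokens = sp.encode("ሰላም ለዓለም", out_type=str)
print(tokens)           # e.g. ['▁ሰላም', '▁ለ', 'ዓለም'] (actual split depends on the corpus)

# Segmentation is lossless: pieces reassemble into the original text,
# which matters for detokenized BLEU evaluation.
print(sp.decode(tokens))
```

In a setup like the one described, these subword units, rather than whole words, would form the transformer's vocabulary, keeping the vocabulary compact while still covering the rich inflectional morphology of both languages.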
