Attention Amplification in Multilingual LLMs: Why Script Representation Matters

Abstract

Modern Large Language Models (LLMs) inherently exhibit a profound architectural bias toward English and other Latin-script languages, inadvertently erecting a severe “script barrier” for the vast majority of the world’s linguistic diversity. This barrier stems primarily from the inefficient subword tokenization of non-Roman scripts, such as Devanagari, where standard algorithms aggressively fragment text into high-fertility sequences. This fragmentation not only drastically shrinks the effective context window but also quadratically amplifies the computational cost of self-attention. To circumvent this tokenization bottleneck, this paper investigates romanization—the transliteration of native scripts into the Latin alphabet—as a highly efficient computational bridge. By aligning the input representation with the pre-existing orthographic strengths of English-centric models, romanization serves as a pragmatic interface layer rather than a linguistic replacement, fundamentally mitigating the computational penalties imposed by standard tokenizers. Our comprehensive empirical analysis, integrating a primary case study with findings from the ROMANSETU framework, demonstrates that romanizing Hindi text yields a consistent 2.5x to 4x reduction in token count. This efficiency directly translates to competitive or superior performance across a wide array of Natural Language Understanding (NLU) and Natural Language Generation (NLG) tasks, particularly in generative and knowledge-retrieval domains. Furthermore, we formalize this computational overhead by deriving an attention amplification factor, revealing that native Devanagari processing requires over an order of magnitude more attention computation per unit of semantic content compared to its Romanized equivalent. We also systematically characterize the limitations of this pipeline, notably the risks of transliteration error propagation and the nuanced performance degradation on complex morpho-syntactic reasoning tasks.
Ultimately, while romanization provides a powerful and immediately deployable strategy for enhancing multilingual AI efficiency, its necessity highlights the pressing, long-term requirement for fundamentally script-agnostic tokenization and multilingual model architectures.
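The attention amplification factor described above can be illustrated with a back-of-the-envelope calculation. This is a minimal sketch, not the paper's derivation: it assumes only the standard quadratic scaling of self-attention with sequence length and plugs in the 2.5x–4x token-reduction range reported in the abstract; the function name `attention_amplification` is illustrative.

```python
# Self-attention cost scales roughly with the square of sequence length,
# so a tokenizer that emits r times more tokens for the same text
# multiplies attention compute by roughly r**2.

def attention_amplification(fertility_ratio: float) -> float:
    """Attention-cost multiplier for native-script vs. romanized text,
    given the ratio of token counts produced for the same content."""
    return fertility_ratio ** 2

# Token-count reductions of 2.5x to 4x reported for romanized Hindi:
for r in (2.5, 3.0, 4.0):
    print(f"{r}x more tokens -> ~{attention_amplification(r):.1f}x attention compute")
```

At the upper end of the reported range, a 4x fertility ratio implies roughly a 16x attention-compute penalty for native Devanagari, consistent with the abstract's "over an order of magnitude" claim.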
