Attention Amplification in Multilingual LLMs: Why Script Representation Matters

Abstract

Modern Large Language Models (LLMs) inherently exhibit a profound architectural bias toward English and other Latin-script languages, inadvertently erecting a severe “script barrier” for the vast majority of the world’s linguistic diversity. This barrier stems primarily from the inefficient subword tokenization of non-Roman scripts, such as Devanagari, where standard algorithms aggressively fragment text into high-fertility sequences. This fragmentation not only drastically shrinks the effective context window but also quadratically amplifies the computational cost of self-attention. To circumvent this tokenization bottleneck, this paper investigates romanization—the transliteration of native scripts into the Latin alphabet—as a highly efficient computational bridge. By aligning the input representation with the pre-existing orthographic strengths of English-centric models, romanization serves as a pragmatic interface layer rather than a linguistic replacement, fundamentally mitigating the computational penalties imposed by standard tokenizers. Our comprehensive empirical analysis, integrating a primary case study with findings from the ROMANSETU framework, demonstrates that romanizing Hindi text yields a consistent 2.5x to 4x reduction in token count. This efficiency directly translates to competitive or superior performance across a wide array of Natural Language Understanding (NLU) and Natural Language Generation (NLG) tasks, particularly in generative and knowledge-retrieval domains. Furthermore, we formalize this computational overhead by deriving an attention amplification factor, revealing that native Devanagari processing requires over an order of magnitude more attention computation per unit of semantic content compared to its Romanized equivalent. We also systematically characterize the limitations of this pipeline, notably the risks of transliteration error propagation and the nuanced performance degradation on complex morpho-syntactic reasoning tasks.
Ultimately, while romanization provides a powerful and immediately deployable strategy for enhancing multilingual AI efficiency, its necessity highlights the pressing, long-term requirement for fundamentally script-agnostic tokenization and multilingual model architectures.
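The attention amplification factor described above can be illustrated with a back-of-the-envelope calculation. This is a minimal sketch, not the paper's derivation: it assumes only the standard quadratic scaling of self-attention with sequence length and plugs in the 2.5x–4x token-reduction range reported in the abstract; the function name `attention_amplification` is illustrative.

```python
# Self-attention cost scales roughly with the square of sequence length,
# so a tokenizer that emits r times more tokens for the same text
# multiplies attention compute by roughly r**2.

def attention_amplification(fertility_ratio: float) -> float:
    """Attention-cost multiplier for native-script vs. romanized text,
    given the ratio of token counts produced for the same content."""
    return fertility_ratio ** 2

# Token-count reductions of 2.5x to 4x reported for romanized Hindi:
for r in (2.5, 3.0, 4.0):
    print(f"{r}x more tokens -> ~{attention_amplification(r):.1f}x attention compute")
```

At the upper end of the reported range, a 4x fertility ratio implies roughly a 16x attention-compute penalty for native Devanagari, consistent with the abstract's "over an order of magnitude" claim.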
