CrossLingBench: A Comprehensive Evaluation of Large Language Models on Multilingual NLP Tasks Across Languages and Prompting Strategies
Abstract
Multilingual natural language processing requires large language models (LLMs) to correctly understand, reason, and respond in languages beyond English, yet the interaction between language choice, task type, and prompting strategy remains poorly characterised. We introduce CrossLingBench, a comprehensive benchmark evaluating LLMs across four NLP task types (sentiment analysis, entity-type classification, factual QA, topic classification) in four non-English languages (French, Spanish, German, Chinese) at three difficulty levels, yielding 384 controlled evaluation instances per model. We define three automatic metrics: Cross-lingual Accuracy Score (CAS), Language Self-Adherence (LSA), and Task Concept Alignment (TCA), combined into the CLB composite. Extensive experiments with three open LLMs under four prompting strategies (direct, CoT, translate-first, native-CoT) reveal that: (i) native-language chain-of-thought achieves the best overall CLB, outperforming direct prompting by +0.11 on average; (ii) the translate-first strategy—commonly recommended by practitioners—actually degrades performance for high-resource European languages; (iii) Chinese exhibits distinct challenges in language adherence, with LSA dropping to 0.611 under direct prompting; and (iv) language-switching behaviour, where models revert to English mid-response, is a systematic failure mode requiring explicit mitigation. Our analysis provides actionable recommendations for multilingual LLM deployment and identifies critical gaps in cross-lingual reasoning capabilities.
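The abstract names the three component metrics but does not state how they are aggregated into the CLB composite. A minimal sketch, assuming an equal-weight mean over metrics that lie in [0, 1] (an assumption, not the paper's stated formula), could look like:

```python
def clb_composite(cas: float, lsa: float, tca: float) -> float:
    """Combine Cross-lingual Accuracy Score (CAS), Language Self-Adherence
    (LSA), and Task Concept Alignment (TCA) into a single composite.

    Equal weighting is assumed here for illustration; the paper's actual
    aggregation may differ."""
    for score in (cas, lsa, tca):
        if not 0.0 <= score <= 1.0:
            raise ValueError("metric scores are assumed to lie in [0, 1]")
    return (cas + lsa + tca) / 3.0
```

Under this assumed formula, a model with perfect task accuracy but the reported Chinese LSA of 0.611 would see its composite pulled down accordingly, which is why language adherence matters even when answers are correct.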