CrossLingBench: A Comprehensive Evaluation of Large Language Models on Multilingual NLP Tasks Across Languages and Prompting Strategies
Abstract
Multilingual natural language processing requires large language models (LLMs) to correctly understand, reason, and respond in languages beyond English, yet the interaction between language choice, task type, and prompting strategy remains poorly characterised. We introduce CrossLingBench, a comprehensive benchmark evaluating LLMs across four NLP task types (sentiment analysis, entity-type classification, factual QA, topic classification) in four non-English languages (French, Spanish, German, Chinese) at three difficulty levels, yielding 384 controlled evaluation instances per model. We define three automatic metrics: Cross-lingual Accuracy Score (CAS), Language Self-Adherence (LSA), and Task Concept Alignment (TCA), combined into the CLB composite. Extensive experiments with three open LLMs under four prompting strategies (direct, CoT, translate-first, native-CoT) reveal that: (i) native-language chain-of-thought achieves the best overall CLB, outperforming direct prompting by +0.11 on average; (ii) the translate-first strategy—commonly recommended by practitioners—actually degrades performance for high-resource European languages; (iii) Chinese exhibits distinct challenges in language adherence, with LSA dropping to 0.611 under direct prompting; and (iv) language-switching behaviour, where models revert to English mid-response, is a systematic failure mode requiring explicit mitigation. Our analysis provides actionable recommendations for multilingual LLM deployment and identifies critical gaps in cross-lingual reasoning capabilities.
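The abstract names the three component metrics but does not state how they are aggregated into the CLB composite. A minimal sketch, assuming an equal-weight mean over metrics that lie in [0, 1] (an assumption, not the paper's stated formula), could look like:

```python
def clb_composite(cas: float, lsa: float, tca: float) -> float:
    """Combine Cross-lingual Accuracy Score (CAS), Language Self-Adherence
    (LSA), and Task Concept Alignment (TCA) into a single composite.

    Equal weighting is assumed here for illustration; the paper's actual
    aggregation may differ."""
    for score in (cas, lsa, tca):
        if not 0.0 <= score <= 1.0:
            raise ValueError("metric scores are assumed to lie in [0, 1]")
    return (cas + lsa + tca) / 3.0
```

Under this assumed formula, a model with perfect task accuracy but the reported Chinese LSA of 0.611 would see its composite pulled down accordingly, which is why language adherence matters even when answers are correct.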