Standardized Assessment of LLM English Proficiency
Abstract
Large language models (LLMs) are increasingly used in language learning and assessment, yet their English proficiency is seldom reported against interpretable proficiency standards. We adopt China's Standards of English Language Ability (CSE) as a framework for assessing the proficiency levels and subskills of LLMs. The resulting test, CSEBench, comprises 624 expert-annotated multiple-choice items spanning CSE Levels 2–7. Each item carries metadata, including a difficulty level and subskill labels covering vocabulary, syntax, phonology, and cohesion/discourse. Critically, the dataset also includes test responses from 2,050 middle school and sophomore college students who are learning English as a second language. We evaluate closed-source models, open-source baselines, and enhanced open-source variants that incorporate additional supervision and external knowledge. Results show a clear proficiency divide: after mapping model scores to CSE levels, closed-source models consistently reach CSE Level 6, whereas most open-source baselines cluster around CSE Levels 3–4. A follow-up cognitive diagnostic analysis reveals that while closed-source LLMs exhibit broad competence across subskills, open-source models display persistent deficits, most pronounced in phonology. Crucially, these weaknesses can be substantially reduced through targeted enhancements. CSEBench thus offers a proficiency-interpretable testbed for reporting LLM English ability and diagnosing subskill gaps.