Measuring the Information Density of Interlanguage: An Entropy Analysis


Abstract

Interlanguage development is often assessed through structural counts that only partially capture how learner language is organized probabilistically. This study proposes a multi-level framework for measuring interlanguage information density using entropy-based metrics. A corpus of 150 L2 English argumentative essays from B1, B2, and C1 learners was compared with a genre-matched native-speaker corpus of 50 essays. Four indicators were examined: lexical entropy (Hₗₑₓ), grammatical divergence from a native reference distribution via POS trigrams (KL₍gram₎), compression ratio (CR), and positional concentration index (PCI). To model native variability more defensibly, KL₍gram₎ for each L1 text was calculated against a leave-one-out L1 reference distribution. Results showed a clear developmental gradient: lexical entropy and positional concentration increased with proficiency, whereas grammatical divergence and compression ratio decreased. Mixed-effects models confirmed that these shifts were robust effects of proficiency. The findings support a probabilistic view of interlanguage development and offer a principled diagnostic framework for evaluating communicative efficiency in L2 writing.
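The metrics named in the abstract can be sketched computationally. The following is a minimal Python illustration, not the paper's implementation: it computes lexical entropy as Shannon entropy over word frequencies, compression ratio via zlib, and KL divergence over POS trigrams with a leave-one-out native reference, as the abstract describes for KL₍gram₎. The add-α smoothing constant and all function names are illustrative assumptions; the positional concentration index (PCI) is omitted because its definition is given only in the full paper.

```python
import math
import zlib
from collections import Counter

def lexical_entropy(tokens):
    """Shannon entropy (bits) of the word-frequency distribution (H_lex)."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def compression_ratio(text):
    """Compressed size over raw size; more redundant text compresses further."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)

def pos_trigrams(tags):
    """Counter of POS-tag trigrams for one text."""
    return Counter(zip(tags, tags[1:], tags[2:]))

def kl_divergence(p_counts, q_counts, alpha=0.5):
    """KL(P || Q) in bits over trigram counts, with add-alpha smoothing
    (alpha is an illustrative choice, not the paper's)."""
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    kl = 0.0
    for t in vocab:
        p = (p_counts.get(t, 0) + alpha) / p_total
        q = (q_counts.get(t, 0) + alpha) / q_total
        kl += p * math.log2(p / q)
    return kl

def loo_kl(l1_tag_sequences):
    """KL of each native text against the pooled counts of all *other*
    native texts (the leave-one-out reference the abstract describes)."""
    per_text = [pos_trigrams(tags) for tags in l1_tag_sequences]
    totals = Counter()
    for c in per_text:
        totals.update(c)
    # Counter subtraction drops keys that fall to zero; smoothing in
    # kl_divergence handles trigrams unseen in the reference.
    return [kl_divergence(c, totals - c) for c in per_text]
```

For learner texts, the same `kl_divergence` would instead be computed against the full pooled native distribution, since no learner text contributes to the reference.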