Sentiment analysis of Large Language Models feedback: A multi-model comparative study in programming assessment
Abstract
Despite growing claims about the enhanced capabilities of successive generations of Large Language Models (LLMs), empirical evidence regarding differences in feedback quality and sentiment characteristics remains limited. This study systematically analyzes the sentiment and stylistic features of feedback generated by 18 contemporary LLMs across more than 6,000 student programming assignments. The analyzed models encompassed Anthropic’s claude-3-5-haiku, claude-opus-4-1, and claude-sonnet-4; Deepseek’s deepseek-chat and deepseek-reasoner; Google’s gemini-2.0-flash-lite, gemini-2.0-flash, gemini-2.5-flash-lite, gemini-2.5-flash, and gemini-2.5-pro; and OpenAI’s gpt-4.1-mini, gpt-4.1-nano, gpt-4.1, gpt-4o-mini, gpt-4o, gpt-5-mini, gpt-5-nano, and gpt-5. Using automated sentiment analysis with a RoBERTa-based classifier, the study quantified emotional tone distributions and examined relationships between sentiment, feedback length, assigned grades, and task characteristics. Results revealed substantial heterogeneity in feedback properties across models: average comment length ranged from 42 words (claude-3-5-haiku) to over 270 words (gemini-2.5-flash), while sentiment distributions differed markedly between providers. Hierarchical clustering identified two distinct groups based on sentiment patterns, although these did not correspond simply to model architecture or vendor categories. A strong positive correlation (r = 0.707) emerged between feedback sentiment and numerical grades, whereas negative feedback tended to be longer than positive comments. Inter-model consistency in sentiment evaluation was notably low (ICC = 0.061), indicating substantial variation in how different LLMs express evaluative judgments for identical student responses.
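The reported inter-model agreement statistic can be illustrated with a minimal sketch of ICC(2,1), the two-way random-effects, absolute-agreement intraclass correlation commonly applied to a subjects-by-raters matrix. The abstract does not state which ICC variant was used, so the choice of ICC(2,1) is an assumption, and the scores below are invented toy values (assignments as rows, models as columns), not data from the study.

```python
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an (n_subjects x k_raters) matrix -- here, one
    sentiment score per assignment (row) per model (column).
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    col_means = ratings.mean(axis=0)

    # Two-way ANOVA sums of squares
    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()  # between subjects
    ss_cols = n * ((col_means - grand) ** 2).sum()  # between raters
    ss_err = ss_total - ss_rows - ss_cols           # residual

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Toy data: 5 assignments scored by 3 hypothetical models
scores = np.array([
    [0.9, 0.8, 0.7],
    [0.2, 0.3, 0.1],
    [0.6, 0.5, 0.6],
    [0.4, 0.4, 0.2],
    [0.8, 0.9, 0.9],
])
print(icc2_1(scores))
```

An ICC near 1 would mean the models score identical student responses almost interchangeably; the paper's value of 0.061 sits near the bottom of the scale, consistent with the claim that models disagree substantially in their evaluative tone.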
These findings demonstrate that automated feedback carries implicit emotional valence that varies systematically across models, highlighting the importance of careful model selection and calibration in educational applications. The study provides quantitative evidence for distinctive feedback characteristics among contemporary LLMs and underscores the need to consider sentiment dimensions in AI-assisted assessment systems.