Sentiment analysis of Large Language Models feedback: A multi-model comparative study in programming assessment
Abstract
Despite growing claims about the enhanced capabilities of successive generations of Large Language Models (LLMs), empirical evidence regarding differences in feedback quality and sentiment characteristics remains limited. This study systematically analyzes the sentiment and stylistic features of feedback generated by 18 contemporary LLMs across more than 6,000 student programming assignments. The analyzed models encompassed Anthropic’s claude-3-5-haiku, claude-opus-4-1, and claude-sonnet-4; Deepseek’s deepseek-chat and deepseek-reasoner; Google’s gemini-2.0-flash-lite, gemini-2.0-flash, gemini-2.5-flash-lite, gemini-2.5-flash, and gemini-2.5-pro; and OpenAI’s gpt-4.1-mini, gpt-4.1-nano, gpt-4.1, gpt-4o-mini, gpt-4o, gpt-5-mini, gpt-5-nano, and gpt-5. Using automated sentiment analysis with a RoBERTa-based classifier, the study quantified emotional tone distributions and examined relationships between sentiment, feedback length, assigned grades, and task characteristics. Results revealed substantial heterogeneity in feedback properties across models: average comment length ranged from 42 words (claude-3-5-haiku) to over 270 words (gemini-2.5-flash), while sentiment distributions differed markedly between providers. Hierarchical clustering identified two distinct groups based on sentiment patterns, although these did not correspond simply to model architecture or vendor categories. A strong positive correlation (r = 0.707) emerged between feedback sentiment and numerical grades, whereas negative feedback tended to be longer than positive comments. Inter-model consistency in sentiment evaluation was notably low (ICC = 0.061), indicating substantial variation in how different LLMs express evaluative judgments for identical student responses.
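The reported inter-model agreement statistic can be illustrated with a minimal sketch of ICC(2,1), the two-way random-effects, absolute-agreement intraclass correlation commonly applied to a subjects-by-raters matrix. The abstract does not state which ICC variant was used, so the choice of ICC(2,1) is an assumption, and the scores below are invented toy values (assignments as rows, models as columns), not data from the study.

```python
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an (n_subjects x k_raters) matrix -- here, one
    sentiment score per assignment (row) per model (column).
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    col_means = ratings.mean(axis=0)

    # Two-way ANOVA sums of squares
    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()  # between subjects
    ss_cols = n * ((col_means - grand) ** 2).sum()  # between raters
    ss_err = ss_total - ss_rows - ss_cols           # residual

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Toy data: 5 assignments scored by 3 hypothetical models
scores = np.array([
    [0.9, 0.8, 0.7],
    [0.2, 0.3, 0.1],
    [0.6, 0.5, 0.6],
    [0.4, 0.4, 0.2],
    [0.8, 0.9, 0.9],
])
print(icc2_1(scores))
```

An ICC near 1 would mean the models score identical student responses almost interchangeably; the paper's value of 0.061 sits near the bottom of the scale, consistent with the claim that models disagree substantially in their evaluative tone.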
These findings demonstrate that automated feedback carries implicit emotional valence that varies systematically across models, highlighting the importance of careful model selection and calibration in educational applications. The study provides quantitative evidence for distinctive feedback characteristics among contemporary LLMs and underscores the need to consider sentiment dimensions in AI-assisted assessment systems.