How LLMs Assess Public Speaking? Methodology of Explaining LLM Judgments through Linguistic Patterns and Rhetorical Criteria
Abstract
This study examines how Large Language Models, specifically GPT-4o-mini, evaluate public speaking performances based on textual transcripts. We collect new annotations of speeches from the 3MT_French dataset and compare GPT-4o-mini's annotations to those of an expert across both concrete rhetorical criteria and abstract subjective dimensions. In contrast to the expert, GPT-4o-mini exhibits limited cross-criterion integration and struggles with subjective judgments such as persuasiveness and creativity. Further, we utilise linguistic features to interpret the resulting annotations and demonstrate that GPT-4o-mini relies heavily on surface-level linguistic features, prioritising structural and stylistic markers, while expert annotations reflect broader discourse-level understanding and persuasive intent. These findings highlight the limitations of LLMs in high-level rhetorical evaluation and suggest the need for hybrid systems that combine model capabilities with theory-driven evaluation criteria. The annotated dataset and code are released to support future work in this direction: https://github.com/abarkar/How-LLMs-Assess-Public-Speaking-/tree/main
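To make the abstract's pipeline concrete, the sketch below shows one minimal way such a comparison could be set up: prompting GPT-4o-mini for a criterion score on a transcript, extracting cheap surface-level linguistic features, and correlating those features with a set of scores (the model's or the expert's). The prompt wording, criterion labels, and feature set are assumptions for illustration only; the paper's actual annotation protocol and feature inventory are not specified in this abstract.

```python
# Illustrative sketch only: prompt text, criterion names, and features below
# are assumptions, not the authors' actual annotation pipeline.
import re
from openai import OpenAI          # pip install openai
from scipy.stats import spearmanr  # pip install scipy

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CRITERIA = ["clarity", "structure", "persuasiveness", "creativity"]  # hypothetical labels


def llm_score(transcript: str, criterion: str) -> int:
    """Ask GPT-4o-mini for a 1-5 rating of one criterion on a speech transcript."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "You rate public-speaking transcripts. Reply with a single integer from 1 to 5."},
            {"role": "user",
             "content": f"Criterion: {criterion}\n\nTranscript:\n{transcript}"},
        ],
    )
    # Pull the first digit 1-5 out of the model's reply.
    return int(re.search(r"[1-5]", response.choices[0].message.content).group())


def surface_features(transcript: str) -> dict:
    """Cheap surface-level features of the kind the model appears to rely on."""
    words = transcript.split()
    sentences = [s for s in re.split(r"[.!?]+", transcript) if s.strip()]
    return {
        "n_words": len(words),
        "mean_sentence_len": len(words) / max(len(sentences), 1),
        "type_token_ratio": len({w.lower() for w in words}) / max(len(words), 1),
    }


def feature_score_correlation(transcripts, scores, feature_name):
    """Spearman correlation between one surface feature and a list of scores
    (LLM or expert), used to probe what each annotator is sensitive to."""
    feature_values = [surface_features(t)[feature_name] for t in transcripts]
    rho, p_value = spearmanr(feature_values, scores)
    return rho, p_value
```

Comparing the resulting correlations for LLM-produced versus expert-produced scores is one simple way to surface the abstract's finding that the model tracks surface markers more closely than the expert does.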