How LLMs Assess Public Speaking? Methodology of Explaining LLM Judgments through Linguistic Patterns and Rhetorical Criteria
Abstract
This study examines how Large Language Models, specifically GPT-4o-mini, evaluate public speaking performances based on textual transcripts. We collect new annotations of speeches from the 3MT_French dataset and compare GPT-4o-mini's annotations to those of an expert across both concrete rhetorical criteria and abstract subjective dimensions. In contrast to the expert, GPT-4o-mini exhibits limited cross-criterion integration and struggles with subjective judgments such as persuasiveness and creativity. Further, we utilise linguistic features to interpret the resulting annotations and demonstrate that GPT-4o-mini relies heavily on surface-level linguistic features, prioritising structural and stylistic markers, while expert annotations reflect broader discourse-level understanding and persuasive intent. These findings highlight the limitations of LLMs in high-level rhetorical evaluation and suggest the need for hybrid systems that combine model capabilities with theory-driven evaluation criteria. The annotated dataset and code are released to support future work in this direction: https://github.com/abarkar/How-LLMs-Assess-Public-Speaking-/tree/main
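To make the abstract's pipeline concrete, the sketch below shows one minimal way such a comparison could be set up: prompting GPT-4o-mini for a criterion score on a transcript, extracting cheap surface-level linguistic features, and correlating those features with a set of scores (the model's or the expert's). The prompt wording, criterion labels, and feature set are assumptions for illustration only; the paper's actual annotation protocol and feature inventory are not specified in this abstract.

```python
# Illustrative sketch only: prompt text, criterion names, and features below
# are assumptions, not the authors' actual annotation pipeline.
import re
from openai import OpenAI          # pip install openai
from scipy.stats import spearmanr  # pip install scipy

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CRITERIA = ["clarity", "structure", "persuasiveness", "creativity"]  # hypothetical labels


def llm_score(transcript: str, criterion: str) -> int:
    """Ask GPT-4o-mini for a 1-5 rating of one criterion on a speech transcript."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "You rate public-speaking transcripts. Reply with a single integer from 1 to 5."},
            {"role": "user",
             "content": f"Criterion: {criterion}\n\nTranscript:\n{transcript}"},
        ],
    )
    # Pull the first digit 1-5 out of the model's reply.
    return int(re.search(r"[1-5]", response.choices[0].message.content).group())


def surface_features(transcript: str) -> dict:
    """Cheap surface-level features of the kind the model appears to rely on."""
    words = transcript.split()
    sentences = [s for s in re.split(r"[.!?]+", transcript) if s.strip()]
    return {
        "n_words": len(words),
        "mean_sentence_len": len(words) / max(len(sentences), 1),
        "type_token_ratio": len({w.lower() for w in words}) / max(len(words), 1),
    }


def feature_score_correlation(transcripts, scores, feature_name):
    """Spearman correlation between one surface feature and a list of scores
    (LLM or expert), used to probe what each annotator is sensitive to."""
    feature_values = [surface_features(t)[feature_name] for t in transcripts]
    rho, p_value = spearmanr(feature_values, scores)
    return rho, p_value
```

Comparing the resulting correlations for LLM-produced versus expert-produced scores is one simple way to surface the abstract's finding that the model tracks surface markers more closely than the expert does.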