An Evaluation Framework for Dialectal Sentiment Classification and Linguistic Phenomena in Large Language Models
Abstract
Social media platforms provide individuals with a seamless way to share their opinions and interests using informal wording, creative spellings, local idioms, and frequent code-switching. This informal nature adds significant complexity to sentiment classification tasks. Recently, Large Language Models (LLMs) have shown promising capabilities in this area; however, previous research still lacks replicable and consistent evaluation protocols for assessing how these models reach their inferences, handle non-literal language, or explain the reasoning behind their decisions. To bridge this gap, this study introduces the Dialectal Sentiment Classification and Linguistic Phenomena (DSCLP) Framework, a four-phase protocol designed to analyze sentiment classification outcomes across one or more LLMs. The study applied DSCLP to a dataset of 1,469 Libyan-dialect social media posts, incorporating "Sarcastic" and "Ambiguous" as auxiliary labels alongside the traditional sentiment categories. Two LLMs, GPT-4o-mini and Gemini-1.5-flash, were examined under two prompting conditions: Model Default Inference (MDI) and Dialect-Aware Inference (DAI). Through each model's API, the models produced a sentiment label and a rationale for every sentence in the dataset. The experiments showed that both models achieved moderate performance under the two prompting conditions. When instructed under MDI, both models showed a bias toward Modern Standard Arabic (MSA) interpretations and continued to struggle with idioms, figurative language, and dual-sentiment expressions. Analysis of the generated rationales revealed that the LLMs frequently relied on literal readings rather than the cultural and contextual meanings of Libyan Arabic. By integrating performance metrics (Macro-F1, per-class F1, weighted-F1, and Cohen's κ), rationale evaluation, and bias analysis, the DSCLP protocol proved to be a practical and reproducible method for studying LLM behavior in low-resource dialect settings. Future work may apply the protocol to additional dialects and explore model fine-tuning.
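To illustrate the two prompting conditions, the sketch below shows how a single post could be sent to GPT-4o-mini through the OpenAI Python client under either an MDI-style or a DAI-style instruction. The prompt wording, the assumed label set (three standard sentiment classes plus the two auxiliary labels), and the helper name are illustrative assumptions, not the authors' exact prompts or code.

from openai import OpenAI  # official OpenAI Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical instructions for the two conditions; the study's exact prompts are not reproduced here.
MDI_PROMPT = ("Classify the sentiment of the following social media post as "
              "Positive, Negative, Neutral, Sarcastic, or Ambiguous, and briefly explain why.")
DAI_PROMPT = ("The following post is written in Libyan Arabic dialect. Taking dialectal idioms and "
              "cultural context into account, classify its sentiment as Positive, Negative, Neutral, "
              "Sarcastic, or Ambiguous, and briefly explain why.")

def classify(post: str, dialect_aware: bool = False) -> str:
    """Return the model's raw answer (a sentiment label plus a short rationale) for one post."""
    system_prompt = DAI_PROMPT if dialect_aware else MDI_PROMPT
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": post},
        ],
        temperature=0,  # keep outputs as deterministic as possible for reproducibility
    )
    return response.choices[0].message.content

The same loop would be repeated with the Gemini-1.5-flash API to obtain the second model's labels and rationales under both conditions.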
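The reported performance metrics (Macro-F1, per-class F1, weighted-F1, and Cohen's κ) can be computed with standard scikit-learn calls. A minimal sketch, assuming the gold and predicted labels for one model/condition pair are stored as parallel lists of strings and that the label set is the hypothetical one used above:

from sklearn.metrics import f1_score, cohen_kappa_score

# Assumed label set: three sentiment classes plus the two auxiliary labels.
LABELS = ["Positive", "Negative", "Neutral", "Sarcastic", "Ambiguous"]

def evaluate(gold: list[str], predicted: list[str]) -> dict:
    """Compute the evaluation metrics reported in the study for one model under one condition."""
    return {
        "macro_f1": f1_score(gold, predicted, labels=LABELS, average="macro"),
        "weighted_f1": f1_score(gold, predicted, labels=LABELS, average="weighted"),
        "per_class_f1": dict(zip(LABELS, f1_score(gold, predicted, labels=LABELS, average=None))),
        "cohens_kappa": cohen_kappa_score(gold, predicted, labels=LABELS),
    }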