An Evaluation Framework for Dialectal Sentiment Classification and Linguistic Phenomena in Large Language Models
Abstract
Social media platforms provide individuals with a seamless way to share their opinions and interests using informal wording, creative spellings, local idioms, and frequent code-switching. This informal nature adds significant complexity to sentiment classification tasks. Recently, Large Language Models (LLMs) have shown promising capabilities in this area; however, previous research still lacks replicable and consistent evaluation protocols for assessing how these models reach their inferences, handle non-literal language, or explain the reasoning behind their decisions. To bridge this gap, this study introduces the Dialectal Sentiment Classification and Linguistic Phenomena (DSCLP) Framework, a four-phase protocol designed to analyze sentiment classification outcomes across one or more LLMs. The study applied DSCLP to a dataset of 1,469 Libyan-dialect social media posts, incorporating "Sarcastic" and "Ambiguous" as auxiliary labels alongside the traditional sentiment categories. Two LLMs, GPT-4o-mini and Gemini-1.5-flash, were examined under two prompting conditions: Model Default Inference (MDI) and Dialect-Aware Inference (DAI). Through each model's API, the models produced a sentiment label and a rationale for every sentence in the dataset. The experiments showed that both models achieved moderate performance under the two prompting conditions. When instructed under MDI, both models showed a bias toward Modern Standard Arabic (MSA) interpretations and continued to struggle with idioms, figurative language, and dual-sentiment expressions. Analysis of the generated rationales revealed that the LLMs frequently relied on literal readings rather than the cultural and contextual meanings of Libyan Arabic. By integrating performance metrics (Macro-F1, per-class F1, weighted-F1, and Cohen's κ), rationale evaluation, and bias analysis, the DSCLP protocol proved to be a practical and reproducible method for studying LLM behavior in low-resource dialect settings. Future work may apply the protocol to additional dialects and explore model fine-tuning.
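To illustrate the two prompting conditions, the sketch below shows how a single post could be sent to GPT-4o-mini through the OpenAI Python client under either an MDI-style or a DAI-style instruction. The prompt wording, the assumed label set (three standard sentiment classes plus the two auxiliary labels), and the helper name are illustrative assumptions, not the authors' exact prompts or code.

from openai import OpenAI  # official OpenAI Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical instructions for the two conditions; the study's exact prompts are not reproduced here.
MDI_PROMPT = ("Classify the sentiment of the following social media post as "
              "Positive, Negative, Neutral, Sarcastic, or Ambiguous, and briefly explain why.")
DAI_PROMPT = ("The following post is written in Libyan Arabic dialect. Taking dialectal idioms and "
              "cultural context into account, classify its sentiment as Positive, Negative, Neutral, "
              "Sarcastic, or Ambiguous, and briefly explain why.")

def classify(post: str, dialect_aware: bool = False) -> str:
    """Return the model's raw answer (a sentiment label plus a short rationale) for one post."""
    system_prompt = DAI_PROMPT if dialect_aware else MDI_PROMPT
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": post},
        ],
        temperature=0,  # keep outputs as deterministic as possible for reproducibility
    )
    return response.choices[0].message.content

The same loop would be repeated with the Gemini-1.5-flash API to obtain the second model's labels and rationales under both conditions.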
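The reported performance metrics (Macro-F1, per-class F1, weighted-F1, and Cohen's κ) can be computed with standard scikit-learn calls. A minimal sketch, assuming the gold and predicted labels for one model/condition pair are stored as parallel lists of strings and that the label set is the hypothetical one used above:

from sklearn.metrics import f1_score, cohen_kappa_score

# Assumed label set: three sentiment classes plus the two auxiliary labels.
LABELS = ["Positive", "Negative", "Neutral", "Sarcastic", "Ambiguous"]

def evaluate(gold: list[str], predicted: list[str]) -> dict:
    """Compute the evaluation metrics reported in the study for one model under one condition."""
    return {
        "macro_f1": f1_score(gold, predicted, labels=LABELS, average="macro"),
        "weighted_f1": f1_score(gold, predicted, labels=LABELS, average="weighted"),
        "per_class_f1": dict(zip(LABELS, f1_score(gold, predicted, labels=LABELS, average=None))),
        "cohens_kappa": cohen_kappa_score(gold, predicted, labels=LABELS),
    }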