Assessing Sarcasm Dataset Quality
Abstract
Artificial intelligence (AI) models depend on high-quality data to maintain accuracy and ensure safe deployment. However, sarcasm poses a unique challenge for sentiment analysis (SA) because of its inherently ambiguous and context-dependent nature, which significantly degrades model performance. In this context, sarcasm detection plays a pivotal role in improving SA accuracy. Although considerable effort has been devoted to the task, most existing sarcasm detection systems still struggle with poorly annotated datasets and the inherent complexity of sarcastic language. To address this, we evaluate sarcasm data quality by benchmarking uniformly parameterized models across four distinct datasets: SARC, SemEval2022, NewsHeadline, and Multimodal. We conduct extensive evaluations using a three-model hierarchy (statistical machine learning, deep learning, and transfer learning models), alongside TF-IDF vectorization and word embeddings for text representation. To mitigate bias arising from class imbalance and unequal data distribution, we apply two resampling techniques, oversampling and undersampling, before conducting our experiments. Our findings reveal that the NewsHeadline dataset achieves superior performance, with RoBERTa attaining an F1-score of 0.93. Based on these insights, we compile and release a refined Sarcasm-Quality (SQ) dataset to advance future research in sarcasm-aware NLP systems.
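To make the evaluation setup concrete, the sketch below illustrates one branch of the pipeline described in the abstract: a statistical baseline with TF-IDF features and random oversampling of the minority class, scored by F1. It is a minimal illustration, not the authors' exact configuration; the file name `sarcasm_dataset.csv` and the column names `text` and `label` are hypothetical placeholders, and logistic regression stands in for whichever statistical classifier was used.

```python
# Minimal sketch of the benchmarking pipeline, assuming a CSV with
# hypothetical columns "text" and "label" (1 = sarcastic, 0 = not).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

df = pd.read_csv("sarcasm_dataset.csv")  # hypothetical file name
train, test = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)

# Oversample the minority class in the training split only,
# so the test distribution stays untouched.
majority_label = train["label"].value_counts().idxmax()
majority = train[train["label"] == majority_label]
minority = train[train["label"] != majority_label]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
train_bal = pd.concat([majority, minority_up])

# TF-IDF representation feeding a statistical classifier.
vectorizer = TfidfVectorizer(max_features=20_000, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_bal["text"])
X_test = vectorizer.transform(test["text"])

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train_bal["label"])
print("F1:", f1_score(test["label"], clf.predict(X_test)))
```

The same train/test splits and resampled training data can be reused for the deep learning and transfer learning (e.g., RoBERTa) models, keeping the comparison across datasets uniformly parameterized.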