Disambiguating sentiment annotation: Using mixed methods to understand annotator experience and impact of instructions on annotation quality
Abstract
Human-annotated datasets are central to the development and evaluation of sentiment analysis and other text classification systems. However, many existing datasets suffer from low annotator agreement and errors, raising concerns about the quality of the data that computational systems should learn from or align with. Improving annotation quality demands close examination of how datasets are created and how annotators interpret and approach the task. To this end, we create AmbiSent, a new sentiment dataset designed to capture cases of interpretive complexity that commonly challenge both annotators and computational models. Using a mixed-methods approach, we investigate how annotation instructions influence label consistency and annotator experience. Two groups of crowdworkers annotated the dataset under either minimal or detailed instructions, allowing comparison of inter- and intra-annotator agreement and annotation rationales. Our findings reveal that detailed instructions alone do not ensure more consistent annotations, either across or within individuals. Their effectiveness appears contingent on participants’ level of task engagement and the extent to which the instructions align with intuitive annotation strategies. Even with detailed guidance, participants often defaulted to simplifying the task. However, we also identified sentence types for which detailed instructions may improve annotation quality (e.g., sentences with perspective-dependent sentiment, rhetorical questions, and sentences containing sarcasm). A thematic analysis of open-ended responses further contextualised these findings, offering insights into the cognitive effort annotators invested and the practical challenges they faced. Together, these results inform recommendations for enhancing task engagement and instruction adherence, offering practical insights for future dataset development. Finally, to support diverse use cases, we release three versions of the AmbiSent dataset, each accompanied by detailed annotator information and label distributions to better accommodate different user needs.
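As an illustration of the kind of inter-annotator agreement comparison described above, the minimal sketch below computes Fleiss' kappa over a three-way sentiment label set for two toy instruction conditions. The label names, number of annotators per sentence, and example ratings are assumptions made for the sketch only; they are not drawn from the AmbiSent data, and the paper does not specify which agreement coefficient it reports.

```python
from collections import Counter

# Hedged sketch: Fleiss' kappa as one possible way to quantify the
# inter-annotator agreement compared in the abstract. The label set and
# the toy ratings below are illustrative, not taken from AmbiSent.
LABELS = ["negative", "neutral", "positive"]

def fleiss_kappa(ratings_per_item):
    """ratings_per_item: list of lists, each inner list holding the labels
    that the annotators assigned to one sentence (same count per item)."""
    n_items = len(ratings_per_item)
    n_raters = len(ratings_per_item[0])

    # Per-item observed agreement P_i and running label totals.
    p_i = []
    label_totals = Counter()
    for ratings in ratings_per_item:
        counts = Counter(ratings)
        label_totals.update(counts)
        agree = sum(c * c for c in counts.values()) - n_raters
        p_i.append(agree / (n_raters * (n_raters - 1)))

    p_bar = sum(p_i) / n_items                                   # mean observed agreement
    total = n_items * n_raters
    p_e = sum((label_totals[l] / total) ** 2 for l in LABELS)    # expected chance agreement
    return (p_bar - p_e) / (1 - p_e)

# Toy comparison of the two instruction conditions (illustrative numbers only).
minimal = [["positive", "neutral", "positive"],
           ["negative", "neutral", "neutral"],
           ["positive", "positive", "negative"]]
detailed = [["positive", "positive", "positive"],
            ["negative", "neutral", "negative"],
            ["positive", "positive", "neutral"]]

print(f"minimal instructions:  kappa = {fleiss_kappa(minimal):.3f}")
print(f"detailed instructions: kappa = {fleiss_kappa(detailed):.3f}")
```

With these made-up ratings, the detailed-instruction condition yields a higher kappa than the minimal one; the paper's actual finding is that such gains are not guaranteed and depend on engagement and instruction design.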