Advancing Sentiment Analysis in Gujarati: Performance Enhancement through a Hybrid Annotation Framework

Abstract

Sentiment analysis in low-resource languages such as Gujarati faces considerable difficulties because of the absence of large, annotated datasets and restricted linguistic resources. Unlike prior Gujarati sentiment studies limited to small datasets or rule-based methods, we propose an innovative hybrid annotation framework that integrates rule-based lexicon methods with semi-supervised pseudo-labelling and confidence-based filtering to generate a high-quality sentiment dataset for Gujarati news headlines.

In the initial phase, a custom sentiment lexicon was developed incorporating Gujarati words, synonyms, and antonyms. This rule-based system annotated over 21,000 headlines and achieved a baseline accuracy of 72.75% using Random Forest with N-Gram features. To further improve performance and scale the dataset, we introduced a semi-supervised pipeline: 11,625 headlines were manually annotated, a baseline model was trained on them, and that model was applied to pseudo-label over 93,000 unlabelled headlines. Labels with a confidence score of 90% or higher were retained, resulting in a final hybrid dataset of approximately 105,000 headlines.

Extensive experiments with machine learning models, including Logistic Regression, Naive Bayes, SVM, Random Forest, Bagging, and AdaBoost, showed that Random Forest with TF-IDF features achieved the highest accuracy of 88.54%. Cross-validation against human-labelled samples confirmed a pseudo-label accuracy exceeding 90%, validating the framework’s reliability.

This work not only delivers a significant performance boost for Gujarati sentiment analysis but also provides a replicable annotation methodology for other low-resource languages. Future work will explore deep learning and transformer-based architectures such as mBERT and IndicBERT to further improve model understanding and performance.
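
The semi-supervised step described above can be made concrete with a short sketch. The following is a minimal illustration, assuming a scikit-learn TF-IDF + Random Forest pipeline as named in the abstract; the variable names (seed_texts, unlabeled_texts) and the placeholder data are hypothetical stand-ins for the paper’s 11,625 manually annotated and 93,000+ unlabelled headlines, not the authors’ actual implementation.

# Minimal sketch of confidence-based pseudo-labelling with scikit-learn.
# Placeholder data below is hypothetical; the paper uses 11,625 manually
# annotated and 93,000+ unlabelled Gujarati news headlines.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

CONFIDENCE_THRESHOLD = 0.90  # retain pseudo-labels predicted at >= 90% confidence

# Hypothetical seed set (manually annotated) and unlabelled pool.
seed_texts = ["clearly positive headline", "clearly negative headline", "plain neutral headline"]
seed_labels = ["positive", "negative", "neutral"]
unlabeled_texts = ["another unlabelled headline", "one more unlabelled headline"]

# Baseline model: TF-IDF features feeding a Random Forest classifier.
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=200, random_state=42),
)
model.fit(seed_texts, seed_labels)

# Pseudo-label the unlabelled pool and keep only confident predictions.
probs = model.predict_proba(unlabeled_texts)
confidence = probs.max(axis=1)
pseudo_labels = model.classes_[probs.argmax(axis=1)]
keep = confidence >= CONFIDENCE_THRESHOLD

# Hybrid dataset = manual seed set + confidently pseudo-labelled headlines.
hybrid_texts = list(seed_texts) + [t for t, k in zip(unlabeled_texts, keep) if k]
hybrid_labels = list(seed_labels) + list(pseudo_labels[keep])

# Retrain on the expanded hybrid dataset.
model.fit(hybrid_texts, hybrid_labels)

The 0.90 threshold mirrors the confidence cut-off reported in the abstract; raising it trades dataset size for label quality, which is the central design choice of the filtering step.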
