Advancing Sentiment Analysis in Gujarati: Performance Enhancement through a Hybrid Annotation Framework

Abstract

Sentiment analysis in low-resource languages such as Gujarati faces considerable difficulties because of the absence of large, annotated datasets and restricted linguistic resources. Unlike prior Gujarati sentiment studies limited to small datasets or rule-based methods, we propose an innovative hybrid annotation framework that integrates rule-based lexicon methods with semi-supervised pseudo-labelling and confidence-based filtering to generate a high-quality sentiment dataset for Gujarati news headlines.

In the initial phase, a custom sentiment lexicon was developed incorporating Gujarati words, synonyms, and antonyms. This rule-based system annotated over 21,000 headlines and achieved a baseline accuracy of 72.75% using Random Forest with N-Gram features. To further improve performance and scale the dataset, we introduced a semi-supervised pipeline: 11,625 headlines were manually annotated, a baseline model was trained on them, and that model was applied to pseudo-label over 93,000 unlabelled headlines. Labels with a confidence score of 90% or higher were retained, resulting in a final hybrid dataset of approximately 105,000 headlines.

Extensive experiments with machine learning models, including Logistic Regression, Naive Bayes, SVM, Random Forest, Bagging, and AdaBoost, showed that Random Forest with TF-IDF features achieved the highest accuracy of 88.54%. Cross-validation against human-labelled samples confirmed a pseudo-label accuracy exceeding 90%, validating the framework’s reliability.

This work not only delivers a significant performance boost for Gujarati sentiment analysis but also provides a replicable annotation methodology for other low-resource languages. Future work will explore deep learning and transformer-based architectures such as mBERT and IndicBERT to further improve model understanding and performance.
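
The semi-supervised step described above can be made concrete with a short sketch. The following is a minimal illustration, assuming a scikit-learn TF-IDF + Random Forest pipeline as named in the abstract; the variable names (seed_texts, unlabeled_texts) and the placeholder data are hypothetical stand-ins for the paper’s 11,625 manually annotated and 93,000+ unlabelled headlines, not the authors’ actual implementation.

# Minimal sketch of confidence-based pseudo-labelling with scikit-learn.
# Placeholder data below is hypothetical; the paper uses 11,625 manually
# annotated and 93,000+ unlabelled Gujarati news headlines.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

CONFIDENCE_THRESHOLD = 0.90  # retain pseudo-labels predicted at >= 90% confidence

# Hypothetical seed set (manually annotated) and unlabelled pool.
seed_texts = ["clearly positive headline", "clearly negative headline", "plain neutral headline"]
seed_labels = ["positive", "negative", "neutral"]
unlabeled_texts = ["another unlabelled headline", "one more unlabelled headline"]

# Baseline model: TF-IDF features feeding a Random Forest classifier.
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=200, random_state=42),
)
model.fit(seed_texts, seed_labels)

# Pseudo-label the unlabelled pool and keep only confident predictions.
probs = model.predict_proba(unlabeled_texts)
confidence = probs.max(axis=1)
pseudo_labels = model.classes_[probs.argmax(axis=1)]
keep = confidence >= CONFIDENCE_THRESHOLD

# Hybrid dataset = manual seed set + confidently pseudo-labelled headlines.
hybrid_texts = list(seed_texts) + [t for t, k in zip(unlabeled_texts, keep) if k]
hybrid_labels = list(seed_labels) + list(pseudo_labels[keep])

# Retrain on the expanded hybrid dataset.
model.fit(hybrid_texts, hybrid_labels)

The 0.90 threshold mirrors the confidence cut-off reported in the abstract; raising it trades dataset size for label quality, which is the central design choice of the filtering step.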
