Leakage-Safe Diagnostic Sentiment Analysis in Noisy Social Media Streams: Multi-Task Sequence Modelling, Focal-Loss Reasoning, and Transformer Benchmarking

Thompson Ikechukwu
Chika Innocent

Abstract

Short-form social media streams provide high-volume evidence of public affect, yet sentiment models trained on such corpora are vulnerable to linguistic noise, class imbalance, and methodological leakage when diagnostic annotations are used as input features. This study reformulates airline tweet sentiment analysis as a leakage-safe diagnostic learning problem. Instead of appending negative-reason labels to tweet text, a shared sequence encoder is trained with a primary three-class sentiment head and a conditional auxiliary reason head whose loss is activated only for true negative tweets. The corrected experimental design evaluates TF-IDF, recurrent, bidirectional recurrent, multi-task, focal-loss, and DistilBERT baselines on a stratified 70/15/15 split of 14,640 cleaned tweets. The strongest classifier is DistilBERT, achieving 83.33% accuracy, 78.20% macro F1, and 83.04% weighted F1 on the held-out test set of 2,196 tweets. The proposed focal-loss diagnostic model, M4-FL, achieves 76.50% accuracy and 69.55% macro F1 while remaining substantially lighter: 1.25 million parameters versus 66.96 million for DistilBERT, and 0.61 ms/tweet versus 85.59 ms/tweet on the same CPU evaluation environment, a 139.4× inference speedup. A paired McNemar test [22] confirms the DistilBERT–M4-FL difference is statistically significant (p = 1.64 × 10⁻¹³). These results show that contextual transformers dominate pure classification accuracy, while leakage-safe multi-task recurrent models provide a computationally efficient route toward auditable diagnostic reasoning in enterprise text streams. The contribution is a reproducible correction of target leakage and an empirical clarification of the trade-off between classification strength, diagnostic interpretability, and deployment efficiency.
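The conditional auxiliary loss described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class index `NEGATIVE`, the focusing parameter `gamma`, the weight `aux_weight`, and the function names are all assumptions made for illustration, and the abstract does not state the values used in M4-FL. The sketch uses the standard focal loss, FL(p_t) = -(1 - p_t)^gamma · log(p_t), for the sentiment head and plain cross-entropy (focal loss with gamma = 0) for the reason head, averaged only over tweets whose true sentiment is negative.

```python
import math

NEGATIVE = 0  # assumed index of the negative sentiment class (hypothetical)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one example; gamma=0 reduces to cross-entropy."""
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def multitask_loss(sent_logits, reason_logits, sent_labels, reason_labels,
                   gamma=2.0, aux_weight=0.5):
    """Primary focal loss plus a reason loss masked to true-negative tweets.

    gamma and aux_weight are illustrative defaults, not values from the paper.
    """
    n = len(sent_labels)
    primary = sum(
        focal_loss(l, y, gamma) for l, y in zip(sent_logits, sent_labels)
    ) / n

    # The reason head contributes only where the TRUE sentiment is negative,
    # so reason labels never enter the model as input features.
    neg = [i for i, y in enumerate(sent_labels) if y == NEGATIVE]
    if neg:
        aux = sum(
            focal_loss(reason_logits[i], reason_labels[i], gamma=0.0)
            for i in neg
        ) / len(neg)
    else:
        aux = 0.0  # no negative tweets in this batch: auxiliary term is off
    return primary + aux_weight * aux
```

Because the mask is built from the true sentiment labels at training time, the reason-head outputs for neutral or positive tweets have no effect on the loss, and the reason label for those tweets can be left undefined (for example `None`).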

Version published to 10.21203/rs.3.rs-9888776/v1 on Research Square
Jun 3, 2026
