A Hybrid TF–IDF and SBERT Approach for Enhanced Text Classification Performance
Abstract
Automated text-similarity and plagiarism detection remain essential for academic integrity and content moderation. This paper presents a reproducible study that evaluates classical TF-IDF feature representations combined with standard classifiers (Logistic Regression, Random Forest, Multinomial Naïve Bayes, and linear Support Vector Machine) and introduces a hybrid TF-IDF + Sentence-BERT (SBERT) feature fusion to address paraphrase-driven cases that lexical features alone miss. Experiments using an 80/20 stratified split on a labeled pairwise corpus show that a linear SVM trained on TF-IDF provides a strong baseline (F1 = 0.871). The proposed hybrid (TF-IDF reduced via TruncatedSVD, concatenated with SBERT embeddings) improves semantic detection and achieves F1 = 0.903 in our controlled experiments. We include implementation details, hyperparameters, an ablation study, explainability examples (SHAP), and reproducibility notes. The results indicate that hybrid sparse + dense feature pipelines can produce substantial gains with modest additional computation compared to full Transformer fine-tuning.
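The fusion step described above can be sketched as follows. This is a minimal, illustrative pipeline, not the authors' released code: the toy document pairs, the `[SEP]` pair-joining convention, and the dimensionalities are invented for the example, and SBERT embeddings are replaced by random vectors so the snippet runs without the `sentence-transformers` dependency (a real run would substitute `SentenceTransformer(...).encode(docs)`).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC

# Toy pairwise corpus: each string is a text pair joined with "[SEP]";
# labels mark paraphrase/plagiarism (1) vs. unrelated (0). Illustrative only.
docs = [
    "the cat sat on the mat [SEP] a cat was sitting on a mat",
    "stocks fell sharply today [SEP] the cat sat on the mat",
    "he finished the report late [SEP] the report was completed late by him",
    "rain is expected tomorrow [SEP] he finished the report late",
]
labels = np.array([1, 0, 1, 0])

# Sparse lexical features: TF-IDF, reduced to a low-dimensional dense
# space with TruncatedSVD (dimensions here are arbitrary for the demo).
tfidf = TfidfVectorizer().fit_transform(docs)
svd = TruncatedSVD(n_components=3, random_state=0)
tfidf_dense = svd.fit_transform(tfidf)

# Stand-in for SBERT sentence embeddings; random vectors keep the
# sketch self-contained (replace with a real SBERT encode() in practice).
rng = np.random.default_rng(0)
sbert_like = rng.normal(size=(len(docs), 8))

# Feature fusion: concatenate reduced TF-IDF with the dense embeddings,
# then train a linear SVM on the fused representation.
fused = np.hstack([tfidf_dense, sbert_like])
clf = LinearSVC().fit(fused, labels)
print(fused.shape)  # (4, 11): 3 SVD components + 8 embedding dimensions
```

In the paper's actual pipeline the SVD rank and embedding model are hyperparameters; the point of the sketch is only that the sparse and dense views are reduced to compatible dense matrices and concatenated before classification.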