A Hybrid TF–IDF and SBERT Approach for Enhanced Text Classification Performance

Abstract

Automated text-similarity and plagiarism detection remain essential for academic integrity and content moderation. This paper presents a reproducible study that evaluates classical TF-IDF feature representations combined with standard classifiers (Logistic Regression, Random Forest, Multinomial Naïve Bayes, and linear Support Vector Machine) and introduces a hybrid TF-IDF + Sentence-BERT (SBERT) feature fusion to address paraphrase-driven cases. Experiments using an 80/20 stratified split on a labeled pairwise corpus show that a linear SVM trained on TF-IDF provides a strong baseline (F1 = 0.871). The proposed hybrid, which reduces TF-IDF features via TruncatedSVD and concatenates them with SBERT embeddings, improves semantic detection and achieves F1 = 0.903 in our controlled experiments. We include implementation details, hyperparameters, an ablation study, explainability examples (SHAP), and reproducibility notes. The results indicate that hybrid sparse + dense feature pipelines can produce substantial gains with modest additional computation compared to full Transformer fine-tuning.
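
The following is a minimal sketch of the hybrid sparse + dense feature fusion described above, assuming scikit-learn and sentence-transformers. The specific SBERT checkpoint ("all-MiniLM-L6-v2"), the SVD dimensionality, the classifier hyperparameters, and the way text pairs are combined are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch of a TF-IDF (TruncatedSVD-reduced) + SBERT feature fusion pipeline.
# Assumes scikit-learn, sentence-transformers, and numpy are installed.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC
from sentence_transformers import SentenceTransformer

# Components; all dimensions and hyperparameters below are assumptions for illustration.
tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
svd = TruncatedSVD(n_components=300, random_state=42)
sbert = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
clf = LinearSVC(C=1.0)


def build_features(texts, fit=False):
    """Reduce TF-IDF vectors with TruncatedSVD and concatenate SBERT embeddings."""
    if fit:
        sparse = tfidf.fit_transform(texts)
        reduced = svd.fit_transform(sparse)
    else:
        reduced = svd.transform(tfidf.transform(texts))
    dense = sbert.encode(list(texts), convert_to_numpy=True)
    return np.hstack([reduced, dense])


# Usage sketch on a labeled corpus (train_texts, train_labels, test_texts are placeholders;
# how the two texts of each pair are merged into one input is an assumption here):
# X_train = build_features(train_texts, fit=True)
# clf.fit(X_train, train_labels)
# X_test = build_features(test_texts)
# preds = clf.predict(X_test)
```

In this kind of fusion, reducing the sparse TF-IDF matrix with TruncatedSVD keeps the combined feature vector compact enough for a linear SVM while the SBERT component supplies the semantic signal needed for paraphrase-driven cases.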