Comparative Analysis of Linguistic and Semantic Features for Text Classification Using NLTK and spaCy

Abstract

Text classification remains one of the most common NLP tasks, with applications in spam detection, sentiment analysis, and document categorization. This paper presents a lightweight comparative study of feature extraction techniques using two widely adopted NLP toolkits, NLTK and spaCy, applied to a benchmark dataset from the UCI Machine Learning Repository. By integrating traditional linguistic features (token counts, POS tagging, stopword filtering) with semantic embeddings, we evaluate the effectiveness of each toolkit in building a baseline classification system. Experimental results provide insights into the trade-offs between linguistic preprocessing and modern vectorization methods, offering practical recommendations for small-scale text mining projects.
