Do Transformers Always Win? An Empirical Study of Semantic Embeddings for Short-Text E-commerce Reviews


Abstract

While transformer-based embeddings like Sentence-BERT have become the de facto standard for text representation, their performance on short, noisy industrial texts has received limited empirical scrutiny. This study compares TF-IDF, Word2Vec, and Sentence-BERT on clustering 100,000 Amazon product reviews. Our findings challenge prevailing assumptions: Word2Vec achieves superior clustering performance with a Silhouette score of 0.1828 versus SBERT's 0.0401, a 356% improvement. We attribute this to insufficient text length for contextual modeling, domain mismatch between pre-training corpora and e-commerce reviews, and destabilizing variance in cluster centroids produced by contextualized representations. For topic modeling, Non-negative Matrix Factorization with a Count Vectorizer achieves the highest coherence (Cv = 0.5836), while Latent Dirichlet Allocation produces the most balanced topic distributions. These results suggest that classical methods offer compelling cost-performance advantages for short industrial texts.
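The comparison pipeline implied by the abstract (embed each review, cluster with k-means, score with the Silhouette coefficient) can be sketched as follows. This is a minimal illustration, not the authors' code: the toy corpus, cluster count, pooling strategy for Word2Vec, and the SBERT checkpoint ("all-MiniLM-L6-v2") are all assumptions, and the paper's actual hyperparameters may differ.

```python
# Minimal sketch of the embedding comparison described in the abstract.
# Assumptions (not from the paper): toy corpus, k=2 clusters, mean-pooled
# Word2Vec sentence vectors, and the "all-MiniLM-L6-v2" SBERT checkpoint.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
from gensim.models import Word2Vec
from sentence_transformers import SentenceTransformer

reviews = [
    "Great battery life, fast shipping.",
    "Product broke after two days, very disappointed.",
    "Exactly as described, would buy again.",
]  # stand-in for the 100,000 Amazon reviews

def cluster_quality(X, k=2, seed=42):
    """Cluster embeddings with k-means and return the Silhouette score."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    return silhouette_score(X, labels)

# 1) TF-IDF: sparse bag-of-words weighted by inverse document frequency.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(reviews)

# 2) Word2Vec: average the word vectors in each review (a common
#    sentence-embedding heuristic; the paper's exact pooling may differ).
tokens = [r.lower().split() for r in reviews]
w2v = Word2Vec(tokens, vector_size=100, min_count=1, seed=42)
w2v_emb = np.array([
    np.mean([w2v.wv[t] for t in sent if t in w2v.wv], axis=0)
    for sent in tokens
])

# 3) Sentence-BERT: contextual sentence embeddings from a pretrained model.
sbert = SentenceTransformer("all-MiniLM-L6-v2")
sbert_emb = sbert.encode(reviews)

for name, X in [("TF-IDF", tfidf.toarray()),
                ("Word2Vec", w2v_emb),
                ("SBERT", sbert_emb)]:
    print(f"{name}: silhouette = {cluster_quality(X):.4f}")
```

On a small toy corpus the scores are not meaningful; the point is that all three representations feed into the same k-means plus Silhouette evaluation, so the comparison isolates the embedding choice.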
