Intrinsic Thematic Separability in Sentence-Transformer Embeddings: A Controlled Geometric Study from Synthetic to Real-World Text

Miguel Pavón

Abstract

We investigate whether sentence-transformer embeddings encode sufficient topical information to recover imposed thematic structure in corpora of varying origin and difficulty. A synthetic corpus of 8,000 ChatGPT-generated texts across four domains achieves perfect macro-group recovery in 768D ($K$-Means ARI\,=\,1.000), demonstrating that separability is intrinsic to the embeddings and independent of UMAP projection. UMAP\,+\,HDBSCAN provides a complementary micro-topic decomposition into 66 thematically pure sub-clusters. Perfect recovery generalises across embedding architectures (BGE-large 1024D) and across LLM sources (ChatGPT\,+\,Claude: topic ARI\,=\,0.994, source ARI\,$\approx 0$). Benchmarking against AG News, 20 Newsgroups, Reuters-21578 and a human-authored Reddit corpus establishes an empirical performance gradient (ARI from 1.000 to 0.40) that tracks embedding geometry, though bootstrap confidence intervals remain wide at $n = 6$ conditions. A TF-IDF\,+\,SVD sparse baseline matches the synthetic ceiling but underperforms embeddings by $0.13$ to $0.43$ ARI on real-world data, confirming that the embedding advantage is specific to naturalistic text. Sub-cluster coherence on AG News is validated via NPMI: all 394 sub-clusters are significantly more coherent than chance (Cohen's $d = 6.03$). A stylistic-vs.-semantic decomposition confirms that nine surface-level features do not explain topical separability; unsupervised clustering does not recover source identity. These results characterise the conditions under which the embedding\,+\,clustering pipeline succeeds, positioning synthetic results as an upper bound and providing practitioners with geometric diagnostics for post-hoc interpretation of clustering performance.
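The abstract reports its results as adjusted Rand index (ARI) values, which compare a clustering against a reference labelling while correcting for chance agreement. The following is a minimal, standard-library sketch of that metric, not the author's implementation; the function name `adjusted_rand_index` and the toy labels are illustrative assumptions and are not taken from the preprint.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two flat labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: agreement is trivial

    # Count item pairs that share a cell, a reference class, or a cluster.
    cells = Counter(zip(labels_true, labels_pred))
    same_cell = sum(comb(c, 2) for c in cells.values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = same_true * same_pred / total_pairs  # agreement expected by chance
    maximum = (same_true + same_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single group
    return (same_cell - expected) / (maximum - expected)


# Hypothetical toy data: eight texts, four topics, two generating sources.
topics = ["sci", "sci", "law", "law", "art", "art", "med", "med"]
sources = ["gpt", "claude"] * 4
clusters = [2, 2, 0, 0, 1, 1, 3, 3]  # cluster ids are arbitrary names

topic_ari = adjusted_rand_index(topics, clusters)
source_ari = adjusted_rand_index(sources, clusters)
print(topic_ari, source_ari)
```

In this toy case the clusters reproduce the topic partition exactly, so the topic ARI is 1.0 regardless of how cluster ids are numbered, while the same clusters carry no information about the source, so the source ARI falls at or below zero. This mirrors the pattern the abstract describes for the mixed ChatGPT and Claude corpus.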

Version published to 10.21203/rs.3.rs-8976073/v1 on Research Square
Mar 6, 2026

Related articles

Do Transformers Always Win? An Empirical Study of Semantic Embeddings for Short-Text E-commerce Reviews

This article has 4 authors:
1. Longying Lai
2. Zhiyuan Cheng
3. Kai Cheng
4. Xiaoxi Qi
This article has no evaluations. Latest version: Mar 20, 2026

Attention Amplification in Multilingual LLMs: Why Script Representation Matters

This article has 3 authors:
1. Yash Mishra
2. Suyash Mishra
3. Kedarnath Senapati
This article has no evaluations. Latest version: Feb 25, 2026

MINT: A Multilingual Indic Neural Transformer for Abstractive Summarization Under Memory Constraints

This article has 1 author:
1. Sameer Kumar Singh
This article has no evaluations. Latest version: Apr 14, 2026
