Intrinsic Thematic Separability in Sentence-Transformer Embeddings: A Controlled Geometric Study from Synthetic to Real-World Text

Abstract

We investigate whether sentence-transformer embeddings encode sufficient topical information to recover imposed thematic structure in corpora of varying origin and difficulty. A synthetic corpus of 8,000 ChatGPT-generated texts across four domains achieves perfect macro-group recovery in the 768-dimensional embedding space ($K$-Means ARI $= 1.000$), demonstrating that separability is intrinsic to the embeddings and independent of UMAP projection. UMAP + HDBSCAN provides a complementary micro-topic decomposition into 66 thematically pure sub-clusters. Perfect recovery generalises across embedding architectures (BGE-large, 1024 dimensions), and near-perfect recovery holds across LLM sources (ChatGPT + Claude: topic ARI $= 0.994$, source ARI $\approx 0$). Benchmarking against AG News, 20 Newsgroups, Reuters-21578 and a human-authored Reddit corpus establishes an empirical performance gradient (ARI from 1.000 down to 0.40) that tracks embedding geometry, though bootstrap confidence intervals remain wide with only $n = 6$ conditions. A sparse TF-IDF + SVD baseline matches the synthetic ceiling but trails the embeddings by 0.13 to 0.43 ARI on real-world data, confirming that the embedding advantage is specific to naturalistic text. Sub-cluster coherence on AG News is validated via NPMI: all 394 sub-clusters are significantly more coherent than chance (Cohen's $d = 6.03$). A stylistic-vs.-semantic decomposition confirms that nine surface-level features do not explain topical separability, and unsupervised clustering does not recover source identity. These results characterise the conditions under which the embedding + clustering pipeline succeeds, positioning the synthetic results as an upper bound and providing practitioners with geometric diagnostics for post-hoc interpretation of clustering performance.
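
Every headline figure above is an adjusted Rand index (ARI), so it may help to see how that score is computed from two label assignments. The following is a minimal sketch using the standard pair-counting definition. It is not the authors' code; the function name `adjusted_rand_index` and the toy labels are illustrative assumptions.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI between two labelings of the same items.

    Hypothetical helper for illustration; not taken from the preprint.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")

    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        # Fewer than two items: the partitions agree trivially.
        return 1.0

    # Contingency counts: items sharing a (true, predicted) label pair.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of item pairs placed together under each view.
    together_both = sum(comb(c, 2) for c in cells.values())
    together_true = sum(comb(c, 2) for c in rows.values())
    together_pred = sum(comb(c, 2) for c in cols.values())

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = together_true * together_pred / total_pairs
    maximum = 0.5 * (together_true + together_pred)
    if maximum == expected:
        # Degenerate case, e.g. both labelings are a single cluster.
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Toy labels (assumed for illustration): cluster ids are arbitrary,
# so a pure relabeling of the true groups still counts as perfect recovery.
topics = [0, 0, 1, 1]
clusters = [1, 1, 0, 0]
score = adjusted_rand_index(topics, clusters)
```

Because the score depends only on which items are grouped together, permuting the cluster ids leaves it unchanged. A score near 0, such as the source ARI reported above, means the clustering agrees with those labels no better than chance.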
