Intrinsic Thematic Separability in Sentence-Transformer Embeddings: A Controlled Geometric Study from Synthetic to Real-World Text
Abstract
We investigate whether sentence-transformer embeddings encode sufficient topical information to recover imposed thematic structure in corpora of varying origin and difficulty. A synthetic corpus of 8,000 ChatGPT-generated texts across four domains achieves perfect macro-group recovery in the original 768D embedding space ($K$-Means ARI\,=\,1.000), demonstrating that separability is intrinsic to the embeddings and independent of UMAP projection. UMAP\,+\,HDBSCAN provides a complementary micro-topic decomposition into 66 thematically pure sub-clusters. Perfect recovery generalises across embedding architectures (BGE-large 1024D), and near-perfect recovery holds across LLM sources (ChatGPT\,+\,Claude: topic ARI\,=\,0.994, source ARI\,$\approx 0$). Benchmarking against AG News, 20 Newsgroups, Reuters-21578, and a human-authored Reddit corpus establishes an empirical performance gradient (ARI from 1.000 to 0.40) that tracks embedding geometry, though bootstrap confidence intervals remain wide at $n = 6$ conditions. A TF-IDF\,+\,SVD sparse baseline matches the synthetic ceiling but trails embeddings by $0.13$ to $0.43$ ARI on real-world data, confirming that the embedding advantage is specific to naturalistic text. Sub-cluster coherence on AG News is validated via NPMI: all 394 sub-clusters are significantly more coherent than chance (Cohen's $d = 6.03$). A stylistic-vs.-semantic decomposition confirms that nine surface-level features do not explain topical separability, and unsupervised clustering does not recover source identity. These results characterise the conditions under which the embedding\,+\,clustering pipeline succeeds, positioning synthetic results as an upper bound and providing practitioners with geometric diagnostics for post-hoc interpretation of clustering performance.
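Since every headline figure in the abstract is an Adjusted Rand Index (ARI), a minimal sketch of the metric may help in reading values such as topic ARI\,=\,1.000 and source ARI\,$\approx 0$. The code below is an illustration only, not the authors' implementation: the function name `adjusted_rand_index` and the toy labels are assumptions, and in practice a library routine such as scikit-learn's `adjusted_rand_score` would normally be used.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two flat partitions of the same items.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have equal length")
    n = len(labels_true)

    # Contingency counts n_ij plus the row and column marginals.
    cell_counts = Counter(zip(labels_true, labels_pred))
    row_counts = Counter(labels_true)
    col_counts = Counter(labels_pred)

    # Number of item pairs that fall in the same cell / row / column.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_rows = sum(comb(c, 2) for c in row_counts.values())
    sum_cols = sum(comb(c, 2) for c in col_counts.values())
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = sum_rows * sum_cols / total_pairs
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        # Degenerate case, e.g. both partitions are a single cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy data (assumed, not from the paper): 4 topics x 100 texts,
# with two sources alternating evenly inside every topic.
topics = [t for t in range(4) for _ in range(100)]
sources = [i % 2 for i in range(len(topics))]

# A clustering that recovers the topics under arbitrary cluster names.
renamed = {0: "c", 1: "a", 2: "d", 3: "b"}
clusters = [renamed[t] for t in topics]

topic_ari = adjusted_rand_index(topics, clusters)
source_ari = adjusted_rand_index(sources, clusters)
print(f"topic ARI = {topic_ari:.3f}, source ARI = {source_ari:.3f}")
```

Two properties matter for interpreting the abstract. ARI is invariant to how clusters are named, so perfect recovery scores 1 regardless of label permutation. It is also corrected for chance, so a clustering that is balanced across sources scores close to 0 against source labels, which is the pattern reported for the ChatGPT\,+\,Claude condition.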