Unsupervised text clustering with large language models

Abstract

Extracting actionable insights from large volumes of unstructured text (such as user comments, online forums, and customer feedback) remains a significant challenge in Natural Language Processing, largely because traditional clustering methods lack interpretability. Conventional approaches rely on high-dimensional vector representations that group documents around numerical centroids, often requiring post-hoc analysis to decipher the semantic meaning of each cluster. The advent of Large Language Models (LLMs) has introduced new paradigms for understanding and processing textual data. To close this transparency gap, we introduce a novel unsupervised clustering algorithm that leverages the reasoning capabilities of LLMs to generate clusters defined by natural language summaries rather than abstract vectors. The method iterates through the dataset to dynamically identify and describe latent topics, addressing the problem of an unknown cluster count (k) while producing immediately interpretable groupings. By shifting from geometric proximity to semantic reasoning, this embedding-free framework facilitates the direct discovery of meaningful patterns within noisy, real-world data and offers a powerful tool for mining unstructured digital discussions for coherent, actionable intelligence. We evaluated the algorithm on a dataset of submissions from three cancer-related subreddits, successfully identifying specific patient informational needs without pre-existing labels, and assessed its performance using a modified bootstrapping mechanism that measures clustering stability.
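The iterative, summary-based loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: all names are hypothetical, and `ask_llm` stands in for a real LLM call (it is stubbed here with a trivial word-overlap matcher so the sketch is runnable).

```python
def ask_llm(document, cluster_summaries):
    """Return the index of the matching cluster summary, or -1 for a new topic.

    A real implementation would prompt an LLM with the document and the
    current natural-language summaries; this stub only checks word overlap.
    """
    doc_words = set(document.lower().split())
    for i, summary in enumerate(cluster_summaries):
        if doc_words & set(summary.lower().split()):
            return i
    return -1

def cluster_by_summary(documents):
    """Assign each document to an existing natural-language cluster summary,
    or open a new cluster when none fits. The number of clusters k emerges
    from the data instead of being fixed in advance."""
    summaries, clusters = [], []
    for doc in documents:
        idx = ask_llm(doc, summaries)
        if idx == -1:
            summaries.append(doc)   # a real LLM would write a fresh summary here
            clusters.append([doc])
        else:
            clusters[idx].append(doc)
    return summaries, clusters
```

Because clusters are described in natural language from the start, no post-hoc labeling of numeric centroids is needed.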
Our results demonstrate that the proposed LLM-based approach achieves a high stability score (σ = 0.174) compared to a range of traditional clustering algorithms (0.008 < σ < 0.385): k-means, agglomerative hierarchical clustering, and spectral clustering, each applied to both TF-IDF and LLM-based embeddings. Embedding-based clustering significantly outperformed TF-IDF-based clustering across all three algorithms. This indicates that directly using LLMs for clustering can produce robust and meaningful groupings of text data without the need for embeddings.
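The abstract does not detail the modified bootstrapping mechanism, but a generic bootstrap stability estimate works along these lines: recluster resampled subsets, score each against the full clustering (here via pairwise co-assignment agreement, an assumption; the paper may use a different score), and report the spread σ of those scores. A minimal stdlib sketch:

```python
import random
from statistics import pstdev

def pair_agreement(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree
    (same cluster in both, or different clusters in both)."""
    n = len(labels_a)
    agree = total = 0
    for i in range(n):
        for j in range(i + 1, n):
            agree += (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
            total += 1
    return agree / total if total else 1.0

def bootstrap_stability(data, cluster_fn, n_boot=20, seed=0):
    """Illustrative stability estimate: recluster random subsamples and
    return the standard deviation of their agreement with the full
    clustering. Lower values mean assignments vary less across resamples."""
    rng = random.Random(seed)
    base = cluster_fn(data)
    scores = []
    for _ in range(n_boot):
        idx = sorted(rng.sample(range(len(data)), k=max(2, len(data) // 2)))
        boot = cluster_fn([data[i] for i in idx])
        scores.append(pair_agreement([base[i] for i in idx], boot))
    return pstdev(scores)
```

The same harness can wrap any clustering function (k-means on TF-IDF or embedding vectors, or the LLM-based method), which is what makes stability a common yardstick across otherwise incomparable algorithms.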
