Unsupervised text clustering with large language models

Abstract

Extracting actionable insights from large volumes of unstructured text (such as user comments, online forums, and customer feedback) remains a significant challenge in Natural Language Processing, largely because traditional clustering methods lack interpretability. Conventional approaches rely on high-dimensional vector representations that group documents around numerical centroids, often requiring post-hoc analysis to decipher the semantic meaning of each cluster. The advent of Large Language Models (LLMs) has introduced new paradigms for understanding and processing textual data. To close this transparency gap, we introduce a novel unsupervised clustering algorithm that leverages the reasoning capabilities of LLMs to generate clusters defined by natural language summaries rather than abstract vectors. The method iterates through the dataset to dynamically identify and describe latent topics, addressing the problem of an unknown cluster count (k) while producing immediately interpretable groupings. By shifting from geometric proximity to semantic reasoning, this embedding-free framework facilitates the direct discovery of meaningful patterns within noisy, real-world data and offers a powerful tool for mining unstructured digital discussions for coherent, actionable intelligence. We evaluated the algorithm on a dataset of submissions from three cancer-related subreddits, successfully identifying specific patient informational needs without pre-existing labels, and assessed its performance using a modified bootstrapping mechanism that measures clustering stability.
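The iterative, summary-based loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: all names are hypothetical, and `ask_llm` stands in for a real LLM call (it is stubbed here with a trivial word-overlap matcher so the sketch is runnable).

```python
def ask_llm(document, cluster_summaries):
    """Return the index of the matching cluster summary, or -1 for a new topic.

    A real implementation would prompt an LLM with the document and the
    current natural-language summaries; this stub only checks word overlap.
    """
    doc_words = set(document.lower().split())
    for i, summary in enumerate(cluster_summaries):
        if doc_words & set(summary.lower().split()):
            return i
    return -1

def cluster_by_summary(documents):
    """Assign each document to an existing natural-language cluster summary,
    or open a new cluster when none fits. The number of clusters k emerges
    from the data instead of being fixed in advance."""
    summaries, clusters = [], []
    for doc in documents:
        idx = ask_llm(doc, summaries)
        if idx == -1:
            summaries.append(doc)   # a real LLM would write a fresh summary here
            clusters.append([doc])
        else:
            clusters[idx].append(doc)
    return summaries, clusters
```

Because clusters are described in natural language from the start, no post-hoc labeling of numeric centroids is needed.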
Our results demonstrate that the proposed LLM-based approach achieves a high stability score (σ = 0.174) compared to a range of traditional clustering algorithms (0.008 < σ < 0.385): k-means, agglomerative hierarchical clustering, and spectral clustering, each applied to both TF-IDF and LLM-based embeddings. Embedding-based clustering significantly outperformed TF-IDF-based clustering across all three algorithms. This indicates that directly using LLMs for clustering can produce robust and meaningful groupings of text data without the need for embeddings.
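The abstract does not detail the modified bootstrapping mechanism, but a generic bootstrap stability estimate works along these lines: recluster resampled subsets, score each against the full clustering (here via pairwise co-assignment agreement, an assumption; the paper may use a different score), and report the spread σ of those scores. A minimal stdlib sketch:

```python
import random
from statistics import pstdev

def pair_agreement(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree
    (same cluster in both, or different clusters in both)."""
    n = len(labels_a)
    agree = total = 0
    for i in range(n):
        for j in range(i + 1, n):
            agree += (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
            total += 1
    return agree / total if total else 1.0

def bootstrap_stability(data, cluster_fn, n_boot=20, seed=0):
    """Illustrative stability estimate: recluster random subsamples and
    return the standard deviation of their agreement with the full
    clustering. Lower values mean assignments vary less across resamples."""
    rng = random.Random(seed)
    base = cluster_fn(data)
    scores = []
    for _ in range(n_boot):
        idx = sorted(rng.sample(range(len(data)), k=max(2, len(data) // 2)))
        boot = cluster_fn([data[i] for i in idx])
        scores.append(pair_agreement([base[i] for i in idx], boot))
    return pstdev(scores)
```

The same harness can wrap any clustering function (k-means on TF-IDF or embedding vectors, or the LLM-based method), which is what makes stability a common yardstick across otherwise incomparable algorithms.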
