Analysis of Short Texts Using Intelligent Clustering Methods

Abstract

This article presents a comprehensive review of short text clustering using state-of-the-art methods: Bidirectional Encoder Representations from Transformers (BERT), Term Frequency-Inverse Document Frequency (TF-IDF), and the novel hybrid method Latent Dirichlet Allocation+BERT+Autoencoder (LDA+BERT+AE). The article begins by outlining the theoretical foundations of each technique, together with their merits and limitations. BERT is noted for its ability to capture word dependencies in text, while TF-IDF is valued for its effectiveness in assessing term importance. The experimental section compares the efficacy of these methods in clustering short texts, with a specific focus on the hybrid LDA+BERT+AE approach. A detailed examination of the LDA-BERT model's training and validation loss over 200 epochs shows that the loss values start above 1.2, decrease quickly to around 0.8 within the first 25 epochs, and eventually stabilize at approximately 0.4. The close alignment of the two curves indicates effective learning and generalization, with minimal overfitting. The study demonstrates that the hybrid LDA+BERT+AE method significantly improves text clustering quality compared with the individual methods. Based on these findings, the study offers recommendations on selecting and applying clustering methods for different kinds of short texts and natural language processing tasks. Applications of these methods in industrial and educational settings, where effective text processing and categorization are critical, are also addressed. The study concludes by emphasizing the importance of holistic handling of short texts for deeper semantic comprehension and effective information retrieval.
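
The hybrid pipeline summarized in the abstract, fusing LDA topic proportions with BERT embeddings and compressing the result with an autoencoder before clustering, can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the article's implementation: the encoder model name, the weighting factor `gamma`, the layer sizes, and the toy corpus are illustrative choices.

```python
# Minimal sketch of an LDA+BERT+AE clustering pipeline (illustrative, not the
# article's exact implementation). Assumes scikit-learn for LDA and KMeans,
# sentence-transformers for BERT embeddings, and Keras for the autoencoder.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer
from tensorflow import keras

texts = [
    "bank raises interest rates",
    "central bank policy update",
    "new smartphone released today",
    "latest phone camera review",
]

# 1) LDA topic proportions from a bag-of-words view of the corpus.
bow = CountVectorizer(stop_words="english").fit_transform(texts)
lda_vecs = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(bow)

# 2) Contextual sentence embeddings from a BERT-family encoder.
bert_vecs = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)

# 3) Concatenate both views; gamma (hypothetical value) balances topic vs. contextual signal.
gamma = 15
fused = np.hstack([lda_vecs * gamma, bert_vecs])

# 4) Autoencoder compresses the fused vectors into a low-dimensional latent space;
#    training and validation losses can be tracked over the epochs, as in the abstract.
dim_in, dim_latent = fused.shape[1], 32
inp = keras.Input(shape=(dim_in,))
latent = keras.layers.Dense(dim_latent, activation="relu")(inp)
out = keras.layers.Dense(dim_in)(latent)
autoencoder = keras.Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(fused, fused, epochs=200, batch_size=2, validation_split=0.25, verbose=0)

# 5) Cluster the latent representations.
encoder = keras.Model(inp, latent)
latent_vecs = encoder.predict(fused, verbose=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(latent_vecs)
print(labels)
```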
