Document embeddings for long texts from Transformers and Autoencoders

Abstract

In document categorization, the escalating volume of digital data underscores the critical need for advanced techniques that generate semantically rich document embeddings, particularly for topic modelling applications. This article presents a novel methodology that leverages the capabilities of Sentence-BERT (SBERT) for sentence embedding, aiming to address the challenges associated with embedding longer texts. The process begins by decomposing documents into individual sentences, each of which is processed by the SBERT model to produce a sentence embedding. A clustering algorithm then selects representative sentence embeddings for each document, and these serve as the foundation for constructing a comprehensive document embedding through an autoencoder. This research explores the autoencoder approach exclusively, setting aside alternative methods for brevity and focus. The autoencoder strategy showed promising results, outperforming the baseline model, Doc2Vec, in specific scenarios; this highlights the method's potential effectiveness in document embedding tasks where understanding long context is important. Moreover, when the Transformer Autoencoder was fine-tuned on the test data, it achieved performance comparable to that of Doc2Vec, underscoring the viability of the approach. The findings suggest considerable room for improvement in several aspects of the methodology, including the clustering process and the encoder architecture. Overall, the results point to the potential of autoencoders for advancing the state of the art in document embedding, setting the stage for further refinement and exploration.
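
The following is a minimal sketch of the pipeline described above, assuming the sentence-transformers, scikit-learn, and PyTorch libraries. The model name (all-MiniLM-L6-v2), the naive period-based sentence splitter, the number of clusters, and the feed-forward autoencoder standing in for the article's Transformer Autoencoder are all illustrative assumptions, not the authors' exact configuration.

    # Illustrative sketch: SBERT sentence embeddings -> clustering to pick
    # representative sentences -> autoencoder whose bottleneck serves as the
    # document embedding. Sizes and model choices are assumptions.
    import numpy as np
    import torch
    import torch.nn as nn
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    N_CLUSTERS = 4   # representative sentences kept per document (assumption)
    EMB_DIM = 384    # dimensionality of all-MiniLM-L6-v2 sentence embeddings

    sbert = SentenceTransformer("all-MiniLM-L6-v2")

    def representative_embeddings(document: str) -> np.ndarray:
        """Embed each sentence, cluster, keep one embedding per cluster."""
        sentences = [s.strip() for s in document.split(".") if s.strip()]
        embeddings = sbert.encode(sentences)  # shape: (n_sentences, EMB_DIM)
        k = min(N_CLUSTERS, len(sentences))
        kmeans = KMeans(n_clusters=k, n_init=10).fit(embeddings)
        reps = []
        for c in range(k):
            # Keep the member embedding closest to the cluster centroid.
            members = embeddings[kmeans.labels_ == c]
            dists = np.linalg.norm(members - kmeans.cluster_centers_[c], axis=1)
            reps.append(members[np.argmin(dists)])
        # Zero-pad so every document yields a fixed-size autoencoder input.
        while len(reps) < N_CLUSTERS:
            reps.append(np.zeros(EMB_DIM, dtype=np.float32))
        return np.stack(reps).astype(np.float32)

    class DocAutoencoder(nn.Module):
        """Compress the concatenated representative sentence embeddings into a
        single document embedding (the bottleneck), then reconstruct them."""
        def __init__(self, in_dim=N_CLUSTERS * EMB_DIM, doc_dim=256):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, doc_dim))
            self.decoder = nn.Sequential(
                nn.Linear(doc_dim, 512), nn.ReLU(), nn.Linear(512, in_dim))
        def forward(self, x):
            z = self.encoder(x)            # z is the document embedding
            return self.decoder(z), z

    # Usage: train with a reconstruction loss, then keep the bottleneck z.
    docs = ["First document. It has several sentences. They vary in topic.",
            "Second document. Shorter, but treated the same way."]
    X = torch.tensor(np.stack(
        [representative_embeddings(d).ravel() for d in docs]))
    model = DocAutoencoder()
    recon, doc_embs = model(X)
    loss = nn.functional.mse_loss(recon, X)  # minimise this during training
    print(doc_embs.shape)  # torch.Size([2, 256]): one embedding per document

After training, the decoder is discarded and the encoder's bottleneck output is used as the document embedding, which is what makes the reconstruction objective a plausible stand-in for the article's approach.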
