Document embeddings for long texts from Transformers and Autoencoders

Abstract

In document categorization, the escalating volume of digital data underscores the critical need for advanced techniques that generate semantically rich document embeddings, particularly for topic modelling applications. This article presents a novel methodology that leverages the capabilities of Sentence-BERT (SBERT) for sentence embedding, aiming to address the challenges associated with embedding longer texts. The process begins by decomposing documents into individual sentences, each of which is processed by the SBERT model to produce a sentence embedding. A clustering algorithm then selects representative sentence embeddings for each document, and these serve as the foundation for constructing a comprehensive document embedding through an autoencoder. This research explores the autoencoder approach exclusively, setting aside alternative methods for brevity and focus. The autoencoder strategy showed promising results, outperforming the baseline model, Doc2Vec, in specific scenarios; this highlights the method's potential effectiveness in document embedding tasks where understanding long context is important. Moreover, when the Transformer Autoencoder was fine-tuned on the test data, it achieved performance comparable to that of Doc2Vec, underscoring the viability of the approach. The findings suggest considerable room for improvement in several aspects of the methodology, including the clustering process and the encoder architecture. Overall, the results point to the potential of autoencoders for advancing the state of the art in document embedding, setting the stage for further refinement and exploration.
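
The following is a minimal sketch of the pipeline described above, assuming the sentence-transformers, scikit-learn, and PyTorch libraries. The model name (all-MiniLM-L6-v2), the naive period-based sentence splitter, the number of clusters, and the feed-forward autoencoder standing in for the article's Transformer Autoencoder are all illustrative assumptions, not the authors' exact configuration.

    # Illustrative sketch: SBERT sentence embeddings -> clustering to pick
    # representative sentences -> autoencoder whose bottleneck serves as the
    # document embedding. Sizes and model choices are assumptions.
    import numpy as np
    import torch
    import torch.nn as nn
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    N_CLUSTERS = 4   # representative sentences kept per document (assumption)
    EMB_DIM = 384    # dimensionality of all-MiniLM-L6-v2 sentence embeddings

    sbert = SentenceTransformer("all-MiniLM-L6-v2")

    def representative_embeddings(document: str) -> np.ndarray:
        """Embed each sentence, cluster, keep one embedding per cluster."""
        sentences = [s.strip() for s in document.split(".") if s.strip()]
        embeddings = sbert.encode(sentences)  # shape: (n_sentences, EMB_DIM)
        k = min(N_CLUSTERS, len(sentences))
        kmeans = KMeans(n_clusters=k, n_init=10).fit(embeddings)
        reps = []
        for c in range(k):
            # Keep the member embedding closest to the cluster centroid.
            members = embeddings[kmeans.labels_ == c]
            dists = np.linalg.norm(members - kmeans.cluster_centers_[c], axis=1)
            reps.append(members[np.argmin(dists)])
        # Zero-pad so every document yields a fixed-size autoencoder input.
        while len(reps) < N_CLUSTERS:
            reps.append(np.zeros(EMB_DIM, dtype=np.float32))
        return np.stack(reps).astype(np.float32)

    class DocAutoencoder(nn.Module):
        """Compress the concatenated representative sentence embeddings into a
        single document embedding (the bottleneck), then reconstruct them."""
        def __init__(self, in_dim=N_CLUSTERS * EMB_DIM, doc_dim=256):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, doc_dim))
            self.decoder = nn.Sequential(
                nn.Linear(doc_dim, 512), nn.ReLU(), nn.Linear(512, in_dim))
        def forward(self, x):
            z = self.encoder(x)            # z is the document embedding
            return self.decoder(z), z

    # Usage: train with a reconstruction loss, then keep the bottleneck z.
    docs = ["First document. It has several sentences. They vary in topic.",
            "Second document. Shorter, but treated the same way."]
    X = torch.tensor(np.stack(
        [representative_embeddings(d).ravel() for d in docs]))
    model = DocAutoencoder()
    recon, doc_embs = model(X)
    loss = nn.functional.mse_loss(recon, X)  # minimise this during training
    print(doc_embs.shape)  # torch.Size([2, 256]): one embedding per document

After training, the decoder is discarded and the encoder's bottleneck output is used as the document embedding, which is what makes the reconstruction objective a plausible stand-in for the article's approach.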
