Sentence representation-based text matching algorithm using prompt learning and contrastive learning


Abstract

With the recent global popularity of ChatGPT (Chat Generative Pre-trained Transformer), the era in which natural language processing (NLP) technology flourishes throughout human society has arrived. As a fundamental yet indispensable component of NLP, text matching techniques permeate a wide range of application scenarios, such as search engines, dialogue systems, large-scale text deduplication, and recommendation systems. In the deep learning era, particularly since the emergence of the pre-trained language model BERT (Bidirectional Encoder Representations from Transformers), representation-based text matching algorithms built on BERT and its variants have become mainstream. However, their effectiveness is limited by the low quality of the sentence vectors produced by the original BERT. In 2021, the self-supervised contrastive learning algorithm SimCSE was introduced, significantly improving the quality of BERT-encoded sentence vectors. Yet the in-batch negative sampling strategy used by SimCSE is highly random and may yield negative samples with low relevance to the anchor sentence, limiting the amount of information the model can learn. This paper proposes a negative sample construction algorithm based on prompt learning, called PLNSC, which formulates a masked language modeling task and leverages the strong language capabilities of pre-trained language models to generate a plausible negative sample representation for each sentence. This allows the model to learn richer information during contrastive learning, thereby enhancing its sentence representation capability. The approach improved the average performance of three baseline models across five datasets by 1.93, 0.49, and 0.63 points, respectively.
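To make the idea concrete, the sketch below shows one way a prompt-built negative representation could be folded into a SimCSE-style contrastive loss. This is not the authors' released PLNSC code: the prompt template wording, the bert-base-uncased checkpoint, the mean pooling, and the loss assembly are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Illustrative sketch only; checkpoint and template are assumptions.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.train()  # keep dropout active so two passes yield two "views" (as in SimCSE)

def embed(texts):
    """Mean-pooled sentence embeddings (one common pooling choice)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state            # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)           # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)            # (B, H)

def prompt_negative(texts):
    """Wrap each sentence in a template whose [MASK] slot is meant to elicit a
    contrasting meaning; the hidden state at the [MASK] position serves as the
    negative representation. The template wording is a hypothetical example."""
    prompts = [f'"{t}" The opposite of this sentence means {tokenizer.mask_token}.'
               for t in texts]
    batch = tokenizer(prompts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state
    pos = (batch["input_ids"] == tokenizer.mask_token_id).nonzero()  # one mask per prompt
    return hidden[pos[:, 0], pos[:, 1]]                    # (B, H)

def contrastive_loss(texts, tau=0.05):
    """InfoNCE: two dropout views are positives; in-batch sentences plus the
    prompt-built representation act as negatives."""
    z1, z2 = embed(texts), embed(texts)                    # dropout gives two views
    neg = prompt_negative(texts)                           # (B, H) hard negatives
    sim_pos = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / tau  # (B, B)
    sim_neg = F.cosine_similarity(z1, neg, dim=-1).unsqueeze(1) / tau              # (B, 1)
    logits = torch.cat([sim_pos, sim_neg], dim=1)          # (B, B+1)
    labels = torch.arange(len(texts))                      # diagonal entries are positives
    return F.cross_entropy(logits, labels)

loss = contrastive_loss(["A man is playing a guitar.", "Two dogs run in the park."])
```

Appending the prompt-derived vector as an extra column of the similarity matrix is one plausible reading of "richer information per sentence": every anchor then has at least one semantically related but contrasting negative, rather than relying solely on randomly co-batched sentences.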