Efficient AI Systems for Domain Adaptation: LLM-Guided Weighted Contrastive Learning with Reduced Computational Requirements

Abstract

Domain adaptation of pre-trained language models remains challenging, especially for specialized text collections with distinct vocabularies and unique semantic structures. Existing contrastive learning methods frequently rely on generic masking techniques and coarse-grained similarity measures, which limit their ability to capture fine-grained, domain-specific linguistic nuances. This paper proposes an enhanced domain adaptation framework that integrates weighted contrastive learning guided by large language model (LLM) feedback with a novel topic-aware masking strategy. Specifically, topic modeling is used to systematically identify semantically crucial domain-specific terms, enabling the creation of meaningful contrastive pairs through three targeted masking strategies: single-keyword, multiple-keyword, and partial-keyword masking. Each masked sentence undergoes LLM-guided reconstruction, accompanied by graduated similarity assessments that serve as continuous, fine-grained supervision signals. Experiments conducted on an early 20th-century science fiction corpus demonstrate that the proposed approach consistently outperforms existing baselines, such as SimCSE and DiffCSE, across multiple linguistic probing tasks within the newly introduced SF-ProbeEval benchmark. Furthermore, the proposed method achieves these performance improvements with significantly reduced computational requirements, highlighting its practical applicability for efficient and interpretable adaptation of language models to specialized domains.
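
The abstract does not include implementation details, but the pipeline it describes can be sketched from its description alone. The following minimal Python sketch illustrates the three masking strategies and an InfoNCE-style contrastive loss in which each positive pair is scaled by an LLM-assigned graduated similarity score; all function names, the keyword-list representation, and the choice to weight the per-pair cross-entropy term are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of the pipeline described in the abstract.
# Assumptions: keywords come from topic modeling as a lowercase set,
# and LLM similarity assessments are scores in [0, 1].
import random
import torch
import torch.nn.functional as F

MASK_TOKEN = "[MASK]"

def mask_single_keyword(tokens, keywords):
    """Single-keyword masking: mask one randomly chosen domain keyword."""
    hits = [i for i, t in enumerate(tokens) if t.lower() in keywords]
    if not hits:
        return tokens
    out = list(tokens)
    out[random.choice(hits)] = MASK_TOKEN
    return out

def mask_multiple_keywords(tokens, keywords, k=2):
    """Multiple-keyword masking: mask up to k domain keywords."""
    hits = [i for i, t in enumerate(tokens) if t.lower() in keywords]
    out = list(tokens)
    for i in random.sample(hits, min(k, len(hits))):
        out[i] = MASK_TOKEN
    return out

def mask_partial_keyword(tokens, keywords):
    """Partial-keyword masking: mask part of one keyword, keeping a
    prefix cue (one plausible reading of "partial-keyword")."""
    hits = [i for i, t in enumerate(tokens) if t.lower() in keywords]
    if not hits:
        return tokens
    out = list(tokens)
    i = random.choice(hits)
    out[i] = out[i][: max(1, len(out[i]) // 2)] + MASK_TOKEN
    return out

def weighted_contrastive_loss(anchors, positives, weights, temperature=0.05):
    """InfoNCE-style loss where each in-batch positive pair's term is
    scaled by its LLM-assigned similarity weight.

    anchors, positives: (batch, dim) sentence embeddings
    weights:            (batch,) graduated similarity scores in [0, 1]
    """
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)
    logits = anchors @ positives.T / temperature      # (batch, batch)
    targets = torch.arange(anchors.size(0), device=anchors.device)
    per_pair = F.cross_entropy(logits, targets, reduction="none")
    return (weights * per_pair).mean()
```

Under this reading, a pair whose LLM-guided reconstruction is judged highly similar to the original contributes a strong positive signal, while a low-scoring pair is down-weighted rather than treated as an all-or-nothing positive, which is how graduated assessments can act as continuous supervision.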
