The Word and the Way: Strategies for Domain-Specific BERT Pre-Training in German Medical NLP
Abstract
Background: Digital healthcare generates vast amounts of clinical text that holds potential for AI-assisted applications. However, existing German biomedical language models either rely on older architectures or are trained on limited data, which may hinder their performance in real-world settings.

Methods: To explore the impact of domain adaptation strategies in German clinical NLP, we developed a family of domain-specific RoBERTa-based language models, collectively referred to as ChristBERT (Clinical- and Healthcare-Related Issues and Subjects Tuned BERT). To address the lack of large-scale German clinical corpora, we curated a 13.5 GB dataset consisting of scientific publications, clinical texts, and health-related web content. Additionally, we employed data augmentation via translation of English clinical corpora. Three domain adaptation strategies were explored: continued pre-training, pre-training from scratch, and pre-training with domain-specific vocabulary adaptation.

Results: The resulting models were evaluated on three medical named entity recognition tasks and two text classification tasks. Our models consistently outperformed four existing general-purpose and medical German models on four out of five tasks. The results demonstrate that the choice of domain adaptation strategy significantly influences downstream task performance. Based on the empirical results, pre-training from scratch is effective for highly specialized clinical texts, whereas continued pre-training is better suited to more commonly written medical texts.

Conclusions: ChristBERT establishes a new state of the art for German clinical language modeling. Our findings indicate that the optimal domain adaptation strategy is task-dependent and remains crucial, as adapted models consistently outperformed general-purpose language models in our experiments. To support further research and application in German medical NLP, all developed models are publicly released.
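The abstract itself contains no code; as a rough illustration of the simplest of the three strategies, continued pre-training, the sketch below continues masked-language-model training of an existing checkpoint on a plain-text medical corpus using HuggingFace Transformers. The base checkpoint name, corpus path, and hyperparameters are illustrative assumptions, not the authors' actual configuration.

# Minimal sketch: continued pre-training with a masked-language-modeling (MLM)
# objective. Model name, corpus file, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "xlm-roberta-base"            # assumed general-purpose starting checkpoint
CORPUS_FILE = "german_medical_corpus.txt"  # hypothetical one-document-per-line corpus

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForMaskedLM.from_pretrained(BASE_MODEL)

# Load the raw domain text and tokenize it into fixed-length inputs.
dataset = load_dataset("text", data_files={"train": CORPUS_FILE})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking: 15% of tokens are masked on the fly for the MLM objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="continued-pretraining",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
    save_steps=10_000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()

Pre-training from scratch would instead initialize the model from a fresh RoBERTa configuration, and vocabulary adaptation would additionally train a domain-specific tokenizer before pre-training; both follow the same training loop shown above.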