The Word and the Way: Strategies for Domain-Specific BERT Pre-Training in German Medical NLP

Abstract

Background: Digital healthcare generates vast amounts of clinical texts that hold potential for AI-assisted applications. However, existing German biomedical language models either rely on older architectures or are trained on limited data, which may hinder their performance in real-world settings.

Methods: To explore the impact of domain adaptation strategies in German clinical NLP, we developed a family of domain-specific RoBERTa-based language models, collectively referred to as ChristBERT (Clinical- and Healthcare-Related Issues and Subjects Tuned BERT). To address the lack of large-scale German clinical corpora, we curated a 13.5 GB dataset consisting of scientific publications, clinical texts, and health-related web content. Additionally, we employed data augmentation via translation of English clinical corpora. Three domain adaptation strategies were explored: continued pre-training, pre-training from scratch, and pre-training with domain-specific vocabulary adaptation.

Results: The resulting models were evaluated on three medical named entity recognition tasks and two text classification tasks. Our models consistently outperformed four existing general-purpose and medical German models on four out of five tasks. The results demonstrate that the choice of domain adaptation strategy significantly influences downstream task performance. Based on the empirical results, pre-training from scratch is effective for highly specialized clinical texts, whereas continued pre-training is better suited to more commonly written medical texts.

Conclusions: ChristBERT establishes a new state of the art for German clinical language modeling. Our findings indicate that the optimal domain adaptation strategy is task-dependent, and that domain adaptation remains crucial, as adapted models consistently outperformed general-purpose language models in our experiments. To support further research and application in German medical NLP, all developed models are publicly released.
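As a rough illustration of the continued pre-training strategy described in the abstract, the sketch below shows how such a masked-language-modeling run might look with the Hugging Face Transformers library. The checkpoint name, corpus path, and hyperparameters are placeholders, not the authors' actual configuration.

```python
# Minimal sketch: continued pre-training of an existing German RoBERTa model
# on a domain-specific medical corpus (masked language modeling objective).
# Checkpoint and file names below are hypothetical placeholders.
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

base_checkpoint = "some-german-roberta-base"  # placeholder for a general-purpose German model
tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(base_checkpoint)

# Load a line-delimited German medical text corpus (placeholder path).
corpus = load_dataset("text", data_files={"train": "german_medical_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# RoBERTa-style pre-training uses dynamic token masking.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="continued-pretraining-output",
        per_device_train_batch_size=16,
        num_train_epochs=1,
    ),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```

Pre-training from scratch would differ mainly in initializing the model from a fresh configuration (and, for the vocabulary-adaptation variant, training a new tokenizer on the domain corpus) rather than loading existing weights.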