The Word and the Way: Strategies for Domain-Specific BERT Pre-Training in German Medical NLP
Abstract
Background: Digital healthcare generates vast amounts of clinical text that holds potential for AI-assisted applications. However, existing German biomedical language models either rely on older architectures or are trained on limited data, which may hinder their performance in real-world settings.

Methods: To explore the impact of domain adaptation strategies in German clinical NLP, we developed a family of domain-specific RoBERTa-based language models, collectively referred to as ChristBERT (Clinical- and Healthcare-Related Issues and Subjects Tuned BERT). To address the lack of large-scale German clinical corpora, we curated a 13.5 GB dataset consisting of scientific publications, clinical texts, and health-related web content. Additionally, we employed data augmentation via translation of English clinical corpora. Three domain adaptation strategies were explored: continued pre-training, pre-training from scratch, and pre-training with domain-specific vocabulary adaptation.

Results: The resulting models were evaluated on three medical named entity recognition tasks and two text classification tasks. Our models consistently outperformed four existing general-purpose and medical German models on four out of five tasks. The results demonstrate that the choice of domain adaptation strategy significantly influences downstream task performance. Based on the empirical results, pre-training from scratch is effective for highly specialized clinical texts, whereas continued pre-training is better suited to more commonly written medical texts.

Conclusions: ChristBERT establishes a new state of the art for German clinical language modeling. Our findings indicate that the optimal domain adaptation strategy is task-dependent and remains crucial, as adapted models consistently outperformed general-purpose language models in our experiments. To support further research and application in German medical NLP, all developed models are publicly released.
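The abstract itself contains no code; as a rough illustration of the simplest of the three strategies, continued pre-training, the sketch below continues masked-language-model training of an existing checkpoint on a plain-text medical corpus using HuggingFace Transformers. The base checkpoint name, corpus path, and hyperparameters are illustrative assumptions, not the authors' actual configuration.

# Minimal sketch: continued pre-training with a masked-language-modeling (MLM)
# objective. Model name, corpus file, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "xlm-roberta-base"            # assumed general-purpose starting checkpoint
CORPUS_FILE = "german_medical_corpus.txt"  # hypothetical one-document-per-line corpus

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForMaskedLM.from_pretrained(BASE_MODEL)

# Load the raw domain text and tokenize it into fixed-length inputs.
dataset = load_dataset("text", data_files={"train": CORPUS_FILE})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking: 15% of tokens are masked on the fly for the MLM objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="continued-pretraining",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
    save_steps=10_000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()

Pre-training from scratch would instead initialize the model from a fresh RoBERTa configuration, and vocabulary adaptation would additionally train a domain-specific tokenizer before pre-training; both follow the same training loop shown above.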