Weakly-Supervised Multilingual Medical NER for Symptom Extraction for Low-Resource Languages

Abstract

Patient-reported health data, especially Patient-Reported Outcome Measures (PROMs), are vital for improving clinical care but are often limited by memory bias, cognitive load, and inflexible questionnaires. Patients prefer conversational symptom reporting, highlighting the need for robust methods for symptom extraction and conversational intelligence. This study presents a weakly supervised pipeline for training and evaluating medical Named Entity Recognition (NER) models across eight languages, with a focus on low-resource settings. A merged English medical corpus, annotated using the Stanza i2b2 model, was translated into German, Greek, Spanish, Italian, Portuguese, Polish, and Slovenian, preserving entity annotations (PROBLEM, TEST, TREATMENT). Data augmentation addressed class imbalance, and fine-tuned BERT-based models consistently outperformed baselines. The English model achieved the highest F1 score (80.07%), followed by German (78.70%), Spanish (77.61%), Portuguese (77.21%), Slovenian (75.72%), Italian (75.60%), Polish (75.56%), and Greek (69.10%). Compared to existing baselines, our models demonstrated notable performance gains, particularly in English, Spanish, and Italian. This research underscores the feasibility and effectiveness of weakly supervised, multilingual approaches for medical entity extraction, contributing to improved information access in clinical narratives, especially in under-resourced languages.
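As a concrete illustration of the weak-annotation step described above, the sketch below tags English clinical text with Stanza's publicly documented i2b2 clinical NER model, which emits exactly the three entity types the abstract names (PROBLEM, TEST, TREATMENT). This is a minimal sketch of one pipeline stage, not the authors' full implementation; the example sentence is invented for illustration and does not come from the paper's corpus.

```python
# Minimal sketch of the weak-annotation step: tag English clinical text
# with Stanza's i2b2 NER model (entity types: PROBLEM, TEST, TREATMENT).
# The example sentence is illustrative, not from the paper's corpus.
import stanza

# Download and build an English clinical pipeline that swaps in the
# i2b2 NER model (the 'mimic' package provides clinical tokenization).
stanza.download("en", package="mimic", processors={"ner": "i2b2"})
nlp = stanza.Pipeline("en", package="mimic", processors={"ner": "i2b2"})

doc = nlp("The patient reported chest pain; an ECG was ordered and aspirin was given.")

# Each recognized span carries one of the three i2b2 types, which serve
# as weak labels for the downstream multilingual training data.
for ent in doc.ents:
    print(ent.text, ent.type)  # e.g. a PROBLEM, TEST, or TREATMENT span
```

In a pipeline like the one described, these weak labels would then be projected onto the translated corpora before augmentation and BERT fine-tuning.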
