Improving Arabic Clinical Question Quality through Domain-Adaptive Masked Language Modeling
Abstract
Arabic clinical NLP systems often receive short, vague, or incomplete questions, which leads to weak downstream answers even with strong encoders. We address this bottleneck by treating question quality as a first-class, measurable objective. Using domain-adaptive (continued) pretraining with a masked-language-modeling objective (DAPT-MLM) on AHQAD (~808k Arabic health Q–A pairs), we adapt two widely used backbones, AraBERT and the generator variant of AraELECTRA, to the lexical, syntactic, and discourse patterns of well-formed medical questions. Evaluation is aligned with the learning signal: we report cross-entropy (CE) and perplexity (PPL) computed only at masked tokens, top-k accuracy restricted to masked spans, and lexical-diversity measures to discourage formulaic phrasing. A length-controlled test design (Short/Long/Very Long) separates modeling gains from the effect of verbosity. Results show consistent intrinsic improvements for the domain-adapted models; AraBERT-MLM is best overall (macro Top-5 = 0.8392, lowest CE/PPL), outperforming AraBERT (orig.) by +6.0 pp Top-5 and AraELECTRA (orig.) by +17.2 pp. A 200-item human study (clinician + linguist) corroborates these gains (mean ± 95% CI: Clarity 4.12 ± 0.18, Fluency 3.68 ± 0.22, Semantic Fidelity 3.15 ± 0.25, Usefulness 3.42 ± 0.21; substantial agreement, κ ≈ 0.77) and highlights residual semantic drift that informs simple, slot-constrained decoding fixes. Overall, the proposed reformulation module produces more natural and clinically relevant Arabic questions and can be plugged into Arabic clinical QA pipelines as a measurable, tunable front-end.
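To make the masked-token evaluation concrete, the following is a minimal sketch of how cross-entropy, perplexity, and top-k accuracy could be computed over masked positions only, and then macro-averaged across length buckets. It is not the authors' implementation. The function names `masked_mlm_metrics` and `macro_average`, the use of `-100` to mark unmasked positions (a common convention in MLM training code), and the reading of "macro" as an unweighted mean over the Short/Long/Very Long buckets are all assumptions made for illustration.

```python
import math

def masked_mlm_metrics(logits, labels, k=5, ignore_index=-100):
    """Score only the masked positions of one evaluation set.

    logits: list of per-position score lists (one score per vocabulary id).
    labels: gold token id at masked positions, ignore_index elsewhere
            (assumed convention, not taken from the paper).
    """
    total_nll, hits, n_masked = 0.0, 0, 0
    for scores, gold in zip(logits, labels):
        if gold == ignore_index:
            continue  # unmasked position: excluded from every metric
        # Numerically stable log-sum-exp for the softmax normaliser.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total_nll += log_z - scores[gold]  # negative log-likelihood of gold
        # Indices of the k highest-scoring vocabulary ids.
        top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        hits += int(gold in top_k)
        n_masked += 1
    if n_masked == 0:
        raise ValueError("no masked positions to evaluate")
    ce = total_nll / n_masked
    return {"ce": ce, "ppl": math.exp(ce), "top_k_acc": hits / n_masked}

def macro_average(per_bucket, key):
    """Unweighted mean of one metric over length buckets (assumed meaning of 'macro')."""
    return sum(m[key] for m in per_bucket.values()) / len(per_bucket)
```

Perplexity here is simply the exponential of the mean masked-token cross-entropy, so the two metrics always rank models identically. Averaging per bucket before combining keeps a large bucket from dominating the headline number.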