Accurate Clinical Entity Recognition and Code Mapping of Anatomopathological Reports Using BioClinicalBERT Enhanced by Retrieval-Augmented Generation: A Hybrid Deep Learning Approach
Abstract
Background: Anatomopathological reports are largely unstructured, which limits automated data extraction, interoperability, and large-scale research. Manual extraction and standardization are costly and difficult to scale.
Objective: We developed and evaluated an automated pipeline for entity extraction and multi-ontology normalization of anatomopathological reports.
Methods: A corpus of 560 reports from the Military Hospital of Tunis, Tunisia, was manually annotated for three entity types: sample type, test performed, and finding. Entity extraction used BioBERT v1.1. Normalization combined BioClinicalBERT multi-label classification with retrieval-augmented generation, using both dense and BM25 sparse retrieval over SNOMED CT, LOINC, and ICD-11. Performance was assessed using precision, recall, F1-score, and statistical tests.
Results: BioBERT achieved high extraction performance (F1: 0.97 for sample type, 0.98 for test performed, and 0.93 for finding; overall 0.963, 95% CI: 0.933–0.982), with low absolute errors. For terminology mapping, BioClinicalBERT combined with dense retrieval outperformed the standalone and BM25-based approaches (macro-F1: 0.6159 for SNOMED CT, 0.9294 for LOINC, and 0.7201 for ICD-11). Cohen’s Kappa ranged from 0.7829 to 0.9773, indicating substantial to near-perfect agreement.
Conclusions: The pipeline provides robust automated extraction and multi-ontology coding of anatomopathological entities, supporting the combination of transformer-based named entity recognition with retrieval-augmented generation. However, given the limitations of this study, multi-institutional validation is needed before clinical deployment.
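To make the sparse-retrieval component mentioned in the Methods more concrete, the following is a minimal sketch of BM25 ranking over a small terminology list. It is not the authors' implementation: the `TOY-…` identifiers and term labels are invented placeholders rather than real SNOMED CT, LOINC, or ICD-11 entries, the whitespace tokenizer is a simplification, and the `k1` and `b` values are commonly used defaults assumed here, since the abstract does not report them.

```python
import math
from collections import Counter

# Hypothetical terminology entries: (placeholder id, label). Not real codes.
TERMS = [
    ("TOY-001", "biopsy of skin"),
    ("TOY-002", "biopsy of colon"),
    ("TOY-003", "invasive ductal carcinoma of breast"),
    ("TOY-004", "chronic gastritis"),
]


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification for illustration)."""
    return text.lower().split()


class BM25:
    """Okapi BM25 scorer over a fixed list of short documents."""

    def __init__(self, docs, k1=1.5, b=0.75):  # k1 and b are assumed defaults
        self.k1 = k1
        self.b = b
        self.docs = [tokenize(d) for d in docs]
        self.n = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.n
        self.tf = [Counter(d) for d in self.docs]
        # Document frequency: number of documents containing each token.
        df = Counter()
        for d in self.docs:
            df.update(set(d))
        # Smoothed idf variant that stays positive for very common tokens.
        self.idf = {
            t: math.log(1 + (self.n - f + 0.5) / (f + 0.5)) for t, f in df.items()
        }

    def score(self, query, i):
        """BM25 score of document i for the given query string."""
        tf = self.tf[i]
        dl = len(self.docs[i])
        total = 0.0
        for t in tokenize(query):
            if t not in tf:
                continue
            f = tf[t]
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[t] * f * (self.k1 + 1) / norm
        return total


_INDEX = BM25([label for _, label in TERMS])


def search(query, k=3):
    """Return up to k (placeholder id, score) pairs with a nonzero score."""
    scored = [(_INDEX.score(query, i), i) for i in range(len(TERMS))]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(TERMS[i][0], s) for s, i in scored[:k] if s > 0]


if __name__ == "__main__":
    for mention in ["skin biopsy", "chronic gastritis"]:
        print(mention, "->", search(mention))
```

In a pipeline like the one described, an extracted entity mention would serve as the query and the top-ranked candidates would be passed on to the generation or classification step. A dense retriever would replace the token-overlap score with embedding similarity, which is the variant the reported results favor.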