Accurate Clinical Entity Recognition and Code Mapping of Anatomopathological Reports Using BioClinicalBERT Enhanced by Retrieval-Augmented Generation: A Hybrid Deep Learning Approach

Abstract

Background: Anatomopathological reports are largely unstructured, which limits automated data extraction, interoperability, and large-scale research. Manual extraction and standardization are costly and difficult to scale.

Objective: We developed and evaluated an automated pipeline for entity extraction and multi-ontology normalization of anatomopathological reports.

Methods: A corpus of 560 reports from the Military Hospital of Tunis, Tunisia, was manually annotated for three entity types: sample type, test performed, and finding. Entity extraction used BioBERT v1.1. Normalization combined BioClinicalBERT multi-label classification with retrieval-augmented generation, using both dense and BM25 sparse retrieval over SNOMED CT, LOINC, and ICD-11. Performance was measured using precision, recall, F1-score, and statistical tests.

Results: BioBERT achieved high extraction performance (F1: 0.97 for sample type, 0.98 for test performed, and 0.93 for finding; overall 0.963, 95% CI: 0.933–0.982), with low absolute errors. For terminology mapping, the combination of BioClinicalBERT and dense retrieval outperformed both the standalone and the BM25-based approaches (macro-F1: 0.6159 for SNOMED CT, 0.9294 for LOINC, and 0.7201 for ICD-11). Cohen's kappa ranged from 0.7829 to 0.9773, indicating substantial to near-perfect agreement.

Conclusions: The pipeline provides robust automated extraction and multi-ontology coding of anatomopathological entities, supporting the use of transformer-based named entity recognition combined with retrieval-augmented generation. However, given the limitations of this study, multi-institutional validation is needed before clinical deployment.
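To make the reported evaluation measures concrete, the following is a minimal sketch of how macro-F1 and Cohen's kappa can be computed from a list of reference codes and a list of predicted codes. It is not the authors' evaluation code. The function names, the placeholder labels (CODE_A, CODE_B), and the single-label simplification are assumptions made for illustration; the abstract describes the mapping step as multi-label.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


def cohens_kappa(a, b):
    """Agreement between two label sequences, corrected for chance agreement."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: product of the two marginal label frequencies.
    expected = sum(counts_a[l] * counts_b[l] for l in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical placeholder labels, not real ontology codes.
gold = ["CODE_A", "CODE_A", "CODE_B", "CODE_B"]
pred = ["CODE_A", "CODE_B", "CODE_B", "CODE_B"]

print(round(macro_f1(gold, pred), 4))
print(round(cohens_kappa(gold, pred), 4))
```

Macro-F1 weights every code equally regardless of frequency, so rare codes that are mapped poorly lower the score as much as common ones. This may partly explain why a large terminology such as SNOMED CT scores lower than LOINC in the results above, although the abstract does not state the cause.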
