Entity-centric evaluation of large language model responses for medical question-answering tasks

Abstract

Objective

To develop a metric for evaluating the clinical alignment and informativeness of large language model (LLM)-generated responses in medical question-answering (QA) tasks.

Materials and methods

We propose EntQA, an entity-centric metric that extracts biomedical entities from patient backgrounds, diagnostic questions, and LLM responses using a biomedical named entity recognition model, then applies de-duplication and thresholded semantic and lexical matching. From the matched entities, we compute recall-style coverage scores that quantify entity retention and detect omissions without relying on external resources. We evaluated EntQA on five benchmarks using seven Qwen 2.5 Instruct models (0.5B–72B parameters), comparing it against baseline metrics via Spearman and Kendall correlations with model accuracy at the group level, point-biserial correlations at the case level, and Spearman correlations with model scale.
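
As a concrete illustration, a minimal sketch of this scoring pipeline is given below. The specific NER model (scispacy's "en_core_sci_sm"), embedding model ("all-MiniLM-L6-v2"), threshold values, and helper names are assumptions chosen for illustration, not the authors' exact implementation.

    # Illustrative sketch of entity-centric coverage scoring; models and
    # thresholds below are assumptions, not the paper's exact configuration.
    import spacy                                          # requires scispacy models installed
    from difflib import SequenceMatcher
    from sentence_transformers import SentenceTransformer, util

    nlp = spacy.load("en_core_sci_sm")                    # assumed biomedical NER model
    embedder = SentenceTransformer("all-MiniLM-L6-v2")    # assumed embedding model
    SEM_THRESHOLD = 0.8                                   # assumed semantic threshold
    LEX_THRESHOLD = 0.9                                   # assumed lexical threshold

    def extract_entities(text):
        # Extract biomedical entity mentions and de-duplicate (lower-cased).
        return sorted({ent.text.lower() for ent in nlp(text).ents})

    def is_match(ref, cand):
        # An entity counts as covered if it matches lexically or semantically.
        if SequenceMatcher(None, ref, cand).ratio() >= LEX_THRESHOLD:
            return True
        sim = util.cos_sim(embedder.encode(ref), embedder.encode(cand)).item()
        return sim >= SEM_THRESHOLD

    def coverage(source_text, response_text):
        # Recall-style score: fraction of source entities retained in the response.
        source_ents = extract_entities(source_text)
        response_ents = extract_entities(response_text)
        if not source_ents:
            return 1.0
        hits = sum(any(is_match(s, r) for r in response_ents) for s in source_ents)
        return hits / len(source_ents)

A higher coverage value means the response retains more of the entities present in the patient background and question; omissions surface directly as uncovered entities.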

Results

EntQA showed consistently positive alignment with accuracy (group-level Spearman up to 0.9286; case-level point-biserial up to 0.0926) and with model scale (Spearman up to 0.252), outperforming baselines, which often showed negative or inconsistent correlations (e.g., BERTScore reached a Spearman correlation of -0.9286 with accuracy).
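
For concreteness, the correlation analyses above can be computed with standard SciPy routines, as in the following sketch; the data arrays are hypothetical placeholders, not values from the study.

    # Illustrative correlation analysis (hypothetical data).
    from scipy.stats import kendalltau, pointbiserialr, spearmanr

    # Group level: mean EntQA score and accuracy per model/benchmark group.
    group_entqa = [0.61, 0.64, 0.70, 0.73, 0.75, 0.78, 0.81]  # hypothetical
    group_acc = [0.42, 0.45, 0.55, 0.58, 0.63, 0.66, 0.71]    # hypothetical
    rho, _ = spearmanr(group_entqa, group_acc)
    tau, _ = kendalltau(group_entqa, group_acc)

    # Case level: per-question EntQA score against binary correctness (0/1).
    case_entqa = [0.52, 0.88, 0.31, 0.95, 0.67]               # hypothetical
    case_correct = [0, 1, 0, 1, 1]
    r_pb, _ = pointbiserialr(case_correct, case_entqa)

    print(f"Spearman={rho:.4f}, Kendall={tau:.4f}, point-biserial={r_pb:.4f}")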

Conclusion

EntQA offers a scalable, interpretable evaluation method for LLM-based medical QA. It outperforms traditional metrics in capturing clinical fidelity and supports trustworthy healthcare AI through applications in fact-checking and model refinement.
