Prior Knowledge Shapes Success When Large Language Models Are Fine-Tuned for Biomedical Term Normalization

Abstract

Large language models (LLMs) often fail to correctly associate biomedical terms with their standardized ontology identifiers, posing challenges for downstream applications that rely on accurate, machine-readable codes. These linking failures can compromise the integrity of data used in precision medicine, clinical decision support, and population health. Fine-tuning can partially remedy these issues, but the degree of improvement varies across terms and terminologies. Focusing on the Human Phenotype Ontology (HPO), we show that a model’s prior knowledge of term–identifier pairs, acquired during pre-training, strongly predicts whether fine-tuning will enhance its linking accuracy. We evaluate prior knowledge in three complementary ways: (1) latent probabilistic knowledge, revealed through stochastic prompting, which captures hidden associations not evident in deterministic output; (2) partial subtoken knowledge, reflected in incomplete but non-random generation of identifier components; and (3) term familiarity, inferred from annotation frequencies in the biomedical literature, which serve as a proxy for training exposure. We then assess how these forms of prior knowledge influence the accuracy of deterministic identifier linking. Fine-tuning performance varies most for terms in what we call the reactive middle zone of the ontology: terms with intermediate levels of prior knowledge that are neither absent nor fully consolidated. Fine-tuning was most successful when prior knowledge, as measured by partial subtoken knowledge, was ‘weak’ or ‘medium’, or when prior knowledge, as measured by latent probabilistic knowledge, was ‘unknown’ or ‘weak’ (p < 0.001). These terms from the reactive middle exhibited the largest gains or losses in accuracy during fine-tuning, suggesting that the success of knowledge injection depends critically on the level of term–identifier pair knowledge in the LLM before fine-tuning.
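The sketch below illustrates how two of the probes named above, latent probabilistic knowledge and partial subtoken knowledge, might be operationalized. It is a minimal illustration under stated assumptions, not the authors' protocol: the prompt wording, function names, HPO identifier pattern, example identifiers, and the `sample_completion` stand-in for a temperature-sampled LLM call are all hypothetical.

```python
import re
import random
from typing import Callable

# HPO identifiers have the form "HP:" followed by seven digits, e.g. HP:0001250.
HPO_ID_PATTERN = re.compile(r"HP:\d{7}")


def latent_probabilistic_knowledge(
    sample_completion: Callable[[str], str],
    term: str,
    gold_id: str,
    n_samples: int = 50,
) -> float:
    """Fraction of stochastic (temperature > 0) completions containing the gold
    identifier for `term`. A nonzero score reveals a latent association even
    when the deterministic (greedy) answer is wrong."""
    prompt = f"What is the Human Phenotype Ontology identifier for '{term}'?"
    hits = 0
    for _ in range(n_samples):
        match = HPO_ID_PATTERN.search(sample_completion(prompt))
        if match and match.group(0) == gold_id:
            hits += 1
    return hits / n_samples


def partial_subtoken_knowledge(predicted_id: str, gold_id: str) -> float:
    """One plausible scoring of incomplete but non-random identifier generation:
    the proportion of gold digits reproduced in the correct positions."""
    pred_digits = re.sub(r"\D", "", predicted_id).zfill(7)[:7]
    gold_digits = re.sub(r"\D", "", gold_id).zfill(7)[:7]
    return sum(p == g for p, g in zip(pred_digits, gold_digits)) / len(gold_digits)


if __name__ == "__main__":
    # Dummy sampler standing in for repeated temperature-sampled LLM calls.
    def dummy_sampler(prompt: str) -> str:
        return random.choice(["HP:0001250", "HP:0001251", "I am not sure."])

    print(latent_probabilistic_knowledge(dummy_sampler, "seizure", "HP:0001250"))
    print(partial_subtoken_knowledge("HP:0001251", "HP:0001250"))  # 6 of 7 digits match
```

In this framing, a term with a moderate sampling hit rate or a partially correct digit string would fall in the reactive middle zone described in the abstract, where fine-tuning outcomes vary most.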