Benchmarking the Impact of Data Leakage on the Performance of Knowledge Graph Embedding Models for Biomedical Link Prediction

Abstract

In recent years, Knowledge Graphs (KGs) have received increasing attention for their ability to organize complex biomedical knowledge into structured representations of entities and relations. Knowledge Graph Embedding (KGE) models facilitate efficient exploration of KGs by learning compact data representations. These models are increasingly applied to biomedical KGs for link prediction, for instance to uncover new therapeutic uses for existing drugs.
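To make the link-prediction setting concrete, the following is a minimal sketch of how a translational KGE model in the style of TransE scores candidate links. The embedding dimension, the toy biomedical entities, and the relation names here are illustrative assumptions, not the specific models or data benchmarked in this work.

```python
import numpy as np

# Minimal TransE-style sketch: each entity and relation is embedded as a
# dense vector, and a triple (head, relation, tail) is scored by how well
# head + relation approximates tail.
rng = np.random.default_rng(0)
dim = 64

# Hypothetical biomedical entities and relations (illustrative only).
entities = ["aspirin", "ibuprofen", "migraine", "inflammation"]
relations = ["treats", "causes"]

ent_emb = {e: rng.normal(size=dim) for e in entities}
rel_emb = {r: rng.normal(size=dim) for r in relations}

def score(head: str, relation: str, tail: str) -> float:
    """Negative L2 distance: higher means a more plausible link."""
    return -float(np.linalg.norm(ent_emb[head] + rel_emb[relation] - ent_emb[tail]))

# Link prediction: rank candidate tails for the query (aspirin, treats, ?).
candidates = ["migraine", "inflammation", "ibuprofen"]
ranked = sorted(candidates, key=lambda t: score("aspirin", "treats", t), reverse=True)
print(ranked)
```

In practice, embeddings are trained so that observed triples score higher than corrupted ones; ranking candidate tails in this way is exactly the link-prediction task evaluated in such benchmarks.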

While numerous KGE models have been developed and benchmarked for link prediction, existing evaluations often overlook the critical issue of data leakage. Data leakage leads the model to learn patterns it would not encounter when deployed in real-world settings, artificially inflating performance metrics and compromising the overall validity of benchmark results. In machine learning, data leakage can arise when (1) there is redundancy between the training and test sets, (2) the model leverages illegitimate features, or (3) the test set does not accurately reflect real-world inference scenarios. In this study, we demonstrate the impact of train-test redundancies in KGE-based link prediction and implement a systematic control procedure to detect and remove these redundancies. In addition, through permutation experiments, we investigate whether node degree acts as an illegitimate predictive feature, and find no evidence that models rely on it. Finally, we evaluate how well common test-set sampling strategies reflect the challenges of real-world inference in drug repurposing, comparing random and cold-start data splits with a real-world inference set derived from the Orphanet database. Performance drops substantially on the real-world inference set, indicating that current benchmarking practices may overestimate how well KGE models generalize to practical applications. Overall, our findings highlight the importance of rigorous benchmark design and careful evaluation of the generalization ability of KGE models for biomedical link prediction.
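As a rough illustration of the first leakage type, the sketch below flags test triples that duplicate a training triple or whose inverse appears in training. The triple format and function name are assumptions made for illustration; the systematic control procedure described in the study is more thorough than this.

```python
def find_redundant_test_triples(train_triples, test_triples):
    """Flag test triples leaked from the training set.

    A test triple (h, r, t) is treated as redundant if the identical
    triple, or its inverse (t, r, h), already occurs in training. A real
    pipeline would also need a mapping of relation-specific inverses
    (e.g. 'treats' vs. 'treated_by'); only the symmetric case is
    checked here for simplicity.
    """
    train = set(train_triples)
    redundant = []
    for h, r, t in test_triples:
        if (h, r, t) in train or (t, r, h) in train:
            redundant.append((h, r, t))
    return redundant


train = [("aspirin", "treats", "migraine"),
         ("geneA", "interacts_with", "geneB")]
test = [("geneB", "interacts_with", "geneA"),     # inverse leak
        ("ibuprofen", "treats", "inflammation")]  # clean

leaked = find_redundant_test_triples(train, test)
clean_test = [triple for triple in test if triple not in set(leaked)]
print(leaked)      # [('geneB', 'interacts_with', 'geneA')]
print(clean_test)  # [('ibuprofen', 'treats', 'inflammation')]
```

A similar spirit applies to the other two leakage types: permutation experiments probe whether shuffling degree-correlated structure changes predictions, and cold-start or externally derived test sets (such as one built from Orphanet) test whether the split resembles the inference scenario actually faced in drug repurposing.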
