A Large-Scale Pharmacogenomic Knowledge Graph for Drug-Gene-Variant-Disease Discovery

Abstract

Precision therapeutics depends on the ability to reason jointly over genes, variants, drugs, diseases, adverse drug reactions (ADRs), and molecular pathways without contaminating evaluation with future knowledge. I present a large-scale pharmacogenomic knowledge graph (PGx-KG) that integrates PharmGKB, ClinVar, SIDER, and Reactome—harmonized to HGNC, RxNorm, MeSH, and ChEBI identifiers—yielding 3,744,727 nodes and 9,645,367 edges across six major relation families. A leakage-free processing pipeline enforces version-aware chronological splits, publication-date audits, symmetric and transitive consistency checks, and cross-database de-duplication, eliminating temporal violations in held-out audits. As a first benchmark, a bilinear link-prediction model implemented in PyTorch Geometric achieves mean reciprocal rank (MRR) 0.347 (95% bootstrap CI [0.321, 0.368]), Hits@1/3/10 of 0.234/0.417/0.589, and AUROC 0.823 on validation data, with five-fold temporal cross-validation yielding 0.341 ± 0.018 MRR and a 2024 hold-out achieving MRR 0.329. Ranked candidate lists surface clinically relevant hypotheses, including CYP2D6–codeine dosing and HLA-B*15:02–carbamazepine risk, while also proposing pathway-level drug repurposing opportunities for expert review.
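The ranking metrics reported above (MRR and Hits@1/3/10) can be illustrated with a minimal sketch. It assumes each test triple has already been scored against its candidate entities and reduced to the 1-based rank of the true entity. The function name `ranking_metrics` and the example ranks are hypothetical and are not taken from the paper's code or data.

```python
from typing import Dict, Sequence


def ranking_metrics(ranks: Sequence[int], ks: Sequence[int] = (1, 3, 10)) -> Dict[str, float]:
    """Compute MRR and Hits@k from 1-based ranks of the true entities."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    if any(r < 1 for r in ranks):
        raise ValueError("ranks are 1-based and must be >= 1")
    n = len(ranks)
    # Mean reciprocal rank: average of 1/rank over all test triples.
    metrics = {"mrr": sum(1.0 / r for r in ranks) / n}
    # Hits@k: fraction of test triples whose true entity ranks in the top k.
    for k in ks:
        metrics[f"hits@{k}"] = sum(1 for r in ranks if r <= k) / n
    return metrics


# Hypothetical ranks for five test triples (illustrative only, not paper data).
example_ranks = [1, 2, 4, 10, 50]
example_metrics = ranking_metrics(example_ranks)
print(example_metrics)
```

In a temporally split setting like the one described, the ranks would be computed only for triples in the later (held-out) period, using a model trained on earlier data.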
