A Large-Scale Pharmacogenomic Knowledge Graph for Drug-Gene-Variant-Disease Discovery
Abstract
Precision therapeutics depends on the ability to reason jointly over genes, variants, drugs, diseases, adverse drug reactions (ADRs), and molecular pathways without contaminating evaluation with future knowledge. I present a large-scale pharmacogenomic knowledge graph (PGx-KG) that integrates PharmGKB, ClinVar, SIDER, and Reactome—harmonized to HGNC, RxNorm, MeSH, and ChEBI identifiers—yielding 3,744,727 nodes and 9,645,367 edges across six major relation families. A leakage-free processing pipeline enforces version-aware chronological splits, publication-date audits, symmetric and transitive consistency checks, and cross-database de-duplication, eliminating temporal violations in held-out audits. As a first benchmark, a bilinear link-prediction model implemented in PyTorch Geometric achieves mean reciprocal rank (MRR) 0.347 (95% bootstrap CI [0.321, 0.368]), Hits@1/3/10 of 0.234/0.417/0.589, and AUROC 0.823 on validation data, with five-fold temporal cross-validation yielding 0.341 ± 0.018 MRR and a 2024 hold-out achieving MRR 0.329. Ranked candidate lists surface clinically relevant hypotheses, including CYP2D6–codeine dosing and HLA-B*15:02–carbamazepine risk, while also proposing pathway-level drug repurposing opportunities for expert review.
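The ranking metrics quoted above (MRR and Hits@1/3/10) can be illustrated with a minimal sketch. It assumes each test triple has already been assigned a 1-based rank for its true entity among the scored candidates. The abstract does not say whether ranks are filtered or raw, nor how ties are broken, so the function name `reciprocal_rank_metrics` and the example ranks are hypothetical and are not taken from the author's pipeline.

```python
from typing import Dict, Sequence


def reciprocal_rank_metrics(
    ranks: Sequence[int], ks: Sequence[int] = (1, 3, 10)
) -> Dict[str, float]:
    """Compute MRR and Hits@k from 1-based ranks of the true entity.

    Assumption: `ranks[i]` is the position of the correct entity for the
    i-th test triple after sorting candidates by model score (1 = best).
    """
    if not ranks:
        raise ValueError("ranks must be non-empty")
    if any(r < 1 for r in ranks):
        raise ValueError("ranks are 1-based and must be >= 1")

    n = len(ranks)
    # Mean reciprocal rank: the average of 1/rank over all test triples.
    metrics = {"mrr": sum(1.0 / r for r in ranks) / n}
    # Hits@k: the fraction of test triples whose true entity is in the top k.
    for k in ks:
        metrics[f"hits@{k}"] = sum(1 for r in ranks if r <= k) / n
    return metrics


# Hypothetical ranks for four test triples (illustrative only, not paper data).
example_ranks = [1, 2, 4, 12]
example_metrics = reciprocal_rank_metrics(example_ranks)
print(example_metrics)
```

Under this definition, a single poorly ranked triple lowers MRR only slightly, whereas Hits@k counts it as a complete miss, which is one reason the two metrics are usually reported together.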