Interpreting Omics Data Analysis with Large Language Models for Disease Target and Drug Discovery

Zixi Xu
Weihang Chen
Wuyu Ren
Tianqi Xu
Somadina Amaechin
Raad Khan
Yixin Chen
Michael Province
Philip Payne
Fuhai Li

Abstract

In biomedical scientific discovery, synthesizing prior knowledge from the literature is essential to interpreting numerical omics data analyses for disease target identification and drug discovery. Large language models (LLMs) alone can rapidly retrieve disease mechanisms from biomedical text, but text-only outputs are generic and unreliable for target and drug prioritization without cohort-specific quantitative evidence. Herein, we propose a provenance-aware Text-to-Target framework that couples schema-constrained multi-model LLM retrieval with numeric omics data analysis. The key design is a modality-aware fusion step: candidates are partitioned into overlap-supported anchors, retrieval-only hidden hubs, and network-emergent novelty nodes, then propagated into staged hypothesis and strategy generation under topology constraints. We evaluate the framework in Alzheimer's disease (AD) and pancreatic ductal adenocarcinoma (PDAC). In PDAC, the workflow produced a balanced 75-gene candidate universe and a 23-strategy portfolio, with significant DepMap support at both the target level and the strategy level. In AD, stricter candidate controls yielded a compact 34-gene universe and 14 strategies; under an expanded CRISPRbrain registry, both target-level axes were significant, with strong strategy-level enrichment. Across both diseases, final strategies preserved full provenance closure to the candidate pool, enabling end-to-end auditability from retrieval artifacts to validation outputs. These results support a transferable discovery architecture in which omics evidence constrains biological activity, LLM retrieval expands the mechanistic search space, and network-aware fusion preserves interpretability. The framework provides a reproducible basis for dual-disease target prioritization and motivates continuous literature-mechanism concordance with agentic evidence-refresh loops.
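The fusion step and the provenance-closure property described in the abstract can be illustrated with a minimal sketch. The abstract does not define the three candidate classes precisely, so the set-based definitions below are assumptions: anchors are taken to be genes supported by both LLM retrieval and the omics network, hidden hubs are retrieved genes absent from the omics network, and novelty nodes are omics-network genes that were not retrieved. The function names, gene placeholders, and strategy labels are hypothetical and do not come from the preprint.

```python
from typing import Dict, Iterable, Set


def partition_candidates(
    retrieved: Set[str], omics_network: Set[str]
) -> Dict[str, Set[str]]:
    """Split candidates by evidence source (assumed definitions, see lead-in)."""
    return {
        # supported by both literature retrieval and omics analysis
        "anchor": retrieved & omics_network,
        # found only by LLM retrieval
        "hidden_hub": retrieved - omics_network,
        # found only through the omics-derived network
        "novelty": omics_network - retrieved,
    }


def provenance_closed(
    strategies: Dict[str, Iterable[str]], candidate_pool: Set[str]
) -> bool:
    """True if every gene named in every strategy is in the candidate pool."""
    return all(set(genes) <= candidate_pool for genes in strategies.values())


# Toy inputs with placeholder gene symbols (not data from the preprint).
retrieved = {"GENE_A", "GENE_B", "GENE_C"}
omics_network = {"GENE_A", "GENE_B", "GENE_D"}

classes = partition_candidates(retrieved, omics_network)
pool = set().union(*classes.values())

strategies = {"S1": ["GENE_A", "GENE_D"], "S2": ["GENE_C"]}
closed = provenance_closed(strategies, pool)
leaky = provenance_closed({"S3": ["GENE_X"]}, pool)
```

In this toy case `closed` is true because every strategy gene traces back to one of the three classes, while `leaky` is false because `GENE_X` has no entry in the candidate pool. A check of this kind is one simple way to audit that final strategies never reference a target without recorded provenance.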

Version published to 10.64898/2026.04.30.721768 on bioRxiv
May 5, 2026

Related articles

Knowledge Inclusive Machine Learning for Disease Gene Prioritisation

This article has 16 authors:
1. Chathura J. Gamage
2. Yu Xia
3. Ravisha Rupasinghe
4. Sachith Seneviratne
5. Damith Senanayake
6. Tamasha Malepathirana
7. Asela Hevapathige
8. Mark Corbett
9. Terence J. O’Brien
10. Steven Petrou
11. Samuel F. Berkovic
12. Ingrid E. Scheffer
13. Jozef Gecz
14. Melanie Bahlo
15. Mark F. Bennett
16. Saman Halgamuge
This article has no evaluations
Latest version May 2, 2026

OncoCITE: Multimodal Multi-Agent Reconstruction of Clinical Oncology Knowledge Bases from Scientific Literature

This article has 6 authors:
1. Mujahid Quidwai
2. Santiago Thibaud
3. Dennis Shasha
4. Sundar Jagannath
5. Samir Parekh
6. Alessandro Laganà
This article has no evaluations
Latest version Mar 31, 2026

Epigenomic Biomarker Discovery from Biomedical Literature: AI and Text Mining Toward Health Monitoring Frameworks

This article has 4 authors:
1. Ji-Hye Oh
2. Hee-Jo Nam
3. Soo Hyun Seo
4. Hyun-Seok Park
This article has no evaluations
Latest version May 8, 2026
