A Multi-Agent Approach to Generating Context-Rich Gene Sets
Abstract
Gene sets are collections of genes that share a common biological function, process, or component, and they can be used to gain insight into the biological relevance of genomic data. Databases containing these gene sets aid a wide array of analytical methods. The results of these methods, such as gene set analysis or phenotype-based gene prioritization, depend on the quality of the gene sets. Despite the extensive literature and genetic data available for constructing these databases, they often lack sufficient biological context. Current curation relies either on labour-intensive manual curation by experts working from literature and datasets, or on automated methods that are not context-aware. There is therefore a significant opportunity to use publicly available literature to bridge this gap and create more precise gene sets. With the advancement of natural language processing technologies, particularly large language models, this task can be performed more efficiently. In this work, we present a multi-agent system that uses the open-source large language models Llama 3, DeepSeek, and Qwen to analyze PubMed abstracts, allowing us to reconstruct gene sets in existing databases so that they better reflect specific biological contexts. Our approach consists of two pipelines. The first verifies the inclusion of each gene in a gene set by finding evidence in the abstracts of an association between the gene and the gene set. The second parses the abstracts to identify genes not already in the gene set as candidates for inclusion. To evaluate the proposed approach, we reconstructed a random selection of gene sets from the Human Phenotype Ontology (HPO). Our analysis shows that 149 of these gene sets have a similarity of 65.18% when compared to the original HPO gene sets, aligning well with the current HPO database. Additionally, we found an average of 3.15 new genes not included in the HPO gene sets, each supported by verified literature linking them to their respective gene sets. This highlights that our updated gene set database better reflects the current state of biological findings.
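To make the two-pipeline structure concrete, the following is a minimal sketch and not the authors' implementation. It assumes each agent can be reduced to a yes/no judgment on whether an abstract supports a gene and gene-set association. The names `verify_genes`, `discover_genes`, and `keyword_judge` are hypothetical, the keyword matcher is a toy stand-in for an LLM agent, and Jaccard similarity is an assumed metric, since the abstract does not state how similarity was computed.

```python
from typing import Callable, Iterable, Sequence, Set

# Hypothetical agent interface: (gene, gene_set_name, abstract) -> supported?
Judge = Callable[[str, str, str], bool]


def verify_genes(gene_set_name: str, genes: Iterable[str],
                 abstracts: Sequence[str], judge: Judge) -> Set[str]:
    """Pipeline 1 (sketch): keep genes backed by at least one abstract."""
    return {g for g in genes
            if any(judge(g, gene_set_name, a) for a in abstracts)}


def discover_genes(gene_set_name: str, known_genes: Iterable[str],
                   candidate_genes: Iterable[str],
                   abstracts: Sequence[str], judge: Judge) -> Set[str]:
    """Pipeline 2 (sketch): propose supported genes not already in the set."""
    unseen = set(candidate_genes) - set(known_genes)
    return {g for g in unseen
            if any(judge(g, gene_set_name, a) for a in abstracts)}


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Assumed similarity metric: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def keyword_judge(gene: str, gene_set_name: str, abstract: str) -> bool:
    """Toy stand-in for an LLM agent: both terms occur in the abstract."""
    text = abstract.lower()
    return gene.lower() in text and gene_set_name.lower() in text
```

In a real system the `judge` callable would wrap a call to one of the language models and could return the supporting passage as well as a verdict. A reconstructed gene set would then be the union of the verified and newly discovered genes, which can be compared against the original set with `jaccard`.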