GenBank2PubMed: Bridging Viral Genomic Data and the Scientific Literature with AI-Assisted Curation

Abstract

Background: GenBank entries for sequences of pathogenic viruses are typically annotated with host species and epidemiological metadata. However, linking these entries to their corresponding published studies remains labor-intensive.

Methods: We developed GenBank2PubMed, a computational pipeline that integrates GenBank sequence data with metadata from published studies. The pipeline aggregates GenBank entries into submission sets based on shared authorship, title similarity, submission dates, and the sequential nature of their accession numbers. Using automated methods, including GPT-4, we linked these submission sets to relevant publications, a challenging task given that many GenBank entries lack citation references. The result is a database in which viral sequences are annotated by host, country, and year of isolation. We also conducted a systematic review to assess how frequently published studies reporting sequences included GenBank submissions. We applied GenBank2PubMed to three high-mortality viruses with outbreak potential: Crimean-Congo hemorrhagic fever (CCHF) virus, Lassa virus, and Nipah virus.

Results: We identified 193 CCHF virus submission sets (4,754 entries), 78 Lassa virus sets (2,663 entries), and 34 Nipah virus sets (355 entries). Of these, 173 (CCHF), 64 (Lassa), and 31 (Nipah) were linked to published studies. Integration with publication data enriched the contextual and epidemiological metadata for each set. Additionally, our literature review found that 80.1% of CCHF, 86.6% of Lassa, and 87.5% of Nipah virus studies reporting sequences had corresponding GenBank submissions. GenBank submission sets and relational databases for each virus are available at https://hivdb.stanford.edu/genbank2pubmed/; the pipeline is available at https://github.com/hivdb/GenBankRefs.

Conclusions: Creating submission sets facilitates the organization of GenBank data into browsable spreadsheets and queryable databases. GPT-4 contributed to linking GenBank entries with published studies and extracting metadata, although manual validation remained essential for accuracy. GenBank2PubMed represents a significant step toward integrating GenBank viral sequences with the scientific literature in which they are reported.
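
The grouping of GenBank entries into submission sets can be illustrated with a minimal Python sketch. It is not the GenBankRefs implementation: the `Entry` fields, the function names, the thresholds, and the rule for combining the four signals (author overlap and title similarity required, plus either close submission dates or nearby accession numbers) are all assumptions made for illustration. Entries that match pairwise are merged with a simple union-find.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher
import re


@dataclass
class Entry:
    """Hypothetical, simplified view of one GenBank record."""
    accession: str
    authors: frozenset
    title: str
    submitted: date


def accession_parts(accession):
    """Split an accession such as 'MN000001.1' into ('MN', 1)."""
    base = accession.split(".")[0]  # drop the version suffix, if any
    match = re.fullmatch(r"([A-Z]+)_?(\d+)", base)
    if match is None:
        raise ValueError(f"unrecognized accession: {accession}")
    return match.group(1), int(match.group(2))


def same_set(a, b, min_author_overlap=0.5, min_title_sim=0.8,
             max_days=30, max_gap=50):
    """Decide whether two entries belong together (all thresholds assumed)."""
    union = a.authors | b.authors
    author_overlap = len(a.authors & b.authors) / len(union) if union else 0.0
    title_sim = SequenceMatcher(None, a.title.lower(), b.title.lower()).ratio()
    close_dates = abs((a.submitted - b.submitted).days) <= max_days
    (prefix_a, num_a) = accession_parts(a.accession)
    (prefix_b, num_b) = accession_parts(b.accession)
    near_accessions = prefix_a == prefix_b and abs(num_a - num_b) <= max_gap
    return (author_overlap >= min_author_overlap
            and title_sim >= min_title_sim
            and (close_dates or near_accessions))


def group_entries(entries):
    """Merge pairwise-matching entries into sets using union-find."""
    parent = list(range(len(entries)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(entries)):
        for j in range(i + 1, len(entries)):
            if same_set(entries[i], entries[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i, entry in enumerate(entries):
        groups.setdefault(find(i), []).append(entry.accession)
    return sorted(sorted(group) for group in groups.values())
```

The pairwise comparison is quadratic in the number of entries, which is acceptable for a few thousand records per virus; a larger dataset would likely need blocking on author or accession prefix first. Any such grouping would still need manual review.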
