A Metadata-Driven Framework for Strengthening Pathogen Genomics: Lessons from SARS-CoV-2
Abstract
During the COVID-19 pandemic, large-scale pathogen sequencing generated millions of SARS-CoV-2 genomes, which were deposited in repositories such as GenBank and GISAID. However, most of these records lack detailed patient metadata, such as demographics and clinical outcomes, which limits their utility for large-scale pathogen genomics analyses. Records linked to a journal publication may contain such metadata, but systematically extracting it and linking it to sequence records requires substantial manual effort. In this work, we assess the completeness of metadata in GenBank and demonstrate the value of enriched clinical and demographic annotations for genomic epidemiology. We found that, on average, GenBank records contained only 21.6% of host metadata, and that during our study period only ∼0.02% of published articles provided accessible sequence-specific patient metadata. Additionally, using published SARS-CoV-2 genomes and their corresponding journal articles, we constructed an analytical use case in pathogen genomics in which stratifying hosts by clinical and demographic factors enables examination of evolutionary dynamics and clinical outcomes. Our results demonstrate how metadata enrichment enhances pathogen genomic studies and provide a framework applicable to other pathogens.
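
One way to read the completeness figure above is as the fraction of expected host metadata fields that are filled in for each record, averaged over all records. The Python sketch below illustrates that reading only. The field list `HOST_FIELDS`, the function names, and the toy records are all hypothetical and are not taken from the study, whose exact definition of completeness is not given in the abstract.

```python
from statistics import mean

# Hypothetical list of host metadata fields (an assumption for illustration;
# the fields actually assessed in the study are not listed in the abstract).
HOST_FIELDS = ("host", "age", "sex", "collection_date", "country", "outcome")


def completeness(record, fields=HOST_FIELDS):
    """Return the fraction of expected fields that hold a non-empty value."""
    filled = 0
    for field in fields:
        value = record.get(field)
        if value is not None and str(value).strip():
            filled += 1
    return filled / len(fields)


def mean_completeness(records, fields=HOST_FIELDS):
    """Average the per-record completeness over a collection of records."""
    return mean(completeness(record, fields) for record in records)


# Toy records, invented purely to exercise the functions.
records = [
    {"host": "Homo sapiens", "country": "USA", "collection_date": "2021-03-02"},
    {
        "host": "Homo sapiens",
        "age": "54",
        "sex": "female",
        "country": "Brazil",
        "collection_date": "2020-11-15",
        "outcome": "recovered",
    },
    {"host": "Homo sapiens"},
]

if __name__ == "__main__":
    for record in records:
        print(f"{completeness(record):.2f}")
    print(f"mean: {mean_completeness(records):.2f}")
```

Under this reading, a record that reports only the host species scores far lower than one that also reports age, sex, and outcome, which is the kind of gap the abstract describes.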