Large-scale Manual Curation and Harmonization of Metadata from Metagenomic and Cancer Genomic Repositories: Challenges and Solutions
Abstract
Public omics repositories contain vast amounts of valuable data, but their metadata suffer from extreme heterogeneity, unstandardized terminologies, and quality issues that severely limit data reusability and cross-study integration. While prospective metadata standards exist, the majority of published omics data remain in non-standardized formats that require retrospective curation. We performed comprehensive manual curation and harmonization of clinical metadata from 212,027 samples across 468 studies in two major repositories: curatedMetagenomicData (93 studies, 22,588 samples) and cBioPortal (375 studies, 189,438 samples). Through systematic ontology mapping, we consolidated redundant, dispersed information into far fewer harmonized columns, reduced the number of unique values, and increased the completeness of major attributes. This curation process revealed common metadata quality issues, including typos, inconsistent terminologies, misplaced values, conflicting annotations, and information inappropriately merged across attributes. We document the challenges, decisions, and solutions encountered during large-scale metadata harmonization across two distinct omics domains. The harmonized metadata, accessible through the OmicsMLRepoR Bioconductor package, enable repository-wide queries and cross-study analyses that were previously difficult with heterogeneous metadata. Our experience provides practical guidance for similar curation efforts and demonstrates the value of investing in retrospective metadata improvement for existing public omics resources.
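As a rough illustration of what consolidating heterogeneous values into a harmonized attribute can involve, the following Python sketch maps free-text entries onto a small controlled vocabulary and sets aside anything it cannot map for manual review. It is not the authors' pipeline and does not use the OmicsMLRepoR API. The synonym table, the `EX:` term identifiers, and the function names are hypothetical placeholders chosen for the example.

```python
from collections import Counter

# Hypothetical curated synonym table for one attribute ("sex").
# Keys are normalized raw values; each maps to a harmonized label and a
# placeholder term ID. The "EX:" IDs are illustrative, not real ontology terms.
SEX_MAP = {
    "f": ("Female", "EX:0001"),
    "female": ("Female", "EX:0001"),
    "femlae": ("Female", "EX:0001"),  # known typo recorded by a curator
    "m": ("Male", "EX:0002"),
    "male": ("Male", "EX:0002"),
}


def normalize(value):
    """Lowercase the value, trim it, and collapse internal whitespace."""
    return " ".join(str(value).strip().lower().split())


def harmonize(values, term_map):
    """Return harmonized labels plus a count of raw values left unmapped.

    Unmapped entries become None so that a curator can review them
    instead of having them silently coerced to a guess.
    """
    harmonized = []
    unmapped = Counter()
    for raw in values:
        key = normalize(raw)
        if key in term_map:
            harmonized.append(term_map[key][0])
        else:
            harmonized.append(None)
            unmapped[raw] += 1
    return harmonized, unmapped


raw_values = ["Female", "F", " male ", "MALE", "femlae", "unknown?"]
labels, needs_review = harmonize(raw_values, SEX_MAP)
```

In this toy input, six distinct raw spellings reduce to two harmonized labels, and one entry is queued for review. Keeping unmapped values visible, rather than dropping them, reflects the manual nature of the curation described above.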