Taming the reference genome jungle: the refget sequence collection standard


Abstract

Reference genomes are foundational to genomics but suffer from widespread ambiguity and incompatibility due to inconsistent naming, undocumented differences, and lack of formal mechanisms for comparison. To address this, we introduce the GA4GH refget Sequence Collections (seqcol) standard. Refget seqcol is a framework for unambiguous representation, retrieval, and comparison of sequence collections such as reference genomes and transcriptomes. The seqcol standard comprises four components: a structured data schema, a canonical encoding algorithm that produces content-based, globally unique identifiers, a retrieval API, and a comparison protocol. This standard enables precise identification of sequence collections, even across decentralized or private systems, and allows compatibility assessments beyond exact identity, such as order-relaxed matches or shared coordinate systems. We applied the refget seqcol standard to 60 human and 36 mouse reference genomes sourced from major providers. Using digest-based comparisons, we quantified levels of similarity across attributes including sequence names, lengths, coordinate systems, and actual sequence content. Our analysis revealed some consistent subsets of sequences or coordinate systems, as well as substantial incompatibility among references and duplicate references under different names. To support adoption of refget seqcol, we provide a Python package implementing the full standard, a web API, and a comparison interface allowing users to assess local references against a curated database. This work offers a scalable, reproducible solution to the reference genome compatibility crisis, enabling improved transparency, reuse, and integration in genomic analyses. Refget seqcol enhances interoperability across tools and datasets, making genomic research more robust and reproducible.
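The content-based identifiers and digest-based comparisons described above can be illustrated with a minimal sketch. This is not the authors' Python package and not a conformant implementation of the standard. It assumes the GA4GH-style digest (SHA-512 truncated to 24 bytes, base64url-encoded) and uses `json.dumps` with sorted keys as a stand-in for a proper canonical JSON serialization. The function names (`sha512t24u`, `canonical`, `collection_digests`, `compare`) and the toy collections are hypothetical, chosen only for illustration.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, base64url-encoded (32 characters)."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical(value) -> bytes:
    # Assumption: a simple stand-in for a canonical JSON serialization.
    # A conformant implementation would follow the serialization rules
    # given in the specification itself.
    return json.dumps(
        value, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def collection_digests(collection: dict) -> dict:
    """Digest each attribute array, then digest the object of attribute digests."""
    attributes = {name: sha512t24u(canonical(values)) for name, values in collection.items()}
    return {"top": sha512t24u(canonical(attributes)), "attributes": attributes}


def compare(a: dict, b: dict) -> dict:
    """Per-attribute comparison: exact identity versus an order-relaxed match."""
    return {
        name: {
            "identical": a[name] == b[name],
            "same_elements_any_order": sorted(a[name]) == sorted(b[name]),
        }
        for name in sorted(set(a) & set(b))
    }


# Toy collections (made up for illustration, not real reference genomes).
ref_a = {"names": ["chr1", "chr2"], "lengths": [100, 200]}
ref_b = {"names": ["chr2", "chr1"], "lengths": [200, 100]}  # same content, reordered
ref_c = {"names": ["1", "2"], "lengths": [100, 200]}        # different naming scheme

digests_a = collection_digests(ref_a)
digests_b = collection_digests(ref_b)
digests_c = collection_digests(ref_c)
```

In this sketch, `ref_a` and `ref_b` receive different top-level digests because order matters for exact identity, while `compare(ref_a, ref_b)` reports an order-relaxed match on every attribute. `ref_a` and `ref_c` share the same `lengths` digest despite having different names, which is the kind of partial compatibility the comparison protocol is meant to surface.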