FAIR Header Reference genome: a TRUSTworthy standard

Abstract

The lack of interoperable data standards among reference genome data-sharing platforms inhibits cross-platform analysis while increasing the risk of data provenance loss. Here, we describe the FAIR bioHeaders Reference genome (FHR), a metadata standard guided by the principles of Findability, Accessibility, Interoperability and Reuse (FAIR) in addition to the principles of Transparency, Responsibility, User focus, Sustainability and Technology. The objective of FHR is to provide an extensive set of data serialisation methods and minimum data field requirements while still maintaining extensibility, flexibility and expressivity in an increasingly decentralised genomic data ecosystem. The effort needed to implement FHR is low; FHR’s design philosophy ensures easy implementation while retaining the benefits gained from recording both machine and human-readable provenance.

Article activity feed

  1. Yeah, I could see how that could be confusing. We'll update in the new version.

    I should note, though, that although we don't talk about it in the paper, we are going to be writing other FAIR-bioHeaders for transcriptomes, proteomes and some of the other things you mentioned, so keep in contact.

  2. It really depends. Some might, but for the vast majority of reference genomes the sequence header/name will be some variation of

    > Contig #

    or

    > Scaffold #

    Adding all of the metadata to every sequence header would be a problem: the metadata tends to be quite long, it would put file-level information at the sequence (data-point) level, and it would be repeated many times, which could introduce errors.

  3. NCBI metadata is not a reproducible standard, nor is it contained within the FASTA reference genome file itself.

    FHR mostly aligns with NCBI. We'd like to work towards a solution with NCBI on this, rather than in opposition to them.

  4. Most of these resources are involved in some way with AgBioData, with the Alliance of Genome Resources, or with both (everyone on the author line belongs to one, the other, or both). So now begins a long process of meeting with, talking to, and creating material for these resources, but at least we are primed to have those conversations.

    As far as buy-in goes, FHR might be some work to implement, but we are trying to lower that barrier, and we hope that the databases see the benefit of having it.

  5. We're working on it. The current thinking is to bring back the legacy feature so that the analysis environment is primed for FHR (i.e. this is going to be a multistep, years-in-the-making process).

  6. This is certainly a concern, and really this is the innovation of the entire paper. Imagine you want to make a FASTA-level header for reference genome assembly provenance/metadata. No matter what you do, you have only three options:

    1. Store the metadata in a secondary file
    2. Store the metadata in the reference genome assembly itself
    3. Create a new way to store genome assemblies

    Those are the only three options. Option 1 is the least intrusive to existing pipelines, followed by 2, and then 3. But to lessen the risk of data loss (and to really achieve the benefits of having this metadata), the metadata has to be stored in the file. The question then becomes how to do that in the least intrusive way possible. It's going to hurt no matter what, but we can lessen the hurt: we use the legacy comment (some libraries already support it, and it creates a handy secondary character that can be used to remove the header if you absolutely need to), we have another character for future-proofing in case file-level comments become more widely used, and we create tooling to remove the header. If we do as much as we can to lessen that hurt, then hopefully the barrier to entry is low enough that the benefits of having the metadata outweigh the hurt of implementing it.

  7. We wanted to avoid giving out the timeline in our paper, but yes, we have updated JBrowse 2 (which we talk about) and are currently working with MicroPubs, and to a lesser degree KBase, on their pipelines. We also have plans to update some of the more common FASTA-reading libraries upstream to get the comments back in; that work should be starting in the Spring.

    We'd also like to run some meta-analysis on these headers (but of course we'll need adoption to do that).

    I think this line speaks to a large degree to how community-driven standards get adopted, and how much the authors are willing to collaborate and buy in to make this happen.

  8. In the future, we will work with the bioinformatics community to adopt standard pipelines to handle FHR-containing FASTA files. This will involve adding logic to existing FASTA software libraries to handle comments.

    Have you started work on this? Have you had any community buy-in? Do you have realistic timelines and goals for achieving this?
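As a rough illustration of what "adding logic to handle comments" could look like in a FASTA reader, here is a minimal sketch. It assumes only what the quoted passage states, namely that comment lines begin with a semicolon. The function name `read_fasta_with_comments` and the example input are hypothetical and do not come from the paper or from any existing library. Unlike a filter that discards comments, this reader returns them alongside the sequences so that the metadata stays available to the caller.

```python
from typing import Dict, Iterable, List, Tuple


def read_fasta_with_comments(lines: Iterable[str]) -> Tuple[List[str], Dict[str, str]]:
    """Parse FASTA text, collecting ';' comment lines instead of failing on them.

    Returns (comments, sequences). Each comment has its leading ';' removed;
    sequences maps the text after '>' to the concatenated sequence lines.
    """
    comments: List[str] = []
    parts: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(";"):
            # Comment line: keep it rather than treating it as sequence data.
            comments.append(line[1:])
        elif line.startswith(">"):
            current = line[1:].strip()
            parts[current] = []
        elif line and current is not None:
            parts[current].append(line.strip())
    sequences = {name: "".join(chunks) for name, chunks in parts.items()}
    return comments, sequences


# Hypothetical input; the comment text is a placeholder, not real FHR content.
example_lines = [
    ";example file-level comment\n",
    ">Contig1\n",
    "ACGT\n",
    "TTAA\n",
    ">Contig2\n",
    "GG\n",
]
example_comments, example_sequences = read_fasta_with_comments(example_lines)
print(example_comments)
print(example_sequences)
```

A library that adopted something like this would still need to decide how to expose the comments through its existing API, which is a design question this sketch does not address.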

  9. Unfortunately, this is not always the case. Although some modern FASTA-consuming tools recognise and ignore semicolon-based FASTA comments, most do not. Fortunately, it is trivially easy to strip comments out of a FASTA file by removing lines that begin with semicolons. Users of FHR-enabled FASTA files may need to add this preprocessing step to their nucleic acid analysis pipelines before passing the file to downstream tools.

    This is a massive drawback and essentially makes the addition of all of this provenance info moot at worst and irrelevant to many FASTA files at best.

    Would it be possible to design a metadata standard that didn't rely on a header that would make most tools unable to process the data? Could it be extensible to other types of FASTAs?
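The preprocessing step described in the quoted passage (removing lines that begin with semicolons) can be sketched as below. This is a minimal example under the assumption that every comment line starts with `;` in its first column. The helper name `strip_semicolon_comments` and the sample text are hypothetical, and the two comment lines merely stand in for a file-level header.

```python
import io
from typing import Iterable, Iterator


def strip_semicolon_comments(lines: Iterable[str]) -> Iterator[str]:
    """Yield every line except those whose first character is ';'."""
    for line in lines:
        if not line.startswith(";"):
            yield line


# Hypothetical input: the ';' lines are placeholders for a file-level header.
example = io.StringIO(
    ";example file-level comment\n"
    ";another comment line\n"
    ">Contig1\n"
    "ACGTACGT\n"
    ">Contig2\n"
    "GGCCTTAA\n"
)
cleaned = "".join(strip_semicolon_comments(example))
print(cleaned, end="")
```

Because the filter works line by line, it can sit in front of a downstream tool without loading the whole genome into memory. The cost is the one the reviewer raises: the stripped file no longer carries its provenance.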

  10. legacy features

    Will new tools be compatible with these legacy features? As I continue reading the paper, I'll be curious whether you have tried using a FASTA file with this header with popular tools (BWA, seqkit, seqtk, samtools, etc.).

  11. Several organism-focused genome data portals, such as AgBase (18), FlyBase (19), SoyBase (20), wFleaBase (21), WormBase (22), VectorBase (23), Ensembl (24), and others (25), publish annotations that are not found in the NCBI Assembly database. In some cases, these annotations and associated genomes cannot be submitted due to data ownership conflicts. These genome browsers and data repositories are often associated with a larger consortium that is working to answer questions of interest to the relevant scientific communities. Examples of such consortiums are the i5k (26, 27) Workspace (28), a collaborative effort to annotate arthropod genomes, and the Alliance of Genome Resources (The Alliance) (29) a centralised resource Model Organism resource.

    Do you need buy-in from each of these communities for your metadata standard to be a success? How do you plan to get that buy-in?

  12. Reducing discrepancies between genome references for the “same” organism can be aided by improving our ability to include crucial metadata about the origins of and means by which each genome reference is created in-line with the sequence data itself.

    Does NCBI fundamentally not allow for these metadata fields, or are they not filled in by the users who upload the data? While creating a new metadata standard (as presented here) could in theory solve some of these issues, I think compliance by those who upload genomes will always be an issue no matter what standard is used. I think researchers default to not including information when they are unsure about that information, or unsure of themselves at the time of upload, which has historically been a rather stressful process.

  13. Differences can arise when a reference genome is replicated across platforms or devices (e.g. renaming of files or contigs, removal of contigs that fail to meet some criteria such as minimum length, the removal and addition of metadata, etc.) leading to a gradual divergence of reference genome files and their metadata (i.e., the genome data and metadata divergence problem, divergence problems are described by Haslhofer 2010 (13)).

    These are really great examples!

  14. all provenance information must come from external sources and be linked to the file name or checksum.

    I think NCBI handles much of this by putting the information directly in the FASTA header for each contig. The accession given to each contig creates a link, and I think it does act as a small tracker of provenance.
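To make the quoted concern about checksum-linked provenance concrete, here is a small sketch of an external provenance record tied to a file by checksum. It shows how a single contig rename breaks that link. The choice of SHA-256, the record layout, and all names (`sha256_of_text`, `sidecar`, `provenance_still_applies`) are assumptions made for illustration. The quoted sentence does not specify a hash function or a record format.

```python
import hashlib


def sha256_of_text(text: str) -> str:
    """Return the hex SHA-256 digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


original = ">Contig1\nACGTACGT\n"
renamed = ">scaffold_1\nACGTACGT\n"  # same sequence, contig renamed

# Hypothetical external ("sidecar") record linked to the file by checksum.
sidecar = {
    "checksum": sha256_of_text(original),
    "note": "placeholder provenance record",
}


def provenance_still_applies(text: str, record: dict) -> bool:
    """True only if the text is byte-for-byte the one the record describes."""
    return sha256_of_text(text) == record["checksum"]


print(provenance_still_applies(original, sidecar))  # True
print(provenance_still_applies(renamed, sidecar))   # False
```

The second check fails even though the sequence itself is unchanged, which is one way the divergence between a genome file and its external metadata can arise.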

  15. transcription

    Would you be willing to use a synonym here? After reading the abstract and first paragraph, I'm searching for clues that this metadata standard might also apply to other types of sequencing data (other FASTAs with e.g. assembled transcripts, amino acid sequences, genes, etc.), and seeing "transcription" here when it does not refer to the central dogma is a little distracting.