FAIR Header Reference genome: a TRUSTworthy standard

Adam Wright
Mark D Wilkinson
Christopher Mungall
Scott Cain
Stephen Richards
Paul Sternberg
Ellen Provin
Jonathan L Jacobs
Scott Geib
Daniela Raciti
Karen Yook
Lincoln Stein
David C Molik

Abstract

The lack of interoperable data standards among reference genome data-sharing platforms inhibits cross-platform analysis while increasing the risk of data provenance loss. Here, we describe the FAIR bioHeaders Reference genome (FHR), a metadata standard guided by the principles of Findability, Accessibility, Interoperability and Reuse (FAIR) in addition to the principles of Transparency, Responsibility, User focus, Sustainability and Technology. The objective of FHR is to provide an extensive set of data serialisation methods and minimum data field requirements while still maintaining extensibility, flexibility and expressivity in an increasingly decentralised genomic data ecosystem. The effort needed to implement FHR is low; FHR’s design philosophy ensures easy implementation while retaining the benefits gained from recording both machine and human-readable provenance.

Version published to 10.1093/bib/bbae122
Mar 27, 2024
Arcadia Science
Dec 19, 2023

Yeah, I could see how that could be confusing. We'll update in the new version.

I should note, though, that although we don't talk about it in the paper, we are going to be writing other FAIR-bioHeaders for transcriptomes, proteomes and some of the other things you mentioned, so keep in contact.

Arcadia Science
Dec 19, 2023

It really depends; some, maybe. But for the vast majority of reference genomes, the sequence header/name will be some variation on

> Contig #

or

> Scaffold #

Adding all of the metadata to every sequence header would be a problem: the metadata tends to be quite long, it is file-level information being put at a sequence or data-point level, and it would be repeated many times, which could cause errors.

Arcadia Science
Dec 19, 2023

NCBI metadata is not a reproducible standard, nor is it contained within the FASTA reference genome file itself.

FHR mostly aligns with NCBI. We'd like to work towards a solution with NCBI on this, rather than in opposition to them.

Arcadia Science
Dec 19, 2023

Most of these resources are involved in some way with either AgBioData or the Alliance of Genome Resources (the entire author line belongs to one, the other, or both). So now begins a long process of meeting with, talking to, and creating material for these resources, but at least we are primed to have those conversations.

As far as buy-in goes, FHR might be some work to implement, but we are trying to lower that barrier, and we hope that the databases see the benefit of having it.

Arcadia Science
Dec 19, 2023

We're working on it. The current thinking is to return the legacy feature so that the analysis environment is primed for FHR (i.e. this is going to be a multistep, years-in-the-making process).

Arcadia Science
Dec 19, 2023

Thanks for the catch. There is no standard way to write a sequence name; this is a wording issue, and we will update it in the next version.

Arcadia Science
Dec 19, 2023
This is certainly a concern. Really, this is the innovation of the entire paper. Imagine you want to make a FASTA-level header for reference genome assembly provenance/metadata. No matter what you do, you will have only three options:

1. Store the metadata in a secondary file
2. Store the metadata in the reference genome assembly itself
3. Create a new way to store genome assemblies

Those are the only three options. Option 1 is the least intrusive to existing pipelines, followed by 2, and then 3. But in order to lessen the risk of data loss (and to really achieve the benefits of having this metadata), the metadata has to be stored in the file. The question then becomes how you do that in the least intrusive way possible. It's going to hurt no matter what, but we can lessen the hurt: we use the legacy comment (some libraries already have this, and it creates a handy secondary character to be able to remove the header if you absolutely need to), we have another character for future-proofing in case file-level comments become more widely used, and we create tooling to remove the header. If we do as much as we can to lessen that hurt, then hopefully the barrier to entry is low enough that the benefits of having the metadata outweigh the hurt of implementing it.
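
As a rough illustration of option 2, here is a minimal Python sketch of separating an in-file header from the sequence records. It is not the FHR tooling mentioned above. The `;~` prefix (the legacy semicolon plus a second marker character), the `key: value` layout, and the field names `genome` and `version` are all assumptions made for this example; the FHR specification defines the real serialisation.

```python
from typing import Dict, Iterable, List, Tuple


def split_header(lines: Iterable[str], prefix: str = ";~") -> Tuple[Dict[str, str], List[str]]:
    """Split leading header comment lines from the rest of a FASTA file.

    The prefix and the "key: value" layout are assumptions for illustration.
    Only comment lines at the top of the file are treated as header lines.
    """
    metadata: Dict[str, str] = {}
    body: List[str] = []
    in_header = True
    for line in lines:
        if in_header and line.startswith(prefix):
            # Split on the first colon only, so values may contain colons.
            key, _, value = line[len(prefix):].partition(":")
            metadata[key.strip()] = value.strip()
        else:
            # The first non-header line ends the header block.
            in_header = False
            body.append(line)
    return metadata, body


# Hypothetical file content; the field names are placeholders.
example = [
    ";~genome: Example organism\n",
    ";~version: 0.0.1\n",
    ">Contig1\n",
    "ACGT\n",
]
metadata, body = split_header(example)
```

Because the header lives in the same file as the sequences, `body` can be handed to tools that do not understand comments while `metadata` keeps the provenance available to those that do.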
Arcadia Science
Dec 19, 2023

We wanted to avoid giving out the timeline in our paper, but yes, we have updated JBrowse 2 (which we talk about) and are currently working with MicroPubs and, to a lesser degree, KBase on their pipelines. We also have plans to update, upstream, some of the more common FASTA-reading libraries to get the comments back in; that work should be starting in the spring.

We'd also like to run some meta-analysis on these headers (but of course we'll need adoption to do that).

I think this line speaks, to a large degree, to how community-driven standards get adopted, and how much the authors are willing to collaborate and buy in to make this happen.

Arcadia Science
Dec 19, 2023

Thanks! We've spent a lot of time thinking about this.

Arcadia Science
Dec 19, 2023

Fixed in the upcoming version.

Arcadia Science
Dec 19, 2023

Fixed in the upcoming version.

Arcadia Science
Dec 19, 2023

Updated in the upcoming version.

Arcadia Science
Dec 19, 2023

We are going to update this sentence to get around this.

Arcadia Science
Dec 18, 2023

In the future, we will work with the bioinformatics community to adopt standard pipelines to handle FHR-containing FASTA files. This will involve adding logic to existing FASTA software libraries to handle comments.

Have you started work on this? Have you had any community buy-in? Do you have realistic timelines and goals for achieving this?

Arcadia Science
Dec 18, 2023

Unfortunately, this is not always the case. Although some modern FASTA-consuming tools recognise and ignore semicolon-based FASTA comments, most do not. Fortunately, it is trivially easy to strip comments out of a FASTA file by removing lines that begin with semicolons. Users of FHR-enabled FASTA files may need to add this preprocessing step to their nucleic acid analysis pipelines before passing the file to downstream tools.

This is a massive drawback and essentially makes the addition of all of this provenance info moot at worst and irrelevant to many FASTA files at best.

Would it be possible to design a metadata standard that didn't rely on a header that would make most tools unable to process the data? Could it be extensible to other types of FASTAs?
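
For concreteness, the preprocessing step quoted above (removing lines that begin with a semicolon) could be sketched in Python roughly as follows. The function name and the sample lines are made up for illustration, and the sketch assumes comments always occupy whole lines.

```python
from typing import Iterable, Iterator


def strip_fasta_comments(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines unchanged, skipping lines that begin with a semicolon."""
    for line in lines:
        if line.startswith(";"):
            continue  # comment line: drop it
        yield line


# Hypothetical input: one comment line followed by two sequence records.
example = [
    ";~hypothetical_field: example value\n",
    ">Contig1\n",
    "ACGTACGT\n",
    ">Contig2\n",
    "TTGACA\n",
]
cleaned = list(strip_fasta_comments(example))
```

The same generator can wrap an open file handle, so a large genome can be streamed to a downstream tool without loading it fully into memory.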

Arcadia Science
Dec 18, 2023

)

double parentheses typo

Arcadia Science
Dec 18, 2023

There is no formal way to add additional information to the sequence-level header line.

Can you expand on what you mean by this?

Arcadia Science
Dec 18, 2023

materials the

missing comma

Arcadia Science
Dec 18, 2023

legacy features

Will new tools be compatible with these legacy features? I'll be curious as I continue reading the paper whether you have tried using a FASTA with this header with popular tools (BWA, seqkit, seqtk, samtools, etc)

Arcadia Science
Dec 18, 2023

Several organism-focused genome data portals, such as AgBase (18), FlyBase (19), SoyBase (20), wFleaBase (21), WormBase (22), VectorBase (23), Ensembl (24), and others (25), publish annotations that are not found in the NCBI Assembly database. In some cases, these annotations and associated genomes cannot be submitted due to data ownership conflicts. These genome browsers and data repositories are often associated with a larger consortium that is working to answer questions of interest to the relevant scientific communities. Examples of such consortiums are the i5k (26, 27) Workspace (28), a collaborative effort to annotate arthropod genomes, and the Alliance of Genome Resources (The Alliance) (29) a centralised resource Model Organism resource.

Do you need buy-in from each of these communities for your metadata standard to be a success? How do you plan to get that buy-in?

Arcadia Science
Dec 18, 2023

Reducing discrepancies between genome references for the “same” organism can be aided by improving our ability to include crucial metadata about the origins of and means by which each genome reference is created in-line with the sequence data itself.

Does NCBI fundamentally not allow for these metadata fields, or are they not inserted by the users who upload the data? I think that while creating a new metadata standard (as presented here) could in theory solve some of these issues, compliance by those who upload genomes will always be an issue no matter what standard is used. I think researchers default to not including information when they are unsure about that information, or unsure of themselves at the time of upload, which has historically been a rather stressful process.

Arcadia Science
Dec 18, 2023

,

typo :)

Arcadia Science
Dec 18, 2023

Differences can arise when a reference genome is replicated across platforms or devices (e.g. renaming of files or contigs, removal of contigs that fail to meet some criteria such as minimum length, the removal and addition of metadata, etc.) leading to a gradual divergence of reference genome files and their metadata (i.e., the genome data and metadata divergence problem, divergence problems are described by Haslhofer 2010 (13)).

These are really great examples!

Arcadia Science
Dec 18, 2023

all provenance information must come from external sources and be linked to the file name or checksum.

I think NCBI handles much of this by putting the information directly in the FASTA header for each contig. The accession itself given to each contig creates a link but I think it does act as a small tracker of provenance

Arcadia Science
Dec 18, 2023

Schoof 2003 and Niu 2022

These citations follow a different format

Arcadia Science
Dec 18, 2023

transcription

Would you be willing to use a synonym here? After reading the abstract and first paragraph, I'm searching for clues that this metadata standard might also apply to other types of sequencing data (other FASTAs with e.g. assembled transcripts, amino acid sequences, genes, etc.), and seeing "transcription" here, where it does not refer to the central dogma, is a little distracting.

Version published to 10.1101/2023.11.29.569306 on bioRxiv
Dec 1, 2023
