Assessing species coverage and assembly quality of rapidly accumulating sequenced genomes

Abstract

Background

Ambitious initiatives to coordinate genome sequencing of Earth's biodiversity mean that the accumulation of genomic data is growing rapidly. In addition to cataloguing biodiversity, these data provide the basis for understanding biological function and evolution. Accurate and complete genome assemblies offer a comprehensive and reliable foundation upon which to advance our understanding of organismal biology at genetic, species, and ecosystem levels. However, ever-changing sequencing technologies and analysis methods mean that available data are often heterogeneous in quality. To guide forthcoming genome generation efforts and promote efficient prioritization of resources, it is thus essential to define and monitor taxonomic coverage and quality of the data.

Findings

Here we present an automated analysis workflow that surveys genome assemblies from the US National Center for Biotechnology Information (NCBI), assesses their completeness using the relevant BUSCO datasets, and collates the results into an interactively browsable resource. We apply our workflow to produce a community resource of available assemblies from the phylum Arthropoda, the Arthropoda Assembly Assessment Catalogue. Using this resource, we survey current taxonomic coverage and assembly quality at the NCBI, examine how key assembly metrics relate to gene content completeness, and compare results from using different BUSCO lineage datasets.

Conclusions

These results demonstrate how the workflow can be used to build a community resource that enables large-scale assessments to survey species coverage and data quality of available genome assemblies, and to guide prioritizations for ongoing and future sampling, sequencing, and genome generation initiatives.

Article activity feed

  1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac006), which carries out open, named peer-review.

    These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 2: Surya Saha

    The publication describes a useful tool to quickly survey a range of QC metrics for genomes available in NCBI. The a3cat toolkit can be used to set up as well as update the assessment results for public or private assemblies for a user-defined taxon. Overall, the website and the workflow on GitLab are a useful resource for the genomics community to ask a number of comparative genomics questions. I enjoyed reading this manuscript and only have minor comments. I would like to bring some more use cases to the attention of the authors that can enrich the discussion.

    The authors have already presented useful nuggets from mining the results, but here are a few thoughts on how their value could be further improved. Given an assembly from an insect with an approximate taxonomic classification based on morphology or genetic markers, can the a3cat results be used to identify the best reference genome, or a set of closely related genomes, for comparative analysis of the gene space? One idea could be to use the overlap between the lineage-specific BUSCO genes found in the new genome and the BUSCO genes present in other assemblies to identify related genomes (a minimal sketch of this overlap calculation is appended at the end of this review).

    The discussion covers the results when filtered by assembly level (contig, scaffold, chromosome) or type (haploid, principal or alternate pseudohaplotype). It might be worthwhile to further segment the results based on the input raw data (e.g., short reads, short reads + mate pairs, long reads) to explore whether assembly contiguity and the completeness and duplication of the gene space are affected by the proportion of indels in the raw reads, irrespective of read length. There are a number of other relevant variables, like assembly algorithm and parameters, but those can lead to very sparse data. The authors discuss the proportion of repeat content in larger genomes. This might be a valuable metric to add to the a3cat results, as initiatives like Ag100Pest and DToL are producing high-quality insect genomes of >1-2 Gbp with large numbers of repeats that are going to be better assembled than ever before with high-fidelity long reads. Adding the results of a widely used de novo repeat identification tool like RepeatModeler, based on the Dfam database, would provide a consistent measure of repeat content across all analyzed genomes and add to the value of this toolkit. In case some of this information is already available in NCBI, it can be pulled using the API, avoiding the need for a massive compute job.

    This next issue is related to BUSCO but affects the results and conclusions of the a3cat tool. Is it possible that some of the BUSCO marker genes (from OrthoDB v9 or v10) are based on short-read assemblies with minor errors in gene models? When run on recent assemblies based on high-fidelity long reads with the correctly assembled gene model, BUSCO might report the marker as missing or fragmented. I understand this is outside the scope of this paper, but if this is possible, it should be mentioned as a potential pitfall.

    A common problem with bioinformatics resources is the lack of a sustainability plan. I know this is difficult to pin down for the mid or long term in the face of unpredictable funding, but I would like to encourage the authors to present a plan to manage and update the web resource if at all possible. For future work, it might be a good idea to consider extending the a3cat toolkit to include other metrics beyond the current contiguity and gene space completeness measures. Mash or ANI distances are becoming computationally tractable for large datasets. I have already mentioned the repeat content issue. Long-range similarity measures based on Hi-C data, or nucleotide composition based on k-mer analysis, might be other items to ponder.

    Minor revisions

    Since the logic and applicability of this work are so straightforward, some of the text can be shortened to reduce duplication. For example, on page 4 the paragraph "Using their Complete Proteome…. for selected groups of species from their field of interest." can be shortened. In the same paragraph, I see "(i) aid project design, particularly in the context of comparative genomics analyses; (ii) simplify comparisons of the quality of their own data with that of existing assemblies; and (iii) provide a means to survey accumulating genomics resources of interest to their ongoing research projects." Can the difference between (i) and (iii) be clearly explained?

    Typographical errors

    On page 8, the abbreviation CoL needs an explanation.

    On page 12, can the term "span" be elaborated on?
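
    To illustrate the reference-selection idea suggested above, here is a minimal sketch (not part of the a3cat toolkit) that ranks existing assemblies by the Jaccard overlap of their complete BUSCO genes with those of a new assembly. It assumes each assembly has been assessed with BUSCO and has a standard `full_table.tsv` result file (first column: BUSCO ID; second column: status); the file paths shown are hypothetical.

    ```python
    """Rank candidate reference assemblies by overlap of complete BUSCO genes."""
    from pathlib import Path


    def complete_buscos(full_table: Path) -> set[str]:
        """Return the set of BUSCO IDs scored Complete or Duplicated in a full_table.tsv."""
        found = set()
        with open(full_table) as handle:
            for line in handle:
                if line.startswith("#"):  # skip BUSCO header comments
                    continue
                fields = line.rstrip("\n").split("\t")
                if len(fields) >= 2 and fields[1] in {"Complete", "Duplicated"}:
                    found.add(fields[0])
        return found


    def jaccard(a: set[str], b: set[str]) -> float:
        """Jaccard similarity of two BUSCO ID sets."""
        return len(a & b) / len(a | b) if a | b else 0.0


    # Hypothetical inputs: the new assembly plus a directory of existing BUSCO runs.
    new_set = complete_buscos(Path("new_assembly/run_insecta_odb10/full_table.tsv"))
    candidates = sorted(Path("a3cat_runs").glob("*/full_table.tsv"))

    # Rank existing assemblies by overlap with the new assembly's gene space.
    ranking = sorted(
        ((jaccard(new_set, complete_buscos(tbl)), tbl.parent.name) for tbl in candidates),
        reverse=True,
    )
    for score, assembly in ranking[:10]:
        print(f"{assembly}\t{score:.3f}")
    ```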

  2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac006), which carries out open, named peer-review.

    These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 1: Stephen Richards

    We are now entering a period of rapidly increasing numbers of arthropod genome assemblies. Quality has vastly improved because of new high-quality long-read technologies, but it still has a chance to be uneven.

    Comparative genomics requires at least some effort to ensure the datasets are comparable. Here the authors have produced a nice tool to help find sequenced arthropod genomes and compare their quality.

    They use their previous experience with BUSCO to measure quality, and overall I expect I will be using this resource quite a lot.

    I also expect a lot of people will use this resource to identify high quality assemblies for comparative analysis.

    One useful addition would be completeness plots: things like the number of orders with a representative assembly, families, etc., partly to show progress and partly so that missing taxa can be easily identified (a minimal sketch of such a tally is appended at the end of this review).

    The manuscript is well written, but more importantly, the data and methods are easily accessed and everything is well documented.

    The tool and website do what they say on the tin, and I can't really see any reason not to publish rapidly.
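
    As a rough illustration of the completeness-plot suggestion above, the following is a minimal sketch. It assumes an a3cat-style summary of assemblies has been exported as a TSV with per-assembly taxonomy columns; the file name and the column names 'order' and 'family' are hypothetical and may differ from the actual catalogue export.

    ```python
    """Tally taxonomic coverage of available assemblies by rank."""
    import pandas as pd

    # Hypothetical export of the assembly catalogue, one row per assembly.
    assemblies = pd.read_csv("a3cat_summary.tsv", sep="\t")

    # Count how many distinct taxa at each rank have at least one assembly.
    for rank in ("order", "family"):
        covered = assemblies[rank].dropna().nunique()
        print(f"{rank}s with at least one assembly: {covered}")

    # Assemblies per order, to highlight thinly sampled (or missing) groups.
    per_order = assemblies["order"].value_counts().sort_values()
    print(per_order.head(10))
    ```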