AllTheBacteria – all bacterial genomes assembled, available, and searchable

Abstract

The bacterial sequence data publicly available via the global DNA archives is a vast potential source of information on the evolution of bacteria. However, most of this sequence data is unassembled or, where assembled, was produced with no consistent assembler or quality control. These inconsistencies make the data unsuitable for large-scale analyses and inaccessible for most researchers to reuse. In our previous effort, we therefore released a uniformly assembled set of 661,405 genomes, consisting of all publicly available whole-genome-sequenced bacterial isolate data up to a cutoff of November 2018, enriched with various search indexes to make the data easier to sort and use. In this study, we first extend the dataset up to August 2024 using the same consistent assembly pipeline, more than tripling the number of genomes available. We also expand the scope of the dataset beyond genomes, as we begin a global collaborative project to generate annotations, species-specific analyses, evolutionary data, new search indexes, and protein structural data. The collaboration is grass-roots, driven by the needs of different research communities within microbiology.

In this paper, we describe the project as of release 2024-08, comprising 2,440,377 assemblies. All 2.4 million genomes have been uniformly reprocessed to assess quality criteria and to give taxonomic abundance estimates with respect to the GTDB phylogeny. We further enrich the dataset with sequence annotations from Bakta, antimicrobial resistance predictions from AMRFinderPlus, and AlphaFold2 protein structure predictions for the 17.7M hypothetical proteins. By applying an evolution-informed compression approach, the full set of genomes is just 130Gb: a 23-fold reduction compared to compressing individual assemblies. To make the resource as accessible as possible, we also provide multiple search indexes, a method for alignment to the full dataset, and cloud-based access to all the genomes.

The AllTheBacteria data (https://allthebacteria.org/) has already been used independently in multiple other analyses. Our goal is to make this a self-sustaining, community-driven resource that increases the accessibility and reuse of bacterial genomes for a wide range of purposes.