AllTheBacteria – all bacterial genomes assembled, available, and searchable

Martin Hunt
Leandro Lima
Daniel Anderson
George Bouras
Michael Hall
Jane Hawkey
Oliver Schwengers
Wei Shen
John A. Lees
Zamin Iqbal


Abstract

The bacterial sequence data publicly available via the global DNA archives is a vast potential source of information on the evolution of bacteria. However, most of this sequence data is unassembled, or, where assembled, was produced with no consistent assembler or quality control. Although the data has great potential, these inconsistencies make it unsuitable for large-scale analyses and inaccessible for most researchers to reuse. In our previous effort, we therefore released a uniformly assembled set of 661,405 genomes, consisting of all publicly available whole-genome-sequenced bacterial isolate data up to a cutoff of November 2018, enriched with various search indexes to make the data easier to sort and use. In this study, we first extend the dataset up to August 2024 with the same consistent assembly pipeline, more than tripling the number of genomes available. We also expand the scope of the dataset beyond genomes, as we begin a global collaborative project to generate annotations, species-specific analyses, evolutionary data, new search indexes, and protein structural data. The collaboration is grass-roots, driven by the needs of different research communities within microbiology.

In this paper, we describe the project as of release 2024-08, comprising 2,440,377 assemblies. All 2.4 million genomes have been uniformly reprocessed for quality criteria and to give taxonomic abundance estimates with respect to the GTDB phylogeny. We further enrich the dataset with sequence annotations from Bakta, antimicrobial resistance predictions from AMRFinderPlus, and AlphaFold2 protein structure predictions for the 17.7M hypothetical proteins. By applying an evolution-informed compression approach, the full set of genomes is just 130Gb: a reduction of approximately 23x compared to compressing individual assemblies. To make the resource as accessible as possible, we also provide multiple search indexes, a method for alignment to the full dataset, and cloud-based access to all the genomes.
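The intuition behind compressing related genomes together rather than one at a time can be illustrated with a small toy example. The sketch below is not the project's pipeline: the synthetic sequences, the mutation rate, the helper names (`mutate`, `make_toy_genomes`, `size_individual`, `size_batched`) and the use of Python's standard `lzma` module are all assumptions made purely for illustration. It builds a few "clades" of near-identical sequences and compares the total size when each sequence is compressed separately with the size when the related sequences share one compression stream.

```python
import lzma
import random

BASES = "ACGT"


def mutate(seq, rate, rng):
    """Return a copy of seq in which each base is redrawn with probability `rate`."""
    return "".join(rng.choice(BASES) if rng.random() < rate else b for b in seq)


def make_toy_genomes(n_clades=4, per_clade=8, length=20000, seed=0):
    """Build synthetic 'genomes': each clade shares one random ancestor (hypothetical data)."""
    rng = random.Random(seed)
    genomes = []
    for _ in range(n_clades):
        ancestor = "".join(rng.choice(BASES) for _ in range(length))
        # Members of a clade are appended consecutively, so relatives sit next to each other.
        for _ in range(per_clade):
            genomes.append(mutate(ancestor, 0.01, rng))
    return genomes


def size_individual(genomes):
    """Total bytes when every genome is compressed as its own stream."""
    return sum(len(lzma.compress(g.encode())) for g in genomes)


def size_batched(genomes):
    """Bytes when all genomes, kept in clade order, share a single compression stream."""
    return len(lzma.compress("\n".join(genomes).encode()))


genomes = make_toy_genomes()
individual = size_individual(genomes)
batched = size_batched(genomes)
print(f"individual streams: {individual} bytes")
print(f"single batched stream: {batched} bytes")
print(f"reduction: {individual / batched:.1f}x")
```

In a shared stream the compressor can encode each later sequence largely as references to a relative it has already seen, which is why the batched size is smaller. The toy ratio says nothing about real data. Taken at face value, the figures quoted above (130Gb after an approximately 23x reduction) would put the individually compressed assemblies at roughly 130 × 23 ≈ 3,000Gb.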

The AllTheBacteria data ( https://allthebacteria.org/ ) has already been used independently in multiple other analyses. Our goal is to make this a self-sustaining, community-driven resource that increases the accessibility and reuse of bacterial genomes for a wide range of purposes.

Version published to 10.1101/2024.03.08.584059 on bioRxiv
Mar 11, 2024

Related articles

Shotgun metagenomics: a deep insight into the composition and function of the complex microbial world

This article has 7 authors:
1. Grazia Visci
2. Elisabetta Notario
3. Giuseppe Defazio
4. Mariano Francesco Caratozzolo
5. Bruno Fosso
6. Marinella Marzano
7. Graziano Pesole
This article has no evaluations. Latest version Jan 30, 2026.

Divergent Bacteriophages from Wastewater Reveal an Open Pan-Genome with No Shared Gene Families

This article has 4 authors:
1. Malihe Hamidzade
2. Kimia Sharifian
3. Seyed Jalal Kiani
4. Alieza Mohebbi
This article has no evaluations. Latest version Dec 19, 2025.

Comparative phenotypic and genomic analysis of the methanogen Methanomethylovorans thermophila L2FAW and its phylogenomic placement within the Genome Taxonomy Database

This article has 4 authors:
1. Mathias Wunderer
2. Andja Mullaymeri
3. Andreas O. Wagner
4. Eva Maria Prem
Reviewed by Access Microbiology

This article has 5 evaluations. Latest version Jan 22, 2026. Latest activity Feb 3, 2026.
