genomesizeR: An R package for genome size prediction

Celine Mercier
Joane Elleouet
Loretta Garrett
Steve A Wakelin

Abstract

The genome size of organisms present in an environment can provide many insights into the evolutionary and ecological processes at play in that environment. The genomic revolution has enabled a rapid expansion of our knowledge of the genomes of many living organisms, and most of that knowledge is classified and readily available in the databases of the National Center for Biotechnology Information (NCBI). The genomesizeR tool leverages the wealth of taxonomic and genomic information present in NCBI databases to infer the genome size of archaeal, bacterial, or eukaryotic organisms identified at any taxonomic level. This R package applies statistical modelling to data from the most up-to-date NCBI databases and provides three statistical methods for predicting the genome size of a given taxon or group of taxa. A straightforward ‘weighted mean’ method identifies the closest taxa with available genome size information in the taxonomic tree and averages their genome sizes using weights based on taxonomic distance. A frequentist random-effects model uses nested genus and family information to output genome size estimates. Finally, a third option provides predictions from a distributional Bayesian multilevel model that uses taxonomic information from genus all the way to superkingdom, and therefore provides estimates and uncertainty bounds even for under-represented taxa.

All three methods use:

A list of queries, where each query is a taxon or a list of several taxa. The package was designed to be easy to use with data from environmental DNA experiments, but it works with any table of taxa.
A reference database of known genome sizes and their associated taxa, built from the NCBI databases and provided as a downloadable archive.
A taxonomic tree structure as built by the NCBI, provided in the same archive.

genomesizeR retrieves the taxonomic classification of input queries, estimates the genome size of each query, and provides 95% confidence intervals for each estimate.
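
The ‘weighted mean’ idea described above can be illustrated with a minimal sketch. This is not the genomesizeR implementation, and it is written in Python rather than R: the toy taxonomy, the genome sizes, the function names, the inverse-distance weighting, and the choice of the k nearest taxa are all assumptions made for illustration only.

```python
from collections import deque

# Hypothetical toy taxonomy: child -> parent (not real NCBI data).
PARENT = {
    "species_a": "genus_x",
    "species_b": "genus_x",
    "species_c": "genus_y",
    "genus_x": "family_f",
    "genus_y": "family_f",
    "family_f": None,
}

# Hypothetical genome sizes in base pairs for taxa with reference data.
GENOME_SIZE = {"species_b": 4_000_000, "species_c": 6_000_000}


def tree_distances(start, parent):
    """Breadth-first search returning edge counts from start to every taxon."""
    neighbours = {node: set() for node in parent}
    for child, par in parent.items():
        if par is not None:
            neighbours[child].add(par)
            neighbours[par].add(child)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def weighted_mean_genome_size(query, parent, sizes, k=2):
    """Average the k nearest known genome sizes, weighted by 1/distance.

    The 1/distance weighting is an assumption; the abstract only states
    that weights are based on taxonomic distance.
    """
    dist = tree_distances(query, parent)
    known = sorted((dist[t], t) for t in sizes if t in dist and t != query)
    nearest = known[:k]
    if not nearest:
        raise ValueError(f"no reference genome size reachable from {query!r}")
    weights = [1.0 / d for d, _ in nearest]
    total = sum(w * sizes[t] for w, (_, t) in zip(weights, nearest))
    return total / sum(weights)


if __name__ == "__main__":
    estimate = weighted_mean_genome_size("species_a", PARENT, GENOME_SIZE)
    print(f"Estimated genome size: {estimate:.0f} bp")
```

In this toy example the query `species_a` has no genome size of its own, so the estimate is pulled mostly towards its congener `species_b` (two edges away) and less towards `species_c` (four edges away).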

Version published to 10.1101/2024.09.08.611926 on bioRxiv
Sep 13, 2024
