Rapid terabase-scale simulation of realistic metagenomes for experimental design and pathogen detection with RandomReadsMG

Abstract

Accurate simulation of metagenomic sequencing data is essential for benchmarking new algorithms, guiding experimental design, and generating synthetic data to develop and train AI/ML models. However, existing simulators often struggle to reproduce the uneven coverage depth patterns seen in real microbial communities, can be difficult to install, or incur long runtimes. Here we present RandomReadsMG, a fast and scalable read simulator designed to generate synthetic metagenomes with realistic, user-defined depth distributions from hundreds of genomes in a single command. The tool supports multiple abundance models and intra-genome depth variability to better mimic real data. Benchmarks show that RandomReadsMG runs orders of magnitude faster than comparable software while maintaining constant memory usage regardless of input dataset size. We demonstrate its utility by determining pathogen detection thresholds in a complex microbiome, showcasing its value for optimizing experimental design and creating robust training data sets for bioinformatics and AI/ML. RandomReadsMG is open-source software, distributed as part of the BBTools suite. The full software package can be downloaded from SourceForge (https://sourceforge.net/projects/bbmap/), and a Docker image for containerized deployment is available from Docker Hub (https://hub.docker.com/r/bryce911/bbtools).
