Breaking Through Biology's Data Wall: Expanding the Known Tree of Life by Over 10x using a Global Biodiscovery Pipeline

Abstract

Advancements in the life sciences have always been built upon our collective understanding of life on Earth. Now, the rise of generative biology - the use of AI foundation models to design, generate, and annotate proteins, pathways and therapeutics - is creating unprecedented demand for large, diverse biological sequence datasets. While a limited subset of such data can be generated in clinical or laboratory settings, the vast majority of the training data for unsupervised models must be sourced from the natural world - the product of nearly four billion years of evolutionary history. However, the public databases that currently supply this data, while foundational to research, were established to aggregate results from academic experiments, not as training datasets for machine learning. Their human-centric data structure limits model performance due to redundancy, taxonomic and geographic bias, limited biological context, and inconsistent provenance. With 68% of all sequence data in the SRA database coming from just 5 species, this is one of the most severe class imbalance problems ever encountered in AI. Legal and infrastructural constraints further exacerbate this bottleneck. To address these limitations and support scalable model training, we introduce BaseData™: the largest and fastest-growing biological sequence database ever built, and the first purpose-built for training foundation models. As of late 2024, BaseData™ contained 9.8 billion novel genes across more than 1 million newly discovered species, representing more than a 10-fold expansion in known protein diversity after accounting for redundancy. Its partnership-driven data supply chain across 26 countries and autonomous regions enables growth of over 2 billion novel genes per month, far exceeding public repositories. All data is collected under benefit sharing agreements using standardized protocols and structured using graph-based, ontology-rich metadata that preserves evolutionary context. 
BaseData™ represents a new, ethically grounded infrastructure for training biological foundation models, complementing public efforts and enabling the next era of generative biology.

Article activity feed

  1. DNA is sequenced to depths targeted to maximize diversity capture using a combination of Oxford Nanopore and Illumina for long and short reads, respectively, allowing for the generation of high quality and high contiguity genomic assemblies.

    The combination of ONT and Illumina is great. I wondered whether you have found a trade-off between maximising the diversity captured (i.e., retaining reads that genuinely differ) and minimising the retention of reads whose sequencing errors make them look artificially dissimilar. Presumably, walking the line between the two is critical to avoid over-inflating diversity estimates and to retain only confident, 'true' standing diversity. I would love to know more about how you navigate this!

  2. the Basecamp Research supply chain allows royalty disbursements to be triggered at the point of data use and not only at the point of final product commercialisation

    I believe that a profit-sharing model for the country of origin of biodiversity has to be central to the commodification of biological diversity. I am curious about a couple of practical aspects of your implementation. Firstly, how do you determine the 'value' of the data, and therefore the royalties owed, at the point of use prior to commercialization (is there a minimum royalty that is immediately owed to the country of origin at the point of use)? Secondly, I couldn't find a description in the manuscript of what constitutes a royalty vs. profit from the use of a sequence. When you say that 100% of royalties will go to data source A when a natural sequence is used, how does this compare with the profit gleaned from products developed from that sequence? Without this clarity, it remains rather opaque how much countries are truly being compensated (my impression is that 'royalties' models of compensation have rightly long been criticized in other sectors for their opacity and for underweighting small to mid-size contributors).

  3. Each sequence within BaseData is also embedded within a deep metadata layer capturing environmental, chemical, and physical parameters, as well as genomic and metagenomic context.

    Given that the strength of biological foundation models will lie in their breadth of understanding, how do you balance sampling previously unsampled or sparsely sampled environments (which presumably contribute substantially to new taxa/sequences) against sampling less unique environments that exhibit more homogeneous taxonomic diversity, so as to capture standing patterns of biological variation? I would imagine that capturing that standing variation is also an important component of understanding biology as a whole. Presumably, models will fail to generalize and will overweight the prevalence of novelty if novel environments are sampled more selectively than other environments?

  4. This novelty extends beyond sequence space into taxonomic space: BaseData includes over 1 million new species, as defined by unique Operational Taxonomic Units, not found in GTDB or OMG, highlighting its unprecedented contribution to species-level discovery

    Increasing the breadth of sampling to this extent is fantastic. I was wondering whether you have an estimate of the increase in total phylogenetic branch length across the data that results from adding these taxa. I'm also curious as to whether these 'species' are all microbes or whether you also pick up DNA from macro-organisms and, if so, how the increase in 'traditionally' described species compares with the increase measured using OTUs.

  5. Analyses across diverse commercially-relevant protein families such as recombinases, hydrolases, and ATP synthases demonstrate that BaseData consistently captures more sequence-level and phylogenetic diversity than any existing dataset, finding more potential starting points for biological functions important for the development of new therapeutics and industrial solutions

    It's hard to parse how much phylogenetic diversity BaseData actually adds (and how it's distributed) with just these plots as evidence. Quantification would be useful.

    What proportion of phylogenetic diversity in GTDB/OMG is captured? How much new phylogenetic diversity is contributed? How many novel clades without GTDB/OMG ancestry are added? Crucially, how are these patterns distributed within BaseData? Are phylogenetic diversity gains evenly distributed over proteins/genes?
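
One way to put numbers on the questions in comment 5 (and on the branch-length question in comment 4) is Faith's phylogenetic diversity (PD): the summed branch length of the subtree connecting a set of tips to the root. The block below is only a minimal sketch of that idea, not the authors' method. It assumes a rooted tree with branch lengths is available; the toy tree, the tip names (`ref*` standing in for GTDB/OMG sequences, `new*` for BaseData-only sequences), and all function names are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Hypothetical toy tree: node -> (parent, branch length to parent).
# The root has parent None and a branch length of 0.
Tree = Dict[str, Tuple[Optional[str], float]]

TOY_TREE: Tree = {
    "root": (None, 0.0),
    "A": ("root", 1.0),
    "B": ("root", 2.0),
    "ref1": ("A", 0.5),
    "ref2": ("A", 0.5),
    "new1": ("A", 0.25),
    "new2": ("B", 1.0),
    "new3": ("B", 1.5),
}


def ancestry(tree: Tree, tips: Iterable[str]) -> Set[str]:
    """Return every node on a path from any of the given tips to the root."""
    nodes: Set[str] = set()
    for tip in tips:
        node: Optional[str] = tip
        # Stop early once we reach a node that was already visited.
        while node is not None and node not in nodes:
            nodes.add(node)
            node = tree[node][0]
    return nodes


def faith_pd(tree: Tree, tips: Iterable[str]) -> float:
    """Rooted Faith's PD: total branch length spanned by the tips."""
    return sum(tree[node][1] for node in ancestry(tree, tips))


def maximal_novel_clades(
    tree: Tree, reference_tips: Iterable[str], new_tips: Iterable[str]
) -> Set[str]:
    """Topmost nodes whose lineage contains no reference tip."""
    ref_nodes = ancestry(tree, reference_tips)
    clade_roots: Set[str] = set()
    for tip in new_tips:
        if tip in ref_nodes:
            continue
        node = tip
        # Climb while the parent is still outside the reference ancestry.
        while tree[node][0] is not None and tree[node][0] not in ref_nodes:
            node = tree[node][0]
        clade_roots.add(node)
    return clade_roots


if __name__ == "__main__":
    reference = ["ref1", "ref2"]
    novel = ["new1", "new2", "new3"]

    pd_reference = faith_pd(TOY_TREE, reference)
    pd_combined = faith_pd(TOY_TREE, reference + novel)

    print("PD of reference tips:", pd_reference)
    print("PD gained by adding new tips:", pd_combined - pd_reference)
    print("Share of combined PD already in reference:", pd_reference / pd_combined)
    print("Maximal novel clades:", sorted(maximal_novel_clades(TOY_TREE, reference, novel)))
```

On this toy tree the reference tips span 2.0 units of branch length and the combined set spans 6.75, so the new tips add 4.75 units, grouped into two maximal novel clades (`B` and `new1`). Running the same calculation per protein family would show whether the gains are evenly distributed or concentrated in a few families.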