scBaseCamp: an AI agent-curated, uniformly processed, and continually expanding single cell data repository

Read the full article See related articles

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

Building a virtual model of the cell is an emerging frontier at the intersection of artificial intelligence and biology, aided by the rapid growth of single-cell RNA sequencing data. By aggregating gene expression profiles from millions of cells across hundreds of studies, single cell atlases have provided a foundation for training AI-driven models of the cell. However, reliance on datasets with pre-processed counts limits the size and diversity of these repositories and constrains downstream model training to data curated for divergent purposes. This introduces analytical variability due to differences in the choice of alignment tools, genome references, and counting strategies. Here, we introduce scBaseCamp, a continuously updated single-cell RNA-seq database that leverages an AI agent-driven hierarchical workflow to automate discovery, metadata extraction, and standardized data processing. Built by directly mining and processing all publicly accessible 10X Genomics single-cell RNA sequencing reads, scBaseCamp is currently the largest public repository of single-cell data, comprising over 230 million cells spanning 21 organisms and 72 tissues. Using studies comprised of both single cell and single nucleus sequencing data, we demonstrate that uniform processing across datasets helps mitigate analytical artifacts introduced by inconsistent data processing choices. This standardized approach lays the groundwork for more accurate virtual cell models and serves as a foundation for a wide range of biological and biomedical applications.

Article activity feed