scBaseCount: an AI agent-curated, uniformly processed, and autonomously updated single cell data repository

Nicholas D. Youngblut
Christopher Carpenter
Arshia Nayebnazar
Abhinav Adduri
Rohan Shah
Chiara Ricci-Tam
Jaanak Prashar
Rajesh Ilango
Noam Teyssier
Silvana Konermann
Patrick D. Hsu
Alexander Dobin
Dave P. Burke
Hani Goodarzi
Yusuf H. Roohani

Abstract

Single-cell RNA sequencing has transformed cell biology by enabling precise transcriptomic measurements of individual cells. The Sequence Read Archive (SRA) is the largest public repository of sequencing reads, yet much of it remains underutilized due to unstandardized metadata and the cost of processing reads. Here, we introduce scBaseCount, a single-cell RNA sequencing database that leverages an AI agent to automate discovery and metadata extraction and to standardize data processing. Built by directly mining all 10x Genomics datasets from SRA, scBaseCount is the largest freely accessible public repository of single-cell gene expression data, comprising over 502 million cells across 27 organisms and 75 tissues and offering an unbiased view of the composition of data within SRA. Uniform processing enables measurement of both intronic and exonic reads as well as non-coding gene expression, and it improves both alignment across experiments and the performance of AI models trained on this phenotypically diverse data. Moreover, scBaseCount provides a blueprint for how AI can be leveraged to curate and autonomously update large biological data repositories.

Version published to 10.1101/2025.02.27.640494 on bioRxiv
Mar 4, 2025

Related articles

LLMAgent4Bio: LLM Agents for Biological Intelligence Across Genomics, Proteomics, Spatial Biology, and Biomedicine

This article has 9 authors:
1. Sajib Acharjee Dip
2. Dipanwita Mallick
3. Uddip Acharjee Shuvo
4. Shovito Barua Soummo
5. Fazle Rafsani
6. Bikash Kumar Paul
7. Nazifa Ahmed Moumi
8. Shafayat Ahmed
9. Liqing Zhang
Latest version: Dec 16, 2025

Discovering cell types and states from reference atlases with heterogeneous single-cell ATAC-seq features

This article has 2 authors:
1. Xiuwei Zhang
2. Yuqi Cheng
Latest version: Dec 10, 2025

Accurate, scalable, and unified single-cell atlas integration with scBIOT

This article has 2 authors:
1. Haihui Zhang
2. Peiwu Qin
Latest version: Jan 19, 2026
