Persistent hindrances to data re-use in single-cell genomics
Abstract
We report on our experience attempting to re-use published and publicly available single-cell (or single-nucleus) RNA-sequencing (scRNA-seq) studies from the Gene Expression Omnibus (GEO). We screened GEO for human, mouse and rat scRNA-seq studies as potential candidates for inclusion in the Gemma database of re-annotated and re-analyzed transcriptome studies. Using semi-automated and manual curation, we assessed whether GEO datasets included cell-level expression count matrices and cell-type annotations. We found that there are steep challenges to data re-use. Only ∼40% of studies provided readily usable processed count data that could be reliably mapped to GEO metadata, and fewer than 10% included author-provided cell-type annotations. While raw sequencing data were available for the majority of studies, only a small proportion could be re-analyzed automatically without reliance on heuristics. Our findings show that existing practices for scRNA-seq data distribution and sharing are insufficient for effective re-use, and highlight the urgent need for repositories to strengthen and enforce submission requirements, particularly for processed data and cell-type annotations.