GEfetch2R: fetching single-cell/bulk RNA-seq data from public repositories to R and benchmarking the subsequent format conversion tools
Listed in
- Evaluated articles (GigaScience)
Abstract
Background
Downloading and reanalyzing existing single-cell RNA sequencing (scRNA-seq) data is an efficient way to gain new insights. However, no tool can fetch the diverse scRNA-seq data types (raw data, count matrix, and processed object) distributed across various repositories, process and load the downloaded data into R, convert formats between scRNA-seq objects, and benchmark the format conversion tools.
Findings
Here, we present GEfetch2R, an R package with a Docker image to (i) download diverse scRNA-seq data types, including raw data (SRA and ENA), count matrices (GEO, UCSC Cell Browser, and PanglaoDB), and processed objects (Zenodo, CELLxGENE, and HCA); (ii) process the downloaded data, load output/downloaded count matrices and annotations into R (SeuratObject/DESeqDataSet), filter the SeuratObject based on cell metadata and genes, and merge multiple SeuratObjects if applicable; (iii) convert formats between the widely used scRNA-seq objects, including SeuratObject, AnnData, SingleCellExperiment, CellDataSet/cell_data_set, and loom, and benchmark format conversion tools in terms of information kept, usability, running time, and scalability to guide tool selection. Furthermore, GEfetch2R can also download, process, and load bulk RNA-seq raw data (SRA and ENA) and count matrices (GEO) into R (DESeqDataSet).
Conclusions
GEfetch2R is an R package designed to help researchers access and explore existing gene expression data from various public repositories. It can function as a data downloader (supporting all three scRNA-seq and two bulk RNA-seq data types), a data processor (processing and loading the output/downloaded count matrices and annotations into R), and an object format converter (between the widely used scRNA-seq objects).
Article activity feed
This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag039), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
Reviewer 2:
General Comments
This manuscript introduces a tool named HVRLocator, designed to address the issue of missing or non-standard metadata in 16S rRNA sequencing data found in public databases such as the SRA. The tool identifies amplicon regions by aligning sequences to a reference genome and attempts to detect the presence of primers using a machine learning model. This is a subject with significant practical value, particularly for conducting large-scale meta-analyses. However, there are still many issues regarding methodological rigor, the depth of validation, and comparisons with existing tools that require further clarification by the authors.
Major Comments
- Concerns regarding the singularity of the reference sequence: The authors mention aligning sequences to a single Escherichia coli (J01859.1) reference genome to determine start and end positions. Is a single E. coli reference sufficient to cover Archaea or bacterial phyla that are distantly related to Proteobacteria, which may be present in environmental samples (e.g., soil, ocean)? For taxa with significant length variations or insertions/deletions (indels), could forced alignment to the E. coli reference lead to misjudgment of start/end positions? Have the authors evaluated the impact on accuracy if a more universal reference database (such as representative sequences from SILVA or Greengenes) were used?
- Rationality of the primer detection model (Random Forest based on quality scores): The authors developed a Random Forest model to predict primer presence by analyzing the quality score distribution of the first 1,000 reads. Primer detection is typically based on the sequence itself rather than quality scores. Can the authors explain why quality scores were chosen as features? Sequencing quality scores are influenced by technical factors such as sequencer status, reagent batches, and run cycles, which have no direct biological correlation with the presence of primers. Is there a risk that this model is "overfitting" specific sequencing platforms or datasets? Since the reads are already downloaded, why not directly use degenerate primer sequence matching (e.g., using Cutadapt or SeqKit logic) to determine primer presence? This seems to be a more direct and accurate method.
- Verification of accuracy claims: In the validation section, the authors claim to achieve 100% accuracy on certain datasets. In bioinformatics tool development, a claim of 100% accuracy is often a red flag. Have the authors manually checked those samples marked as "correct" by the model that might suffer from edge effects or borderline cases?
- Dataset imbalance in the Random Forest model: For the Random Forest model, the authors used 882 samples with primers and 8,940 samples without primers for training. Such an extremely imbalanced dataset, even with stratified sampling, may cause the model to be biased towards the majority class.
- Comparison with existing tools: The manuscript mentions that no tool has been designed for this specific purpose, but this may overlook some existing general-purpose tools or scripts. Many pipelines (such as certain plugins in QIIME 2, USEARCH, etc.) possess functionalities to identify primers or evaluate amplicon regions. The authors should discuss how their tool compares to these existing workflows.
Minor Comments
- Confusion regarding processing speed metrics: The abstract mentions a processing speed of "0.147 samples per minute", but later the text mentions "6.5 samples per minute" and "one sample every 0.147 minutes". The units and values in these three descriptions are inconsistent (is it samples per minute or minutes per sample?). Please unify and correct these figures.
- Usage of fastq-dump: The use of fastq-dump is mentioned. The SRA Toolkit's fastq-dump is relatively slow and has largely been superseded by fasterq-dump. Why did the authors not use the more efficient fasterq-dump?
- Definition of "standardized metadata": The term "standardized metadata" is used frequently. Please explicitly define in the text what constitutes "standard" metadata in the context of this tool.
- Robustness and error handling: The results section mentions that some samples failed due to "NCBI portal-related issues". Does this imply that the tool lacks resumable downloads or retry mechanisms? Given that network fluctuations are common during large-scale downloads, how is the tool's robustness demonstrated?
- Output confidence intervals: The output file contains "TRUE/FALSE" and a probability score. For samples where the probability score is near the decision threshold (e.g., around 0.5), does the tool provide an "uncertain" tag, or does it force a classification? It is suggested to add an indicator for ambiguous ranges.
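To illustrate the direct check raised in the major comment on primer detection, degenerate primer matching can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used by HVRLocator, Cutadapt, or SeqKit: the function names, the `max_offset` parameter, the example primer, and the example reads are all hypothetical, and matching is exact, whereas dedicated tools also tolerate mismatches.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a degenerate primer into a regex (exact match, no mismatches)."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def fraction_with_primer(reads, primer, max_offset=5):
    """Fraction of reads with the primer starting within max_offset bases of the 5' end."""
    pattern = primer_regex(primer)
    window = len(primer) + max_offset  # only inspect the start of each read
    hits = sum(1 for read in reads if pattern.search(read.upper()[:window]))
    return hits / len(reads) if reads else 0.0


# Hypothetical degenerate primer and reads, for illustration only.
example_primer = "GTGYCAGCMGCCGCGGTAA"
example_reads = [
    "GTGCCAGCAGCCGCGGTAATACG",    # primer at position 0
    "NNGTGTCAGCCGCCGCGGTAAGG",    # primer at position 2
    "TACGGAGGGTGCAAGCGTTAATCGG",  # no primer
]
example_fraction = fraction_with_primer(example_reads, example_primer)
```

A sample could then be flagged as containing primers when this fraction exceeds a chosen cutoff; the cutoff itself would need to be justified on real data.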
Reviewer 1:
The manuscript presents GEfetch2R, an R package (with a Docker image) that fetches scRNA-seq and bulk RNA-seq data from multiple repositories, loads the data into R objects, and benchmarks format-conversion tools. The problem addressed is real and important; the implementation appears practical and well documented. I see strong potential for adoption.
Major comments
1) Robust cross-repository support for .RData files: While GEfetch2R lists rdata among supported extensions for Zenodo and HCA, many GEO submissions and other archives still provide processed data exclusively as .RData, often bundling matrices and metadata in heterogeneous objects. Please add an explicit, repository-agnostic .RData ingestion path with: (i) automatic object introspection, (ii) standardized extraction of matrices/metadata, (iii) graceful fallbacks with clear diagnostics for non-standard objects, and (iv) reproducible examples. This materially increases real-world coverage.
2) Large-scale, automated evaluation on ~100 scRNA-seq datasets: Beyond the single COVID-19 application and the conversion benchmark, please include a systematic "fetch success-rate" study across ~100 GEO scRNA-seq datasets. Provide a Dockerized workflow (publicly available) that periodically attempts end-to-end retrieval (raw / count / processed) and reports success/failure rates stratified by repository and file type, with resource/time footprints and categorized failure causes. Given heterogeneous deposition practices, even ~50% overall success would be informative.
3) Another very important point is to provide a Dockerfile together with the Docker image.
Minor revisions
"altas" → atlas (COVID-19 section title/caption).
"Count maatrix" → Count matrix (Figure 3 caption/table column).
"PanglanDB" → PanglaoDB (tables).
Consistency: keep SeuratObject (not "Seurat object"); keep rds lowercase.
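The stratified report requested in the second major comment (success/failure rates by repository and file type, with categorized failure causes) could be tabulated along the following lines. This is only a sketch in Python: the record fields (`repository`, `file_type`, `ok`, `cause`), the function names, and the example records are hypothetical placeholders, not GEfetch2R output or measured results.

```python
from collections import Counter, defaultdict


def success_rates(attempts):
    """Success rate per (repository, file_type) stratum."""
    totals = defaultdict(lambda: [0, 0])  # stratum -> [successes, attempts]
    for attempt in attempts:
        stratum = (attempt["repository"], attempt["file_type"])
        totals[stratum][1] += 1
        if attempt["ok"]:
            totals[stratum][0] += 1
    return {stratum: ok / n for stratum, (ok, n) in totals.items()}


def failure_causes(attempts):
    """Count failed attempts by their recorded cause."""
    return Counter(a["cause"] for a in attempts if not a["ok"])


# Placeholder records for illustration only; these are not real results.
example_attempts = [
    {"repository": "GEO", "file_type": "count", "ok": True, "cause": None},
    {"repository": "GEO", "file_type": "count", "ok": False, "cause": "non-standard format"},
    {"repository": "SRA", "file_type": "raw", "ok": True, "cause": None},
    {"repository": "Zenodo", "file_type": "processed", "ok": False, "cause": "network error"},
]
example_rates = success_rates(example_attempts)
example_causes = failure_causes(example_attempts)
```

Logging one record per attempted retrieval keeps the periodic runs comparable over time and lets the failure categories be reported separately from the overall rate.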
