Vulture: cloud-enabled scalable mining of microbial reads in public scRNA-seq data

Junyi Chen
Danqing Yin
Harris Y H Wong
Xin Duan
Ken H O Yu
Joshua W K Ho

This article has been Reviewed by the following groups

Read the full article

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

Evaluated articles (GigaScience)

Abstract

The rapidly growing collection of public single-cell sequencing data has become a valuable resource for molecular, cellular, and microbial discovery. Previous studies mostly overlooked detecting pathogens in human single-cell sequencing data. Moreover, existing bioinformatics tools lack the scalability to deal with big public data. We introduce Vulture, a scalable cloud-based pipeline that performs microbial calling for single-cell RNA sequencing (scRNA-seq) data, enabling meta-analysis of host–microbial studies from the public domain. In our benchmarking experiments, Vulture is 66% to 88% faster than local tools (PathogenTrack and Venus) and 41% faster than the state-of-the-art cloud-based tool Cumulus, while achieving comparable microbial read identification. In terms of the cost on cloud computing systems, Vulture also shows a cost reduction of 83% ($12 vs. ${\$}$70). We applied Vulture to 2 coronavirus disease 2019, 3 hepatocellular carcinoma (HCC), and 2 gastric cancer human patient cohorts with public sequencing reads data from scRNA-seq experiments and discovered cell type–specific enrichment of severe acute respiratory syndrome coronavirus 2, hepatitis B virus (HBV), and Helicobacter pylori–positive cells, respectively. In the HCC analysis, all cohorts showed hepatocyte-only enrichment of HBV, with cell subtype-associated HBV enrichment based on inferred copy number variations. In summary, Vulture presents a scalable and economical framework to mine unknown host–microbial interactions from large-scale public scRNA-seq data. Vulture is available via an open-source license at https://github.com/holab-hku/Vulture.

GigaScience
Feb 11, 2024

The rapidly growing collection of public single-cell sequencing data have become a valuable resource for molecular, cellular and microbial discovery. Previous studies mostly overlooked detecting pathogens in human single-cell sequencing data. Moreover, existing bioinformatics tools lack the scalability to deal with big public data. We introduce Vulture, a scalable cloud-based pipeline that performs microbial calling for single-cell RNA sequencing (scRNA-seq) data, enabling meta-analysis of host-microbial studies from the public domain. In our scalability benchmarking experiments, Vulture can outperform the state-of-the-art cloud-based pipeline Cumulus with a 40% and 80% reduction of runtime and cost, respectively. Furthermore, Vulture is 2-10 times faster than PathogenTrack and Venus, while generating comparable results. We applied …

The rapidly growing collection of public single-cell sequencing data have become a valuable resource for molecular, cellular and microbial discovery. Previous studies mostly overlooked detecting pathogens in human single-cell sequencing data. Moreover, existing bioinformatics tools lack the scalability to deal with big public data. We introduce Vulture, a scalable cloud-based pipeline that performs microbial calling for single-cell RNA sequencing (scRNA-seq) data, enabling meta-analysis of host-microbial studies from the public domain. In our scalability benchmarking experiments, Vulture can outperform the state-of-the-art cloud-based pipeline Cumulus with a 40% and 80% reduction of runtime and cost, respectively. Furthermore, Vulture is 2-10 times faster than PathogenTrack and Venus, while generating comparable results. We applied Vulture to two COVID-19, three hepatocellular carcinoma (HCC), and two gastric cancer human patient cohorts with public sequencing reads data from scRNA-seq experiments and discovered cell-type specific enrichment of SARS-CoV2, hepatitis B virus (HBV), and H. pylori positive cells, respectively. In the HCC analysis, all cohorts showed hepatocyte-only enrichment of HBV, with cell subtype-associated HBV enrichment based on inferred copy number variations. In summary, Vulture presents a scalable and economical framework to mine unknown host-microbial interactions from large-scale public scRNA-seq data. Vulture is available via an open-source license at https://github.com/holab-hku/Vulture.

** Reviewer 1 Liuyang Zhao ** R1 version

The manuscript presented by the authors provides a useful tool on the microbiome, which named "Vulture: Cloud-enabled scalable mining of microbial reads in public scRNA-seq data", using a large and valuable dataset. The study is important in deepening our understanding of "microbiome in public data". However, the author comments not fully address my concerned, there are some issues for improvement in the manuscript. Here are the requirements for new software that is good enough to be published: 1. A docker provided is better, however, most used install method conda is still missing. 2. The more microbial detect example is missing. Can you provide an example of using like Kraken2 full NCBI database (RefSeq) to check all the microbial is more useful. 3. Author still not promotion his software in social media. If no more people take part in use it, how can we know it's useful? The reviewers still have may work to do. Not have enough time to test this software. Just promote it in twitter and Chinese WeChat will help software better. 4. The software name should be unique, which is convenient to count the real users through all available resources (such as QIIME, ImageGP, and EasyAmplicon). However, the name vulture is unacceptable, due to millions of hits in Google scholar. Must be no hit is a unique nameï¼ŒOK? Otherwise, hardly to know the read number of users. 5. The source code to support the generation of individual figures in this paper will be available on the GigaDB after being published. Where to check by the reviewers?

Read the original source
GigaScience
Feb 11, 2024

The rapidly growing collection of public single-cell sequencing data have become a valuable resource for molecular, cellular and microbial discovery. Previous studies mostly overlooked detecting pathogens in human single-cell sequencing data. Moreover, existing bioinformatics tools lack the scalability to deal with big public data. We introduce Vulture, a scalable cloud-based pipeline that performs microbial calling for single-cell RNA sequencing (scRNA-seq) data, enabling meta-analysis of host-microbial studies from the public domain. In our scalability benchmarking experiments, Vulture can outperform the state-of-the-art cloud-based pipeline Cumulus with a 40% and 80% reduction of runtime and cost, respectively. Furthermore, Vulture is 2-10 times faster than PathogenTrack and Venus, while generating comparable results. We applied …

The rapidly growing collection of public single-cell sequencing data have become a valuable resource for molecular, cellular and microbial discovery. Previous studies mostly overlooked detecting pathogens in human single-cell sequencing data. Moreover, existing bioinformatics tools lack the scalability to deal with big public data. We introduce Vulture, a scalable cloud-based pipeline that performs microbial calling for single-cell RNA sequencing (scRNA-seq) data, enabling meta-analysis of host-microbial studies from the public domain. In our scalability benchmarking experiments, Vulture can outperform the state-of-the-art cloud-based pipeline Cumulus with a 40% and 80% reduction of runtime and cost, respectively. Furthermore, Vulture is 2-10 times faster than PathogenTrack and Venus, while generating comparable results. We applied Vulture to two COVID-19, three hepatocellular carcinoma (HCC), and two gastric cancer human patient cohorts with public sequencing reads data from scRNA-seq experiments and discovered cell-type specific enrichment of SARS-CoV2, hepatitis B virus (HBV), and H. pylori positive cells, respectively. In the HCC analysis, all cohorts showed hepatocyte-only enrichment of HBV, with cell subtype-associated HBV enrichment based on inferred copy number variations. In summary, Vulture presents a scalable and economical framework to mine unknown host-microbial interactions from large-scale public scRNA-seq data. Vulture is available via an open-source license at https://github.com/holab-hku/Vulture.

** Reviewer 3 Liuyang Zhao ** Original submission

The authors aim to develop Cloud-enabled approaches for detecting viral reads in public single-cell RNA sequencing (scRNA-seq) data. This study makes a significant contribution to the identification of viruses and bacteria in public scRNA-seq data. Although the outcomes are satisfactory, the novelty of the proposed methods is limited. To date, no evidence has been provided to demonstrate their superiority over recently published methods (such as PathogenTrack and Venus, et al) when executed on a local machine. There are also several issues that need to be further addressed, as highlighted below: 1.The documentation available on the GitHub pipeline does not explain how to utilize the latest virus database or how to incorporate a user's custom database. Because the virus database is updated very quickly now. It might be more appropriate if the author updates the database promptly or if one can customize and create their own database. 2. Figure 2a only has an overall comparison graph, it can be improved by adding detailed comparison graphs with Cumulus, PathogenTrack and Venus. 3. Figure 2b. The persuasiveness is not enough, it would be better to compare several pipeline platforms with similar functionalities or compare some specific steps, such as the four steps in figure 2a. By the way, all of these comparisons use comparison software developed by other same researchers, so please provide a detailed description of why the author's method is faster? 4. Figure 3c can be created with microbial clustering and non-microbial clustering to highlight the impact of virus identification on classification results. 5. Fig. S1 It should be the "Quality control on read level".

Read the original source
GigaScience
Feb 11, 2024

The rapidly growing collection of public single-cell sequencing data have become a valuable resource for molecular, cellular and microbial discovery. Previous studies mostly overlooked detecting pathogens in human single-cell sequencing data. Moreover, existing bioinformatics tools lack the scalability to deal with big public data. We introduce Vulture, a scalable cloud-based pipeline that performs microbial calling for single-cell RNA sequencing (scRNA-seq) data, enabling meta-analysis of host-microbial studies from the public domain. In our scalability benchmarking experiments, Vulture can outperform the state-of-the-art cloud-based pipeline Cumulus with a 40% and 80% reduction of runtime and cost, respectively. Furthermore, Vulture is 2-10 times faster than PathogenTrack and Venus, while generating comparable results. We applied …

The rapidly growing collection of public single-cell sequencing data have become a valuable resource for molecular, cellular and microbial discovery. Previous studies mostly overlooked detecting pathogens in human single-cell sequencing data. Moreover, existing bioinformatics tools lack the scalability to deal with big public data. We introduce Vulture, a scalable cloud-based pipeline that performs microbial calling for single-cell RNA sequencing (scRNA-seq) data, enabling meta-analysis of host-microbial studies from the public domain. In our scalability benchmarking experiments, Vulture can outperform the state-of-the-art cloud-based pipeline Cumulus with a 40% and 80% reduction of runtime and cost, respectively. Furthermore, Vulture is 2-10 times faster than PathogenTrack and Venus, while generating comparable results. We applied Vulture to two COVID-19, three hepatocellular carcinoma (HCC), and two gastric cancer human patient cohorts with public sequencing reads data from scRNA-seq experiments and discovered cell-type specific enrichment of SARS-CoV2, hepatitis B virus (HBV), and H. pylori positive cells, respectively. In the HCC analysis, all cohorts showed hepatocyte-only enrichment of HBV, with cell subtype-associated HBV enrichment based on inferred copy number variations. In summary, Vulture presents a scalable and economical framework to mine unknown host-microbial interactions from large-scale public scRNA-seq data. Vulture is available via an open-source license at https://github.com/holab-hku/Vulture.

** Reviewer 2 Jingzhe Jiang ** Original submission

In this study, Chen et al. introduce Vulture, a scalable cloud-based pipeline that performs microbial calling for single-cell RNA sequencing (scRNA-seq) data, enabling meta-analysis of host-microbial studies from the public domain. And they further applied Vulture to COVID-19, HCC, and gastric cancer human patient cohorts with public sequencing reads dataand discovered cell-type specific enrichment of SARSCoV2, hepatitis B virus (HBV), and H. pylori positive cells. Generally speaking, this study is innovative, has good application potential, and can better assist the work of single cell research from the point of view of infection. I only a few minor questions that need the author to reply: 1. Background: The first appearance of H. pylori should be replaced with its full name. 2. Methods-Downstream analysis of scRNA-seq samples: Why use different tools (SCANPY/Seurat, BBKNN/Harmony) to analyze different datasets instead of using the same tool to analyze different datasets? 3. Cell-type enrichment of microbial UMI: format error of formula. 4. Analyses-Page 11: "The statistical test identified that SARS-CoV-2 is enriched (p-value < 0.05) in epithelial cells, neutrophils, and plasma B cells (Fig. 3d and Table. 2)". It is best to highlight p < 0.05 data points in other colors rather than red squares. Why are there no p < 0.05 square in fig. 3e? 5. Fig. 2a and 2b: There are 8 colors in figure 2a, however only 4 figure legend were showed. What do the four light-colored bar mean? And the same to Fig 2b.

Read the original source
GigaScience
Feb 11, 2024

The rapidly growing collection of public single-cell sequencing data have become a valuable resource for molecular, cellular and microbial discovery. Previous studies mostly overlooked detecting pathogens in human single-cell sequencing data. Moreover, existing bioinformatics tools lack the scalability to deal with big public data. We introduce Vulture, a scalable cloud-based pipeline that performs microbial calling for single-cell RNA sequencing (scRNA-seq) data, enabling meta-analysis of host-microbial studies from the public domain. In our scalability benchmarking experiments, Vulture can outperform the state-of-the-art cloud-based pipeline Cumulus with a 40% and 80% reduction of runtime and cost, respectively. Furthermore, Vulture is 2-10 times faster than PathogenTrack and Venus, while generating comparable results. We applied …

The rapidly growing collection of public single-cell sequencing data have become a valuable resource for molecular, cellular and microbial discovery. Previous studies mostly overlooked detecting pathogens in human single-cell sequencing data. Moreover, existing bioinformatics tools lack the scalability to deal with big public data. We introduce Vulture, a scalable cloud-based pipeline that performs microbial calling for single-cell RNA sequencing (scRNA-seq) data, enabling meta-analysis of host-microbial studies from the public domain. In our scalability benchmarking experiments, Vulture can outperform the state-of-the-art cloud-based pipeline Cumulus with a 40% and 80% reduction of runtime and cost, respectively. Furthermore, Vulture is 2-10 times faster than PathogenTrack and Venus, while generating comparable results. We applied Vulture to two COVID-19, three hepatocellular carcinoma (HCC), and two gastric cancer human patient cohorts with public sequencing reads data from scRNA-seq experiments and discovered cell-type specific enrichment of SARS-CoV2, hepatitis B virus (HBV), and H. pylori positive cells, respectively. In the HCC analysis, all cohorts showed hepatocyte-only enrichment of HBV, with cell subtype-associated HBV enrichment based on inferred copy number variations. In summary, Vulture presents a scalable and economical framework to mine unknown host-microbial interactions from large-scale public scRNA-seq data. Vulture is available via an open-source license at https://github.com/holab-hku/Vulture.

This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad117), and has published the reviews under the same license. These are as follows.

** Reviewer 1 Yongxin Liu** Original submission

The manuscript presented by the authors provides a useful tool on the virome, which named "Vulture: Cloud-enabled scalable mining of viral reads in public scRNA-seq data", using a large and valuable dataset. The study is important in deepening our understanding of "virome in public data". However, there are some issues for improvement in the manuscript. Here are the requirements for new software that is good enough to be published: Major comments: 1. The software, tested data and results are required to be uploaded on GitHub for peers to use, and conda and/or docker installation modes are recommended for software with complex dependencies. We will take software Star, Fork, and downloads of GitHub as one of the audience indicators. I found the GitHub links: https://github.com/holab-hku/Vulture. However, the readme.md show pipeline on AWS cloud. If I not have an AWS, how can I run it in my server. Now this project is only 2 stars. You need more people to take part in and interest in this project. 2. Software installation and User tutorial are required in Readme.md or Wiki in GitHub. Please provide step by step protocol to deploy it in the laptop or server. 3. A video of software download, installation, operation, and result display is required with a computer or server without any related software installed, to make sure that any new user can perform the whole process according to the tutorial. 4. The software is required to be posted on twitter and other social media, you can contact @ iMetaScience, @microbe_article etc. to get help in tweet or retweet. The number of Retweet, Like and View as one of the audience indicators. 5. Chinese is largest single langue science society. Provide the Chinese tutorial and video presentation of the software, contact meta-genome Official account for help to promote. The Number of readers, share and favorite also one of the audience indicators. 6. According to the feedback from users in all over the world, the author continuously maintains and optimizes the method to ensure its availability, ease of use and advancement. 7. The software name should be unique, which is convenient to count the real users through all available resources (such as QIIME, ImageGP, and EasyAmplicon). However, the name vulture is unacceptable, due to million of hits in Google scholar. 8. The figures in your papers are diversity. However, I cannot find enough visualization function in your pipeline. The pipeline for integrated software is easy, the specific and diversity visualization plan is difficult. All the authors want their analysis result is ready-to-published. 9. Why only focus on the virus? Can this pipeline to generated all the microbiome, which is more interest and overview of the microbes.

Read the original source
Version published to 10.1093/gigascience/giad117
Jan 1, 2024
Version published to 10.1101/2023.02.13.528411 on bioRxiv
Feb 15, 2023

This article has been Reviewed by the following groups

Discuss this preprint

Listed in

Abstract

Article activity feed