LRTK: A platform agnostic toolkit for linked-read analysis of both human genomes and metagenomes


Abstract

Linked-read sequencing technologies generate high base quality reads that contain extrapolative information on long-range DNA connectedness. These advantages of linked-read technologies are well known and have been demonstrated in many human genomic and metagenomic studies. However, existing linked-read analysis pipelines (e.g., Long Ranger) were primarily developed to process sequencing data from the human genome and are not suited for analyzing metagenomic sequencing data. Moreover, linked-read analysis pipelines are typically limited to one specific sequencing platform. To address these limitations, we present the Linked-Read ToolKit (LRTK), a unified and versatile toolkit for platform-agnostic processing of linked-read sequencing data from both human genomes and metagenomes. LRTK provides functions to perform linked-read simulation, barcode error correction, read cloud assembly, barcode-aware read alignment, reconstruction of long DNA fragments, taxonomic classification and quantification, as well as barcode-assisted genomic variant calling and phasing. LRTK has the ability to process multiple samples automatically, and provides the user with the option to generate reproducible reports during processing of raw sequencing data and at multiple checkpoints throughout downstream analysis. We applied LRTK to two benchmarking and three real linked-read datasets from both the human genome and metagenome. We showcase LRTK's ability to generate comparative performance results from the preceding benchmark study and to report these results as publication-ready plots in HTML documents. LRTK provides comprehensive and flexible modules along with an easy-to-use Python-based workflow for processing linked-read sequencing datasets, thereby filling the current gap in the field caused by platform-centric, genome-specific linked-read data analysis tools.

Article activity feed

  1. Reviewer 3: Dmitrii Meleshko

    The paper titled "LRTK: A Platform-Agnostic Toolkit for Linked-Read Analysis of Both Human Genomes and Metagenomes" by Yang et al. is dedicated to the development of a unified interface for linked-read data processing. The problem described in the paper indeed exists; each linked-read technology requires complex preprocessing steps that are not straightforward or efficient. The idea of consolidating multiple tools in one place, with some of them modified to handle multiple data types, is commendable. Overall, I am supportive of this paper. My main concern, however, is that the impact of linked-read applications in the paper appears to be exaggerated, and the authors need to provide more context in their presentation. Also, some parts of the paper are vaguely described. I will elaborate on my concerns in more detail below.

    x) "Linked-read sequencing generates reads with high base quality and extrapolative information on long-range DNA connectedness, which has led to significant advancements in human genome and metagenome research [1-3]." Citations 1-3 do not really tell about advancements in human genome and metagenome research; these are technology papers. A similar problem can be found in the "Despite the limitations that genome specificity…" paragraph. The authors cited and described several algorithms that are not really genomic studies. E.g., "stLFR[2] has found application in a customized pipeline that has been developed to first convert its raw reads into a 10x-compatible format, after which Long Ranger is applied for downstream analysis." is not an example of a genomic study, but a pipeline description.

    x) Table S1 does not improve the paper; I would say it does completely the opposite. Long Ranger is not a toolkit; it should be considered a read alignment tool that outputs some SVs and haplotypes along the way. So the Long Ranger vs. LRTK comparison does not make sense to me. There are other tools that solve the metagenome assembly problem, the human assembly problem, call certain classes of SVs, etc.

    x) I think incorporating Long Ranger is important, since its performance is reported to be better than EMA for human samples and it is also more popular than EMA. Is it possible, and have you tried doing it?

    x) I would remove exaggerations such as "myriad" from the text. The scope of linked-reads is pretty limited nowadays. I agree that linked-reads might be useful in metagenomics/transcriptomics and the other scenarios mentioned in the text, but the number of studies is very limited, especially nowadays, and was not really big even when the 10x platform was on the rise.

    x) "LRTK reconstructs long DNA fragments" - when people talk about long fragment reconstruction, they usually mean Moleculo-style reconstruction through assembly. This reconstruction resembles "barcode deconvolution", described in Danko et al. and Mak et al., so I would stick to that terminology.

    x) It is important to note that Aquila, LinkedSV and VALOR2 are linked-read-specific tools, while FreeBayes, SAMtools and GATK are short-read tools. Also, provide the target SV length for both groups of tools.

    x) There are some minor problems with the GitHub README, e.g. "*parameters". Also, I don't understand how to use the conversion in real life. E.g., 10x Genomics data often comes as a folder with multiple gzipped R1/R2/I1 files. I don't understand how I would use it in that case.

    x) Please cite or explain why this is happening (not only when): "A known concern with stLFR linked-read sequencing is the loss of barcode specificity during analysis."

    x) I don't understand what "Length-weighted average (μFL) and unweighted average (WμFL) of DNA fragment lengths" from the figure means. One of them is just an average, but what is the second? The figure looks confusing. (See the sketch after this review for the standard definitions.)

    x) "LRTK supports reconstruction of long DNA fragments" - this section describes something else, more about statistics and data QC.

    x) "LRTK promotes metagenome assembly using barcode specificity" - please remove Supernova; it was never a metagenomic assembler. Check cloudSPAdes instead.

    x) "The superior assembly performance we have observed" - superior compared to what? If so, some short-read benchmark should be included.

    x) "LRTK improves human genome variant phasing using long range information" - What dataset is this? What callset was used for ground truth? Briefly describe how the comparisons were done.

    x) Figures 5F-G together are very confusing. First, I don't expect tools like LinkedSV to have high recall (around 1.0) and low precision. Also, figure G is kind of a subset of figure F, but the results are completely different. Also, use explicit notation; e.g., 50-1kbp and 1-10kbp mean completely different things.

    x) "We curated one benchmarking dataset and two real datasets to demonstrate the performance of LRTK" - what do you mean by "curation" here?

    x) Why don't you use the Tell-Seq barcode whitelist mentioned here: https://sagescience.com/wp-content/uploads/2020/10/TELL-Seq-Software-Roadmap-User-Guide-2.pdf

    x) The tiered alignment approach is vaguely introduced. It is not clear what "n% most closely covered windows" means, or how a subset of reference genomes is selected for the second phase.
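    A note on the μFL/WμFL point above: under the standard definitions, the unweighted average treats every reconstructed fragment equally, while the length-weighted average weights each fragment by its own length, so longer fragments dominate it. A minimal Python sketch of these two definitions, assuming only a list of fragment lengths (illustrative code, not LRTK's):

        # Illustrative only (not LRTK's code): unweighted vs. length-weighted
        # mean of reconstructed DNA fragment lengths.
        from typing import Sequence

        def unweighted_mean(lengths: Sequence[float]) -> float:
            """Every fragment counts equally."""
            return sum(lengths) / len(lengths)

        def length_weighted_mean(lengths: Sequence[float]) -> float:
            """Weight each fragment by its own length: the expected length
            of the fragment containing a randomly chosen base."""
            return sum(l * l for l in lengths) / sum(lengths)

        fragments = [10_000, 20_000, 100_000]   # fragment lengths in bp
        print(unweighted_mean(fragments))       # ~43,333 bp
        print(length_weighted_mean(fragments))  # ~80,769 bp

    The two statistics diverge exactly when the length distribution is skewed, which is why a figure reporting both can look confusing at first glance.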

  2. Reviewer 2: Lauren Mak

    Summary: This manuscript describes the need for a generalized linked-read (LR) analysis package and showcases the package the authors developed to address this need. Overall, the workflow is well-designed, but there are major gaps in the benchmarking, analysis, and documentation process that need to be addressed before publication.

    Documentation:

    - The purpose of multiple tool options: While the analysis package is technically sound, one major aspect is left unexplained: why are there so many algorithm options included without guidance as to which one to use? There are clearly performance differences between algorithms (combinations of 2+ are not considered either) on different types of LR sequence.

    - Provenance of ATCC-MSA-1003: Nowhere in the manuscript is the biological and technical composition of the metagenomics control described. It would be helpful to mention that this is specifically a mock gut microbiome sample, as well as the relative abundances of the originating species and the absolute amounts of genetic material per species (e.g., as measured by genomic coverage) in the actual dataset. As a corollary, there should be standard deviations in any figures that display a summary statistic (e.g., Figure 3A: precision, recall, etc.) that appears to be averaged across the species in a sample. This includes Figure 3A and Figure 4A.

    - Dataset details: There is no table indicating the number of reads for each dataset, which would be helpful in interpreting Figures 3 and 4.

    - Open source?: There was no GitHub link provided, only a link to the Conda landing page. Are there thorough instructions provided for the package's installation, input, output, and environment management?

    Benchmarking:

    - The lack of simulated tests: The above concern (expected performance on idealized datasets) is best addressed with simulated data, which was not done despite the fact that LRSim exists (and apparently the authors have previously written a tool for stLFR as well).

    - Indels: What are the sizes of the indels detected? Why were newer tools, such as PopIns2, Pamir, or Novel-X, not tried as well?

    Analysis:

    - Lines 166-169: Figure 1, panel A1 vs. B1: why do the distributions of estimated fragment sizes from the 10x datasets look so different in metagenomic vs. human samples, when there is reasonable consistency in the TELL-Seq and stLFR datasets?

    - Lines 182-184: Figure 3A: why is LRTK's taxonomic classification quality generally lower than that of the other tools? At least in terms of recall, it should perform better, as mapping reads to reference genomes should have a lower false negative rate than k-mer-based tools. Also, what is the threshold for detecting a taxon? Is it just any number of reads, or is there a minimum bound?

    - Lines 187-188: Figure 3B: at least 15% of each caller's set of variants is unique to that caller, while a maximum of 50% is universal. I would not interpret that as consistency.

    - Lines 192-193: Are you referring to allelic imbalance as it is popularly used, i.e., expression variation between the two haplotypes of a diploid organism? This clearly doesn't apply in the case of bacteria. If this is not what you're referring to, please define and/or cite the applicable definition.

    - Lines 201-208: It's odd that despite the 10x datasets having the largest estimated fragment size, they have some of the smallest genome fractions, NGA50, and NA50. Why is this? Are they just smaller datasets, on average?

    Miscellaneous:

    - UHGG: Please mention the fact that the UHGG is the default database, as well as whether or not the user will be able to supply their own databases.

    - Line 363: What does {M} refer to?

    - Line 369: What does U mean here? Is this the number of uniquely aligned reads in one of the N windows that a multi-aligned read aligns to? (A toy sketch of this reading follows this review.)

    - Lines 371-372: What does "n% most closely covered windows" refer to?

    - Lines 399-405: How are SNVs chosen for MAI analysis from the three available SNV callers?

    - Lines 653-656: Which dataset was used for quality evaluation?

    - Line 665: What do the abbreviations BAF and T stand for?
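    On the "Line 369" question above, a toy Python sketch of the interpretation the reviewer proposes: assign each multi-aligned read to whichever of its candidate windows contains the most uniquely aligned reads. This only illustrates the reviewer's reading, not LRTK's documented algorithm; all names and data below are hypothetical:

        from collections import Counter

        # Hypothetical toy data: uniquely aligned reads as (read, window)
        # pairs, multi-aligned reads as (read, candidate windows).
        uniquely_aligned = [("r1", "w1"), ("r2", "w1"), ("r3", "w2")]
        multi_aligned = [("r4", ["w1", "w2"]), ("r5", ["w2", "w3"])]

        # U per window: the number of uniquely aligned reads it contains.
        U = Counter(window for _, window in uniquely_aligned)

        # Assign each multi-aligned read to its best-supported window.
        assignment = {read: max(windows, key=lambda w: U[w])
                      for read, windows in multi_aligned}
        print(assignment)  # {'r4': 'w1', 'r5': 'w2'}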

  3. Reviewer 1: Brock Peters

    Yang et al. describe a package of tools, LRTK, for cobarcoded reads (linked reads) agnostic of library preparation methods and sequencing platforms. In general, it appears to be a very useful tool. I have a few concerns with the manuscript as it is currently written:

    1. Line 203: "With Pangaea, LRTK achieves NA50 values of 1.8 Mb and 1.2 Mb for stLFR and TELL-Seq sequencing data, respectively. On 10x Genomics sequencing data, Athena exhibited superior assembly performance, with a NGA50 of 245 Kb." This is a bit of an awkward pair of sentences, as you are comparing NA50 values for stLFR and TELL-Seq and then NGA50 for 10x Genomics, and it makes it sound like 10x Genomics performed the best. Also, these numbers don't seem to agree with the figure. (A brief sketch of the N50 family of metrics follows this review.)

    2. How long does an average run take to process, say, a 35X human genome coverage sample? Are there requirements for memory? A figure and metrics around this sort of thing would be helpful.

    3. How much data was used per library? What was the total coverage? Was the data normalized to have the same coverage per library? If not, it's very difficult to make fair comparisons between the different technologies.

    4. There's a section on reconstruction of long fragments, but then there really isn't any evaluation of this result, and it's not clear if these fragments are even used for anything. For all of these sequencing types, I would assume that you can't really do much in the way of seed extension, since the coverage across long fragments for these methods is much less than 1X. I think this needs to be developed a little more, or it needs to be explained how these are used in your process, or you just need to say you didn't use them for anything but here are some potential applications they could be used for. What type of file is output from this process? I think it's interesting, but it's just not clear how to use this data.

    5. I did try to install the software using Conda, but it failed, and it's not clear to me why. Perhaps it's something about my environment, but you might want to have some colleagues located at different institutions try to install it to make sure it is easy to do so.
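    On the NA50/NGA50 point in item 1 above: both are QUAST-style, reference-corrected variants of N50 (NA50 is N50 computed over aligned blocks after contigs are broken at misassemblies; NGA50 additionally uses the reference genome length rather than the total assembly length as the denominator). A minimal Python sketch of the underlying N50 computation, illustrative only:

        # Illustrative only: plain N50 over contig lengths. NA50/NGA50 apply
        # the same idea to aligned blocks (and, for NGA50, the reference
        # genome length), which is why the two are not directly comparable.
        def n50(lengths):
            """Largest L such that contigs >= L cover at least half the total."""
            total = sum(lengths)
            running = 0
            for length in sorted(lengths, reverse=True):
                running += length
                if 2 * running >= total:
                    return length

        print(n50([100, 200, 300, 400]))  # 300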