HVRLocator: a computationally efficient tool for identifying hypervariable regions in large 16S rRNA datasets

This article has been Reviewed by the following groups

Read the full article See related articles

Discuss this preprint

Start a discussion What are Sciety discussions?

Abstract

Background

Metabarcoding of the 16S rRNA gene is widely used to assess microbial diversity due to its cost-effectiveness and efficiency. However, publicly available 16S rRNA metabarcoding datasets often lack standardized metadata, particularly information on the sequenced hypervariable regions or primers used, which are critical to their accurate reuse. To address this, we present HVRLocator, a computational tool that (1) identifies the start and end positions of 16S rRNA amplicons, (2) determines their corresponding hypervariable regions, and (3) detects the presence of primer sequences. This tool was validated on four datasets comprising 41,513 samples generated with different primers and sequencing platforms.

Results

HVRLocator can process archived 16S rRNA sequences from NCBI SRA at an average rate of 6.5 samples per minute. Validation showed it reliably detects amplicon start and end positions across datasets sequenced with different primers and platforms, achieving 100% accuracy within single-platform studies and correctly revealing length heterogeneity across platforms. It also flagged misannotated metadata and problematic sequences, underscoring its value as a sequence data curation tool. Finally, HVRLocator can select comparable sequences to build large 16S rRNA amplicon databases spanning the same hypervariable region, facilitating cross-study comparisons.

Conclusion

HVRLocator overcomes unreliable metadata by accurately identifying 16S rRNA amplicon start and end positions, determining hypervariable regions, and detecting primer sequences, enabling accurate curation and large-scale processing of 16S rRNA data for reliable and reproducible microbial studies, syntheses, and meta-analyses.

Article activity feed

  1. AbstractBackground Amplicon sequencing of the 16S rRNA gene is widely used to assess microbial diversity due to its cost-effectiveness and efficiency. However, public 16S rRNA datasets often lack standardized metadata, particularly information on the sequenced hypervariable regions or primers used, which are critical for accurate analysis and data reuse. To address this, we present the HVRLocator, a computational tool that reliably identifies sequenced hypervariable regions, enhancing metadata quality and enabling more robust large-scale microbiome studies.Results The HVRLocator tool processed samples at an average rate of 0.147 per minute. Validation confirmed 100% accuracy in predicting alignment positions, correctly matching sequences to the expected primer regions based on literature. We demonstrated how to use the tool to select appropriate and comparable sequences for building a global bacterial database from V4 region amplicons of the 16S rRNA gene. Using HVRLocator, we selected 36,217 valid samples out of 45,882 runs, enabling us to identify cases where metadata incorrectly labeled sequences as targeting the V4 region.Conclusion Even when metadata is available, it can be inaccurate or misleading. HVRLocator offers a reliable and efficient method to identify the exact hypervariable sequenced region, ensuring accurate processing of large-scale 16S rRNA amplicon data. By bypassing inconsistent metadata and literature, it streamlines data curation and enhances the reliability of microbial studies, syntheses, and meta-analyses. Its use is essential for critically evaluating published data and enabling accurate and reproducible research in microbial ecology.

    This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag040), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 2:

    General Comments This manuscript introduces a tool named HVRLocator, designed to address the issue of missing or non-standard metadata in 16S rRNA sequencing data found in public databases such as the SRA. The tool identifies amplicon regions by aligning sequences to a reference genome and attempts to detect the presence of primers using a machine learning model. This is a subject with significant practical value, particularly for conducting large-scale meta-analyses. However, there are still many issues regarding methodological rigor, the depth of validation, and comparisons with existing tools that require further clarification by the authors. Major Comments

    1. Concerns regarding the singularity of the reference sequence The authors mention aligning sequences to a single Escherichia coli (J01859.1) reference genome to determine start and end positions. Is a single E. coli reference sufficient to cover Archaea or bacterial phyla that are distantly related to Proteobacteria, which may be present in environmental samples (e.g., soil, ocean)? For taxa with significant length variations or insertions/deletions (Indels), could forced alignment to the E. coli reference lead to misjudgment of start/end positions? Have the authors evaluated the impact on accuracy if a more universal reference database (such as representative sequences from SILVA or Greengenes) were used?
    2. Rationality of the primer detection model (Random Forest based on Quality Scores) The authors developed a Random Forest model to predict primer presence by analyzing the quality score distribution of the first 1,000 reads. Primer detection is typically based on the sequence itself rather than quality scores. Can the authors explain why quality scores were chosen as features? Sequencing quality scores are influenced by technical factors such as sequencer status, reagent batches, and run cycles, which have no direct biological correlation with the presence of primers. Is there a risk that this model is "overfitting" specific sequencing platforms or datasets? Since the reads are already downloaded, why not directly use degenerate primer sequence matching (e.g., using Cutadapt or SeqKit logic) to determine primer presence? This seems to be a more direct and accurate method.
    3. Verification of accuracy claims In the validation section, the authors claim to achieve 100% accuracy on certain datasets. In bioinformatics tool development, a claim of 100% accuracy is often a red flag. Have the authors manually checked those samples marked as "correct" by the model that might suffer from edge effects or borderline cases?
    4. Dataset imbalance in the Random Forest model For the Random Forest model, the authors used 882 samples with primers and 8,940 samples without primers for training. Such an extremely imbalanced dataset, even with stratified sampling, may cause the model to be biased towards the majority class.
    5. Comparison with existing tools The manuscript mentions that no tool has been designed for this specific purpose, but this may overlook some existing general-purpose tools or scripts. Many pipelines (such as certain plugins in QIIME 2, USEARCH, etc.) possess functionalities to identify primers or evaluate amplicon regions. The authors should discuss how their tool compares to these existing workflows. Minor Comments
    6. Confusion regarding processing speed metrics The abstract mentions a processing speed of "0.147 samples per minute", but later the text mentions "6.5 samples per minute" and "one sample every 0.147 minutes". There is confusion regarding units and values in these three descriptions (is it samples per minute or minutes per sample?). Please unify and correct these data to ensure consistency.
    7. Usage of fastq-dump The use of fastq-dump is mentioned. The SRA Toolkit's fastq-dump is relatively slow and has largely been superseded by fasterq-dump for efficiency. Why did the authors not use the more efficient fasterq-dump?
    8. Definition of "Standardized metadata" The term "standardized metadata" is used frequently. Please explicitly define what constitutes "standard" metadata in the context of this tool within the text.
    9. Robustness and error handling The results section mentions that some samples failed due to "NCBI portal-related issues". Does this imply the tool lacks breakpoint resumption or retry mechanisms? Given that network fluctuations are common during large-scale downloads, how is the tool's robustness demonstrated?
    10. Output confidence intervals The output file contains "TRUE/FALSE" and a probability score. For samples where the probability score is at a critical threshold (e.g., around 0.5), does the tool provide an "uncertain" tag, or does it force a classification? It is suggested to add an indicator for ambiguous ranges.
  2. AbstractBackground Amplicon sequencing of the 16S rRNA gene is widely used to assess microbial diversity due to its cost-effectiveness and efficiency. However, public 16S rRNA datasets often lack standardized metadata, particularly information on the sequenced hypervariable regions or primers used, which are critical for accurate analysis and data reuse. To address this, we present the HVRLocator, a computational tool that reliably identifies sequenced hypervariable regions, enhancing metadata quality and enabling more robust large-scale microbiome studies.Results The HVRLocator tool processed samples at an average rate of 0.147 per minute. Validation confirmed 100% accuracy in predicting alignment positions, correctly matching sequences to the expected primer regions based on literature. We demonstrated how to use the tool to select appropriate and comparable sequences for building a global bacterial database from V4 region amplicons of the 16S rRNA gene. Using HVRLocator, we selected 36,217 valid samples out of 45,882 runs, enabling us to identify cases where metadata incorrectly labeled sequences as targeting the V4 region.Conclusion Even when metadata is available, it can be inaccurate or misleading. HVRLocator offers a reliable and efficient method to identify the exact hypervariable sequenced region, ensuring accurate processing of large-scale 16S rRNA amplicon data. By bypassing inconsistent metadata and literature, it streamlines data curation and enhances the reliability of microbial studies, syntheses, and meta-analyses. Its use is essential for critically evaluating published data and enabling accurate and reproducible research in microbial ecology.

    This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag040), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 1:

    Metabarcoding data are accumulating rapidly. This paper makes a very valuable contribution to the automated extraction and curation of metabarcoding data and should be of great value in facilitating the re-use of existing data and the construction of custom databases based on these. I have not tested or tried to install the software myself, as the manuscript provided sufficient detail to enable me to assess the tool

    General comments

    The manuscript is written entirely in terms of "bacteria" and aligns amplicons to an E. coli model sequence. This is reasonable, but there should certainly be some acknowledgement of Archaea and ideally some mention of Eukaryotes too. These are probably things for the discussion section of this manuscript, but the authors may wish to consider whether a future version of the program could contain options to use model Archaea and Eukaryote sequences as alternatives to the E. coli model. It would also be helpful to assess how the program with its E. coli model deals with sequence data from Archaea, Eukaryotes (including mitochondria) and bacteria that are very divergent from E. coli. The methods section does not contain details of software used to generate the figures, or whether these figures are produced by "the pipeline" or by separate analysis of the .txt file that the pipeline produces. I suspect that it is that latter, in which case making the authors should make the scripts used available - as well as providing complete documentation of what has been done, this is likely to increase use made of the tool. And it would be helpful to include an output file in the supplementary materials

    Specific comments

    Line 64 "however the integration of these data in light of processing metadata" - not clear

    Line 67-8 "though bacterial diversity increases linearly with amplicon length". Needs re-wording. The number of ASVs will increase with amplicon length, but the actual bacterial diversity in a sample is constant.

    Line 79 "Wasimuddin and colleagues" should be "Wasimuddin et al". More generally, check that citations conform with journal house style

    Line 79-82 "For example, Wasimuddin and colleagues [8] found that compared to three other primer sets targeting different regions, the primer pair targeting the V4 hypervariable region of the 16S rRNA gene produced the highest estimates of species richness and diversity across various sample types" There are three issues here:

    1. different primer pairs vary in their coverage and bias, so different primers targeting the same variable region will produce different numbers of ASVs
    2. Even with complete coverage and the absence of bias, different variable regions will generate different numbers of ASVs as a result of differences in length and rate of evolution between variable regions (and differences in the number of ASVs that are clustered into OTUs at a particular sequence similarity threshold
    3. The relationship between ASVs or OTUs and "species" is not straightforward (Edgar, 2018). At minimum "species" should be replace with ASV or OTU (whichever Wasimuddin et al used)

    Line 89-90 "as bacterial diversity and taxonomic resolution linearly increase with target sequence length [12]." Overlaps with statement made in line 67-8, and the same issue applies here.

    Lines 167-170. The output file contains (amongst other things) "Predicted HV region Start/End: Predicted hypervariable (HV) region based on the median alignment start and end positions across all reads, inferred from literature on conserved and hypervariable regions of the 16S rRNA gene (Brosius et al., 1978; Yang et al., 2016)". This implies that the program predicts a single variable region for each study - I am not clear what this column will contain for amplicons that contain more than one variable region, although columns 11-19 indicate that the program identifies the presence/absence of each of the 9 HV regions. My guess is that the authors are using "HV region" in two different sense:

    1. Its usual meaning of one region out of V1 to V9
    2. The sequence from the beginning of the first of the nine variable regions the amplicon includes to the end of the last. It would also be helpful to indicate whether the sequence positions here are relative to the E coli model or refer to sequence positions in the amplicon

    Edgar, R. C. (2018). Updating the 97% identity threshold for 16S ribosomal RNA OTUs. Bioinformatics, 34(14), 2371-2375. doi:10.1093/bioinformatics/bty113