High-resolution species assignment of Anopheles mosquitoes using k-mer distances on targeted sequences

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    Boddé et al. propose a new approach for species identification in the genus Anopheles. The approach uses an amplicon panel, a k-mer-based similarity metric, and a variational autoencoder to minimize issues of sequence alignment between divergent lineages. The authors provide strong evidence that their approach works well for most samples. The work will be of potential interest to practitioners in the field of parasite-carrying mosquitoes.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. The reviewers remained anonymous to the authors.)

Abstract

The ANOSPP amplicon panel is a genus-wide targeted sequencing panel to facilitate large-scale monitoring of Anopheles species diversity. Combining information from the 62 nuclear amplicons present in the ANOSPP panel allows for a more sensitive and specific species assignment than single-gene (e.g. COI) barcoding, which is desirable in the light of permeable species boundaries. Here, we present NNoVAE, a method using Nearest Neighbours (NN) and Variational Autoencoders (VAE), which we apply to k-mers resulting from the ANOSPP amplicon sequences in order to hierarchically assign species identity. The NN step assigns a sample to a species-group by comparing the k-mers arising from each haplotype’s amplicon sequence to a reference database. The VAE step is required to distinguish between closely related species, and also has sufficient resolution to reveal population structure within species. In tests on independent samples with over 80% amplicon coverage, NNoVAE correctly classifies to species level 98% of samples within the An. gambiae complex and 89% of samples outside the complex. We apply NNoVAE to over two thousand new samples from Burkina Faso and Gabon, identifying unexpected species in Gabon. NNoVAE presents an approach that may be of value to other targeted sequencing panels, and is a method that will be used to survey Anopheles species diversity and Plasmodium transmission patterns through space and time on a large scale, with plans to analyse half a million mosquitoes in the next five years.
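
To make the idea of comparing k-mers between sequences concrete, the following is a minimal sketch of one alignment-free comparison. It is not taken from the NNoVAE code base: the function names, the default k = 8 and the Bray-Curtis-style distance are all assumptions chosen for illustration, and the published method may define its distance differently.

```python
from collections import Counter


def kmer_counts(seq: str, k: int = 8) -> Counter:
    """Count every overlapping k-mer in a sequence (k = 8 is an assumed default)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a: str, b: str, k: int = 8) -> float:
    """Illustrative distance between two k-mer count profiles.

    Returns 0.0 for identical profiles and 1.0 when no k-mer is shared.
    This Bray-Curtis-style formula is an assumption, not the paper's metric.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum((ca & cb).values())  # multiset intersection keeps the minimum count
    total = sum(ca.values()) + sum(cb.values())
    if total == 0:  # both sequences shorter than k
        return 0.0
    return 1.0 - 2.0 * shared / total


if __name__ == "__main__":
    # Two toy haplotypes differing at the last base.
    print(kmer_distance("ACGTA", "ACGTT", k=4))
```

Because no alignment is needed, a comparison of this kind can be applied to amplicons from highly diverged species without choosing a reference sequence.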

Article activity feed

  1. Author Response

    Reviewer #1 (Public Review):

    Anopheles is an important disease vector and the efforts to characterize the extent of genetic variation in the system are welcome. In this piece, the authors propose a Variational Autoencoders method to assign species boundaries in a large sample of Anopheles mosquitoes using a panel of 62 nuclear amplicons. Overall, the method performs well as it can assign samples to an acceptable granularity. The main advantage of the method is that it takes reduced representation genome sampling which should cut costs in genotyping. The authors do not compare the effectiveness of their amplicon panel with other approaches to do reduced representation sequencing, or the computational method with other previously published methods. Additionally, the manuscript does not clearly state what is the importance of species assignments and the findings/method are -by definition- limited to a single biological system.

    It is important to draw the reviewer’s attention to the fact that this is a two-part approach – the reviewer seems to have overlooked the Nearest Neighbour component of the work. The approach is not solely a VAE – the VAE only comes into play at the species complex level. The higher-level assignments are done using NN approaches.

    The manuscript has three main limitations. First, there is no explicit test of the performance of ANOSPP compared to other methods of low-dimensional sampling. While the authors state that the ANOSPP panel will lead to genotyping for low cost (justifiably so), there is no direct comparison to other low-representation methods (e.g., RAD-Seq, MSG).

    The key advantage of ANOSPP is that it works on the entire Anopheles genus, while the other suggested sequencing methods are more applicable to a group of specimens of the same or closely related species. The purpose of the panel is to do species identification for the whole genus, so it really is an alternative to the current methods of species identification, which commonly consist of morphological identification of the species complex, followed by complex-specific PCR amplification of a single species-diagnostic locus. The only other species identification method for Anopheles that we are aware of that is not limited to a single species complex is a mass spectrometry approach (Nabet et al. Malar J, 2021); however, they only investigate three different species and reach a classification accuracy of at most 67.5%.

    The main advantage of ANOSPP over other reduced representation sequencing methods, like MSG and RAD-Seq, is that it is specifically designed to work for the entire Anopheles genus to support genus-wide species identification. In a genus comprising an estimated 100 million years of divergence, a sequencing approach that relies on restriction enzymes is likely to introduce a lot of variability in which parts of the genome are sequenced for different species. Moreover, both MSG and RAD-Seq typically map the reads to a reference genome; any choice of reference genome will likely introduce considerable bias when dealing with such diverged species. In general, the sequence data generated by those sequencing methods require more complicated and labour-intensive processing. Lastly, the costs per sample for library preparation and sequencing are substantially lower with ANOSPP than with MSG and RAD-Seq: library prep costs <1 USD (ANOSPP) versus 5 USD (RAD-Seq) (Meek and Larson, Mol Ecol Resour, 2019), and a single run accommodates 768 samples (ANOSPP), 384 samples (MSG; Andolfatto et al, Genome Res., 2011) or 96 samples (RAD-Seq; Meek and Larson, Mol Ecol Resour, 2019).

    Second, and in a related vein, the authors present NNoVAE as a novel solution to determine species boundaries in Anopheles. Perusing the very references the authors cite, it is clear that VAEs have been used before to delimit species boundaries, which diminishes the novelty of the approach on its own.

    The VAE is only a part of the method presented in this manuscript. We believe a substantial amount of the value of NNoVAE lies in its ability to perform assignments for the entire Anopheles genus comprising over 100 MY of divergence - the closest analogous approach would be COI or ITS2 DNA barcoding, neither of which is robust for species complexes. Using NNoVAE, samples are first assigned to their relevant groups, and in many cases to their species, by the Nearest Neighbour method. Only those samples that are identified by the Nearest Neighbour method as members of the An. gambiae complex and cannot be unambiguously assigned to a single species, are passed through the VAE assignment method.
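
    The two-stage routing described in this response can be sketched as follows. This is an illustrative outline only, not the authors' implementation: the names `assign_species`, `nn_assign`, `vae_assign` and the group label are hypothetical, and the two classifiers are passed in as plain callables so that the control flow stands on its own.

```python
from typing import Callable, Optional, Tuple

# Hypothetical label for the species complex that triggers the VAE step.
GAMBIAE_COMPLEX = "Anopheles_gambiae_complex"

# Assumed interface: the NN step returns (group, species), with species
# set to None when no unambiguous species-level call can be made.
NNResult = Tuple[str, Optional[str]]


def assign_species(
    sample: object,
    nn_assign: Callable[[object], NNResult],
    vae_assign: Callable[[object], str],
) -> str:
    """Route a sample through the NN step and, only if needed, the VAE step."""
    group, species = nn_assign(sample)
    if species is not None:
        return species  # NN alone resolved the species
    if group == GAMBIAE_COMPLEX:
        return vae_assign(sample)  # ambiguous within the complex: use the VAE
    return group  # outside the complex: report the coarser group


if __name__ == "__main__":
    # Stub classifiers standing in for the real NN and VAE steps.
    print(assign_species("s1", lambda s: (GAMBIAE_COMPLEX, None), lambda s: "An. coluzzii"))
```

    The point of the sketch is the order of the checks: the VAE is consulted only for samples that the NN step places in the An. gambiae complex without a single species call.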

    Indeed, in (Derkarabetian et al, Mol Phylogenet Evol, 2019) VAEs are used to delimit species boundaries in an arachnid genus. However, this study works with ultraconserved elements, incorporating a total of 76 kb of sequence, which is much more data than the approximately 10 kb we get for all amplicons combined. Moreover, a crucial difference is that the referenced work uses SNP calls, based on alignment to one of their sequenced samples, as input for the VAE, whereas our VAE takes k-mer based inputs. This is also an important consideration in working with a large number of highly diverged species.

    Perhaps more importantly, the manuscript does not present a comparison with other methods of species delimitation (SPEDEStem, UML -this approach is cited in the paper though-), or even of assessment of population differentiation, such as STRUCTURE, ADMIXTURE, or ASTRAL concordance factors (to mention a few among many). The absence of this comparative framework makes it unclear how this method compares to other tools already available.

    NNoVAE is primarily a method for species assignment rather than for species delimitation. SPEDEStem addresses the question whether different groups of samples are separate species or not; different groups can be defined by e.g. described races, described subspecies, different morphotypes or different collection locations. The aim of ANOSPP and NNoVAE is to remove the necessity of any prior sorting of samples into groups – all that needs to be known is that the sample is an Anopheline. This avoids the issues associated with morphological identification and single marker molecular barcodes. So to perform species assignment with SPEDEStem, we’d have to run many replicates, each time asking whether a single sample is of the same species as one of the species represented in our reference database. For example, for the 2218 samples presented in the case studies, we would have to run SPEDEStem more than 130,000 times, to check for each of these samples whether they are any of the 62 species represented in the reference dataset NNv1.
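
    The figure of more than 130,000 runs follows from one run per sample per reference species. A quick check of that arithmetic, assuming exactly one SPEDEStem run for each sample-species pair (variable names are illustrative):

```python
n_samples = 2218           # case-study samples, as stated in the response
n_reference_species = 62   # species in the reference dataset NNv1, as stated

# One pairwise question per sample and reference species.
n_runs = n_samples * n_reference_species
print(n_runs)  # 137516
```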

    However, we agree that it would be good to check that the species-groups in the reference database, NNv1, are indeed supported as separate species. We attempted to run SPEDEStem, but the web version no longer exists, and we were not able to install the command line application, which runs on Python 2. Moreover, the example files provided in the tutorial are not complete. Therefore, we were unable to carry out even this basic comparison.

    UML (unsupervised machine learning) approaches comprise quite a wide range of methods, including VAE. We have conducted a comparison between the VAE assignments and assignments based on UMAP; for the discussion, see below, page 20 in the manuscript, and the newly added supplementary information section 4.

    As requested by the reviewer, we have compared our assignment approach to ADMIXTURE on the Anopheles gambiae complex training set (see Supplementary information section 5). It is a good sanity check to compare the structure revealed by ADMIXTURE to the structure revealed by the VAE. We found that ADMIXTURE does not satisfactorily differentiate between the species in the complex that are only represented by a handful of samples, while the VAE suffers much less from the differences in group sizes in the training set. Moreover, we want to point out that ADMIXTURE is a tool for assessing population differentiation, not for species assignment. To use it as an assignment method, there are two options: either infer the allele frequencies in the ancestral populations from the training set and use those to compute the maximum likelihood of ancestry frequencies for the test set; or run ADMIXTURE on the training and test sets combined and use the labels from the training set to label ancestral populations. A major drawback of the former approach is that it is tricky to discover cryptic taxa or outliers in the test set, while the latter approach makes the training set results depend on the test set they are combined with during the run. More importantly, ADMIXTURE performs worse than the VAE on the An. gambiae complex training set by itself, and identifies only two to three different groups among the five diverged species (An. melas, An. merus, An. quadriannulatus, An. bwambae and An. fontenillei). For more information, see page 20 in the manuscript and the newly added supplementary information section 5.

    One important use case of our method is to identify interesting samples, e.g. potential hybrids or cryptic taxa, for subsequent whole genome sequencing. After selection and whole genome sequencing of interesting samples detected by ANOSPP+NNoVAE, ADMIXTURE may be useful as one of the tools to investigate such samples.

    A final concern is less methodological and more related to the biology of the system. I am curious about the possibility of ascertainment bias induced by the amplicon panel. In particular, the authors conclusively demonstrate they can do species assignment with species that are already known. Nonetheless, there is the possibility of unsampled species and/or cryptic species. This latter issue is brought up in passing in the 'Gambiae complex classifier datasets' section but I think the possibility deserves a formal treatment. This is particularly important because the system shows such high levels of hybridization that the possibility of speciation by admixture is not trivial.

    We appreciate the reviewer’s concern regarding ascertainment bias in the amplicon panel. The targets have been selected based on multiple sequence alignments of all Anopheles reference genomes at the time (Makunin et al. Mol Ecol Resour, 2022). Using sequenced species from four different subgenera, the species span a considerable amount of evolutionary time in the Anopheles genus. For all species we have since tested the panel on, we find that at least half of the targets get amplified.

    We share the reviewer’s concern regarding species which are not (yet) represented in the reference database. This is one of the main advantages of the Nearest Neighbour method: it works on three levels of increasing granularity. So for samples that cannot be assigned at species level, we are often able to identify the group of species from the reference database that they are closest to. In particular, the situation of a test sample whose species is not represented in the reference database is mimicked in the drop-out experiment by the species-groups which contain only one sample. On page 16 in the manuscript, we explain how NNoVAE deals with such samples and we show that in the majority of cases NNoVAE assigns the sample to a group of closely related species rather than misclassifying it more specifically to the wrong species.

    In summary, the main limitation of the manuscript is that the authors do not really elaborate on the need for this method. The manuscript does show that the method is feasible but it is not forthcoming on why this is of importance, especially when there is the possibility of generating full genome sequences.

    ANOSPP and NNoVAE are specifically designed for high-throughput, accurate species identification across the entire Anopheles genus – WGS is important to address many questions, but is complete overkill for doing species identification. ANOSPP costs only a small fraction of whole genome sequencing, which makes it possible to monitor mosquito populations at much larger scale (e.g., in partnership with our vector biologist collaborators in Africa, we have already generated ANOSPP data for approximately 10,000 mosquitoes and will be running 500,000 over the next few years). Moreover, for most analyses using whole genome sequencing, a reference genome of a sufficiently similar species is required. While we are in a position of privilege having reference genomes for more than 20 species in Anopheles, we have a long way to go before we have hundreds of reference genomes covering the true diversity of the genus.

    NNoVAE can also be used to select interesting samples (e.g. species that have not been through the panel before, divergent populations, potential hybrids), which can be submitted for whole genome sequencing subsequently.

    Since Anopheles is arguably one of the most important insects to characterize genetically, the ANOSPP panel is certainly important but I am not completely sure the method of species assignment is novel or groundbreaking.

    Reviewer #2 (Public Review):

    The medically important mosquito genus Anopheles contains many species that are difficult or impossible to distinguish morphologically, even for trained entomologists. Building on prior work on amplicon sequencing, Boddé et al. present a novel set of tools for in silico identification of anopheline mosquitoes. Briefly, they decompose haplotypes generated with amplicon sequencing into kmers to facilitate the process of finding similar sequences; then, using the closest sequence or sequences ("nearest neighbors") to a target, they predict taxonomic identity by the frequency of the neighbor sequences in all groups present in a reference database. In the An. gambiae species complex, which is well-known for its historical and ongoing introgression between closely-related species, this approach cannot distinguish species. Therefore, they also apply a deep learning method, variational autoencoders, to predict species identity. The nearest neighbor method achieves high accuracy for species outside the gambiae complex, and the variational autoencoder method achieves high accuracy for species within the complex.
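
    The nearest-neighbour step summarised above, in which the closest reference sequence or sequences are found and their group frequencies reported, can be illustrated with a short sketch. The function name and the input layout (precomputed distances keyed by reference ID) are assumptions made for this example and do not reflect the actual NNoVAE interface.

```python
from collections import Counter
from typing import Dict


def nearest_group_frequencies(
    distances: Dict[str, float],
    labels: Dict[str, str],
) -> Dict[str, float]:
    """Share of the nearest reference sequences that falls in each group.

    distances: reference ID -> distance from the query haplotype (assumed precomputed)
    labels:    reference ID -> group label in the reference database
    """
    best = min(distances.values())
    # Keep every reference tied at the minimum distance.
    nearest = [ref for ref, d in distances.items() if d == best]
    votes = Counter(labels[ref] for ref in nearest)
    return {group: n / len(nearest) for group, n in votes.items()}


if __name__ == "__main__":
    dist = {"ref1": 0.1, "ref2": 0.1, "ref3": 0.4}
    lab = {"ref1": "group_A", "ref2": "group_B", "ref3": "group_A"}
    print(nearest_group_frequencies(dist, lab))
```

    When the nearest neighbours are split across groups, as in the toy input, the frequencies expose the ambiguity instead of forcing a single label.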

    The main strength of this method (along with the associated methods in the paper on which this work builds) is its ability to speed up the identification of anopheline mosquitoes, therefore facilitating larger sample sizes for a wide breadth of questions in vector biology and beyond. This technique has the added advantage over many existing molecular identification protocols of being non-destructive. This high-throughput identification protocol that relies on a relatively straightforward amplicon sequencing procedure may be especially useful for the understudied species outside the well-resourced gambiae complex.

    An additional and intriguing strength of this method is that, when a species label cannot be predicted, some basic taxonomic predictions may still be made in some cases. Indeed, even in the case of known species, the authors find possible cryptic variation within An. hyrcanus and An. nili, demonstrating how useful this new tool can be.

    The main weakness of this method is that, as the authors note, accuracy is dependent on the quality and breadth of the reference database (which in turn relies on the expertise of entomologists). A substantial portion of the current reference database, NNv1, comes from one species complex, An. gambiae. This is reasonable given the complex's medical importance and long history of study; however, for that same reason, robust molecular and computational tools for identifying species in this complex already exist. The deep learning portion of this manuscript is a valuable development that can eventually be applied to other species complexes, but building up a sufficient database of specimens is non-trivial. For that reason, the nearest neighbor method may be the more immediately impactful portion of this paper; however, its usefulness will depend on good sampling and coverage outside the gambiae complex.

    Another potential caveat of this method is its portability. It is not clear from either the manuscript or the code repository how easy it would be for other researchers to use this method, and whether they would need to regenerate the reference database themselves. The authors clearly have expansive and immediate plans for this workflow; however, as many researchers will read this manuscript with an eye towards using these methods themselves, clarifying this point would be valuable.

    This is an important point; currently the amplicon panel is only run on specialised robots, but we are working to adapt the protocol so that it can be run in any basic molecular lab. We have now clarified this in the conclusion. Furthermore, there is never a need to regenerate the reference database – it is fully publicly available at github.com/mariloubodde/NNoVAE and version controlled. As we obtain ANOSPP data from additional samples, representing new species or new within-species diversity, we will add these to the reference database and create an updated openly available version.

    The authors present data suggesting that their method is highly accurate in most of the species or groups tested. While the usefulness of this method will depend on the reference database, two points ameliorate this potential concern: it is already accurate on a wide breadth of species, including the understudied ones outside the An. gambiae complex; additionally, even when a specific species identification cannot be made, the specimen may be able to be placed in a higher taxonomic group.

    Overall, these new methods offer an additional avenue for identifying anopheline species; given their high-throughput nature, they will be most useful to researchers doing bulk collections or surveillance, especially where multiple morphologically similar species are common. These methods have the potential to speed up vector surveillance and the generation of many new insights into anopheline biology, genetics, and phylogeny.

  2. Reviewer #3 (Public Review):

    This manuscript develops new approaches to species assignment in Anopheles, using k-mer-based similarity metrics and a variational autoencoder (VAE) to overcome ambiguity in sequence alignment between divergent lineages and the complex relationships between lineages in this group. Overall this manuscript is well written and its claims are well supported - I feel it will be of substantial utility to the mosquito research and broader ecology and evolutionary biology communities.

    The authors demonstrate that applying k-mer-based similarity with nearest-neighbor based assignment across their amplicon panel can successfully assign samples to coarse-grained taxa, but that this approach has difficulty differentiating more subtly differentiated groups like the Anopheles gambiae complex. They subsequently develop a VAE that successfully differentiates most samples in this complex, and assigns new samples within the convex hull defined by the samples of a given group. This approach is successfully applied to a large number of samples from Burkina Faso and Gabon, assigning most samples unambiguously and flagging outlier samples for further genome-wide analysis. These case studies demonstrate the utility and scalability of this approach.
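
    Convex-hull assignment of the kind mentioned here can be illustrated with a small geometric sketch. It assumes a two-dimensional latent space purely for simplicity (the dimensionality used by the authors is not stated in this review), and all function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]  # drop the duplicated end points


def in_hull(point: Point, hull: List[Point]) -> bool:
    """True if the point lies inside or on the boundary of a CCW convex hull."""
    n = len(hull)
    if n < 3:
        return False  # degenerate hull: make no assignment in this sketch
    return all(cross(hull[i], hull[(i + 1) % n], point) >= 0 for i in range(n))


if __name__ == "__main__":
    # Toy latent coordinates for the reference samples of one species.
    reference = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
    hull = convex_hull(reference)
    print(in_hull((1, 1), hull), in_hull((3, 1), hull))
```

    A new sample whose latent position falls inside exactly one species' hull would be assigned to that species under this scheme, while a sample outside every hull would be flagged as an outlier for follow-up.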

    It's not entirely clear from the manuscript how much better (in quantitative terms) this approach is compared to the earlier non-kmer approach or possible alternatives, though it certainly seems to perform much better than the previously published method (Makunin 2022). The approach itself is clearly explained and transparently documented, and it appears to be well suited to the goal of assigning very large numbers of samples accurately at a low cost.