Assessing the combined performance of supervised learning and spike-in constructs for bias correction in eDNA metabarcoding

Abstract

Environmental DNA (eDNA) metabarcoding has become increasingly popular as an approach to efficiently document biodiversity within an environment characterized by relative uncertainty. Compared to traditional stereomicroscopic approaches, eDNA metabarcoding is simpler and less costly. Under ideal circumstances, researchers can directly extrapolate the true relative abundance of a taxon in the sampled environment by computing the proportion of sequenced reads assigned to that taxon. Although several previous studies have been carried out under this assumption, some researchers have raised the possibility that both biological and technical biases exist in eDNA metabarcoding studies, leading to inconsistent estimates of community composition. Using mock community datasets from nine previous studies, we showed that bias correction in eDNA metabarcoding studies is indeed a predictable task. We also found reads and amp_gc to be the two most important predictive features: these two features alone are enough to retain most of the model performance. Experiment-specific information was found to be necessary for bias-correcting models to perform well. However, we have yet to develop an effective way of converting knowledge about spike-in (SP) samples into experiment-specific information that existing models can learn from. Nonetheless, under the data-specific scenario, AdaBoost performed best, with a 35.62% improvement over the baseline established by the vanilla control model. Additionally, we showed that model performance could be rescued by the availability of experiment-specific data, under which XGBoost performed best, with an 81.57% improvement over the baseline. Our work suggests that future metabarcoding studies would benefit from performing supervised learning (SL)-based bias correction prior to downstream analyses. Moreover, if experiment-specific data are available at the time of the study, constructing an XGBoost model is the best option. Otherwise, we still recommend constructing an AdaBoost model, which showed a marginal improvement over the baseline with no modeling.
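
To make the uncorrected estimate and the notion of "improvement from the baseline" concrete, the following is a minimal sketch, not the authors' pipeline. It computes relative abundance as the proportion of reads assigned to each taxon, then compares the error of that uncorrected estimate with the error of a corrected estimate. The abstract does not state which error metric was used, so mean absolute error is an assumption here, and the taxon names, read counts, and "corrected" values are hypothetical placeholders rather than data from the study.

```python
from typing import Dict


def read_proportions(read_counts: Dict[str, int]) -> Dict[str, float]:
    """Uncorrected estimate: each taxon's share of all assigned reads."""
    total = sum(read_counts.values())
    if total <= 0:
        raise ValueError("total read count must be positive")
    return {taxon: n / total for taxon, n in read_counts.items()}


def mean_absolute_error(estimated: Dict[str, float], true: Dict[str, float]) -> float:
    """Average absolute gap between estimated and true relative abundance.

    Assumption: the source does not name its metric; MAE is used for illustration.
    """
    return sum(abs(estimated[t] - true[t]) for t in true) / len(true)


def percent_improvement(baseline_error: float, model_error: float) -> float:
    """Percentage by which a model reduces error relative to the baseline."""
    return 100.0 * (baseline_error - model_error) / baseline_error


# Hypothetical mock community with known composition (not from the study).
true_abundance = {"taxon_a": 0.5, "taxon_b": 0.3, "taxon_c": 0.2}
reads = {"taxon_a": 700, "taxon_b": 200, "taxon_c": 100}

# Baseline: read proportions taken at face value, with no modeling.
naive = read_proportions(reads)
baseline_mae = mean_absolute_error(naive, true_abundance)

# Stand-in for the output of a trained bias-correction model (hypothetical values).
corrected = {"taxon_a": 0.55, "taxon_b": 0.28, "taxon_c": 0.17}
model_mae = mean_absolute_error(corrected, true_abundance)

improvement = percent_improvement(baseline_mae, model_mae)
print(f"baseline MAE = {baseline_mae:.4f}, model MAE = {model_mae:.4f}")
print(f"improvement over baseline = {improvement:.2f}%")
```

In a real analysis, the corrected values would come from a fitted regressor using features such as reads and amp_gc, evaluated on mock communities held out from training.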

One Sentence Summary

Supervised learning models, particularly XGBoost and AdaBoost, can effectively correct biases in eDNA metabarcoding studies, with performance improving significantly when experiment-specific data are available.
