Abstract
Environmental DNA (eDNA) metabarcoding has become increasingly popular as an approach to efficiently document biodiversity within an environment characterized by relative uncertainty. Compared to traditional stereomicroscopic approaches, eDNA metabarcoding is simpler and less costly. Under ideal circumstances, researchers can directly extrapolate the true relative abundance of a taxon in the sampled environment by computing the proportion of sequenced reads assigned to that taxon. Although several previous studies have been carried out under this assumption, some researchers have raised the possibility that both biological and technical biases exist in eDNA metabarcoding studies, leading to inconsistent estimates of community composition. Using mock community datasets from nine previous studies, we showed that bias correction in eDNA metabarcoding studies is indeed a predictable task. We also found reads and amp_gc to be the two most important feature predictors, such that these two features alone are enough to retain most of the model performance. Experiment-specific information was found to be necessary for bias-correcting models to perform well. However, we have yet to develop an effective way of converting knowledge regarding spike-in (SP) samples into experiment-specific information that can be learned by existing models. Nonetheless, under the data-specific scenario, AdaBoost showed an optimal 35.62% improvement over the baseline established by the vanilla control model. Additionally, we showed that model performance could be rescued by the availability of experiment-specific data, under which XgBoost exhibited an optimal 81.57% improvement over the baseline. Our work suggests that future metabarcoding studies would benefit from performing supervised learning (SL)-based bias correction prior to downstream analyses. Moreover, if experiment-specific data are available at the time of the study, it is optimal to construct an XgBoost model. Otherwise, it is still recommended to construct an AdaBoost model, which showed marginal improvement over the baseline with no modeling.
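The read-proportion estimate described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the taxon labels, and the read counts are hypothetical, chosen only to show how a naive estimate is formed and how its mean squared error (MSE) against a known mock community could be computed.

```python
def read_proportions(read_counts):
    """Return each taxon's share of the total assigned reads."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads assigned to any taxon")
    return {taxon: count / total for taxon, count in read_counts.items()}


def mean_squared_error(estimated, true):
    """MSE between estimated and true relative abundances, over the true taxa."""
    return sum((estimated.get(t, 0.0) - true[t]) ** 2 for t in true) / len(true)


# Hypothetical mock community: three taxa pooled in equal proportions.
true_abundance = {"taxon_A": 1 / 3, "taxon_B": 1 / 3, "taxon_C": 1 / 3}

# Hypothetical read counts, skewed as biological or technical bias might skew them.
raw_reads = {"taxon_A": 600, "taxon_B": 300, "taxon_C": 100}

naive = read_proportions(raw_reads)            # uncorrected abundance estimate
mse = mean_squared_error(naive, true_abundance)

print(naive)
print(round(mse, 4))
```

In this toy case the naive estimate assigns taxon_A a share of 0.6 although its true share is one third. A bias-correcting model would aim to reduce that gap, and therefore the MSE.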
One Sentence Summary
Supervised learning models, particularly XgBoost and AdaBoost, can effectively correct biases in eDNA metabarcoding studies, with performance improving significantly when experiment-specific data is available.
Competing Interest Statement
The authors have declared no competing interest.
Abbreviations
- eDNA: environmental DNA
- PCR: polymerase chain reaction
- NGS: next generation sequencing
- r_raw: raw read count
- r_norm: normalized read count
- r_true: true read count
- SL: supervised learning
- ML: machine learning
- NMI: normalized mutual information
- feature_i: the i-th feature
- SP: spike-in samples
- NSP: non-spike-in samples
- MSE: mean squared error
- LOOCV: leave-one-out cross-validation
- α_raw: raw abundance
- α_true: true abundance
- Bagging: Bootstrap AGGregatING method
- PFI: permutation feature importance algorithm