Assessing the combined performance of supervised learning and spike-in constructs for bias correction in eDNA metabarcoding

doi:10.1101/2025.03.25.645285

2025 · doi:10.1101/2025.03.25.645285

preprint OA: closed

Abstract

Environmental DNA (eDNA) metabarcoding has become increasingly popular as an approach to efficiently document biodiversity within an environment characterized by relative uncertainty. Compared to traditional stereomicroscopic approaches, eDNA metabarcoding is simpler and less costly. Under ideal circumstances, researchers can directly extrapolate the true relative abundance of a particular taxon in the sampled environment by computing the proportion of sequenced reads assigned to that taxon. Although several previous studies have been carried out under this assumption, some researchers have raised the possibility that both biological and technical biases exist in eDNA metabarcoding studies, leading to inconsistent estimates of community composition.

Using mock community datasets from nine relevant past studies, we showed that bias correction in eDNA metabarcoding studies is indeed a predictable task. We also found reads and amp_gc to be the two most important feature predictors, such that these two features alone are enough to retain most of the model performance. Experiment-specific information was found to be necessary for bias-correcting models to perform well. However, we have yet to develop an effective way of converting knowledge regarding spike-in (SP) samples into experiment-specific information that can be learned by existing models. Nonetheless, under the data-specific scenario, AdaBoost showed an optimal 35.62% improvement from the baseline established by the vanilla control model. Additionally, we showed that model performance could be rescued by the availability of experiment-specific data, under which XgBoost exhibited an optimal 81.57% improvement from the baseline.

Our work suggests that future metabarcoding studies would benefit from performing supervised learning (SL)-based bias correction prior to downstream analyses. Moreover, if experiment-specific data is available at the time of the study, it is optimal to construct an XgBoost model. Otherwise, it is still recommended to construct an AdaBoost model, which showed marginal improvement from the baseline with no modeling.

One Sentence Summary

Supervised learning models, particularly XgBoost and AdaBoost, can effectively correct biases in eDNA metabarcoding studies, with performance improving significantly when experiment-specific data is available.

Competing Interest Statement

The authors have declared no competing interest.

Abbreviations

- eDNA: environmental DNA
- PCR: polymerase chain reaction
- NGS: next generation sequencing
- r_raw: raw read count
- r_norm: normalized read count
- r_true: true read count
- SL: supervised learning
- ML: machine learning
- NMI: normalized mutual information
- feature_i: the i-th feature
- SP: spike-in samples
- NSP: non-spike-in samples
- MSE: mean squared error
- LOOCV: leave-one-out cross-validation
- α_raw: raw abundance
- α_true: true abundance
- Bagging: Bootstrap AGGregatING method
- PFI: permutation feature importance algorithm
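The abstract describes the uncorrected estimate as the proportion of sequenced reads assigned to each taxon, and reports model gains as a percentage improvement over a baseline. The following is a minimal sketch of both quantities, not the authors' implementation. The taxon names and read counts are hypothetical, and the improvement formula (relative reduction in MSE against the baseline) is an assumption, since the abstract does not state how the percentage was computed.

```python
from typing import Dict, Sequence


def relative_abundance(read_counts: Dict[str, int]) -> Dict[str, float]:
    """Naive, uncorrected estimate: share of all reads assigned to each taxon."""
    total = sum(read_counts.values())
    if total <= 0:
        raise ValueError("total read count must be positive")
    return {taxon: count / total for taxon, count in read_counts.items()}


def mse(predicted: Sequence[float], true: Sequence[float]) -> float:
    """Mean squared error between predicted and true abundances."""
    if len(predicted) != len(true) or len(true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, true)) / len(true)


def percent_improvement(baseline_error: float, model_error: float) -> float:
    """ASSUMED definition: relative error reduction versus the baseline, in percent."""
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return 100.0 * (baseline_error - model_error) / baseline_error


# Hypothetical mock community: three taxa mixed in equal true proportions,
# but recovered with skewed read counts (illustrative numbers only).
raw_reads = {"taxon_a": 600, "taxon_b": 300, "taxon_c": 100}
true_abundance = {"taxon_a": 1 / 3, "taxon_b": 1 / 3, "taxon_c": 1 / 3}

naive = relative_abundance(raw_reads)
taxa = sorted(raw_reads)
baseline_mse = mse([naive[t] for t in taxa], [true_abundance[t] for t in taxa])
```

A bias-correcting model would be scored the same way: compute its MSE against the known mock-community abundances, then pass that value and `baseline_mse` to `percent_improvement`.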

