Testing the limits of short-reads metagenomic classifications programs in waste water treating microbial communities

Leandro Gloria, Matteo Ramazzotti

Research Square preprint, version 1 (the latest preprint version). This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-6485360/v1

This work is licensed under a CC BY 4.0 License. Status: Published. Journal publication published 05 Jul, 2025 in Scientific Reports.

Abstract

Biological wastewater treatment processes, such as activated sludge (AS) and aerobic granular sludge (AGS), have proven to be crucial systems for achieving both efficient waste purification and the recovery of valuable resources like poly-hydroxy-alkanoates (PHA). Gaining a deeper understanding of the microbial communities underpinning these technologies would enable their optimization, ultimately reducing costs and increasing efficiency.
To support this research, we quantitatively compared classification methods differing in input sequence length (raw reads, contigs and MAGs), overall search approach (Kaiju, Kraken2, RiboFrame and kMetaShot) and source database, assessing classification performance at both the genus and species levels on an in silico-generated mock community designed to provide a simplified yet comprehensive representation of the complex microbial ecosystems found in AS and AGS. Particular attention was given to the misclassification of eukaryotes as bacteria and vice versa, as well as to the occurrence of false negatives. Notably, Kaiju emerged as the most accurate classifier at both the genus and species levels, followed by RiboFrame and kMetaShot. However, our findings highlight a substantial risk of misclassification across all classifiers and databases, which could significantly hinder the advancement of these technologies by introducing noise and errors for key microbial clades.

Subject areas: Biological sciences/Biological techniques/Bioinformatics; Biological sciences/Biological techniques/Genomic analysis; Biological sciences/Biological techniques/Sequencing; Biological sciences/Biological techniques; Earth and environmental sciences/Ecology; Earth and environmental sciences/Environmental sciences; Biological sciences/Ecology/Ecological genetics; Biological sciences/Ecology/Microbial ecology; Biological sciences/Ecology/Molecular ecology; Biological sciences/Ecology/Restoration ecology

Keywords: Wastewater; Microbial community; Classifications; Aerobic Granular Sludge; Benchmark

Introduction

The benefits derived from the industrial revolutions have fundamentally shaped modern lifestyles, enabling advancements as basic as the ease of reading this very paper.
However, a significant downside of industrialization is the increased demand for the removal of carbon (C), nitrogen (N) and phosphorus (P) from civil and industrial wastewaters [1]. To counterbalance the excessive production of pollutants by industrialized societies, environmental engineering has developed both artificial and biological strategies to restore ecological equilibrium. Among these, the activated sludge (AS) system and its technological evolution, the aerobic granular sludge (AGS) system, are two biological wastewater treatment methods that accelerate processes that would naturally occur over longer timescales [2]. Both AS and AGS rely on the collective metabolic activities of complex microbial communities, primarily composed of prokaryotes. Bacteria such as Candidatus Accumulibacter and Candidatus Competibacter, the most studied phosphate-accumulating organisms (PAOs) and glycogen-accumulating organisms (GAOs), respectively, have long been considered the primary components of AGS systems [2]. However, metagenomic insights have revealed that other PAOs may be better adapted to specific conditions. For instance, Tetrasphaera relies on a broad metabolic repertoire, allowing it to thrive in environments with low concentrations of readily biodegradable carbon [3]. Other bacterial genera frequently identified in metagenomic surveys include Zoogloea, Pseudomonas, Thauera and Flavobacterium [4]. These bacteria are essential for secreting the polysaccharide matrices that embed PAO and GAO populations [4], and they harbor strains capable of denitrification [5] [6] [7]. Furthermore, the AGS granular biomass enables the co-existence of nitrifying bacteria such as Nitrosomonas and denitrifiers in different layers of the same granule [2].
Beyond bacteria, viruses, protozoa, and lower metazoans such as nematodes and rotifers play crucial roles in these microbial communities, as they act as bacterivores, serve as horizontal gene transfer vectors (viruses), or contribute to biomass structuring through their movements and secretions (animals) [8]. Given this complexity, a comprehensive understanding of AGS microbial communities is essential for optimizing reactor performance, accelerating maturation through targeted microbial augmentation, improving depuration efficiency, and even recovering valuable resources from waste streams [4] [9]. For instance, various studies aim to enhance the production of poly-hydroxy-alkanoates (PHA), useful for bioplastics production, by adjusting reactor conditions to selectively enrich specific PHA-producing bacterial genera (e.g., Candidatus Accumulibacter, Thauera and Azoarcus) [4] [10]. However, achieving such optimization first requires a comprehensive understanding of the microbial communities involved, followed by a deeper exploration of their metabolic interactions.

Total DNA sequencing is a widely used approach for comprehensive community characterization, involving the sequencing of bulk DNA extracted from reactor biomass followed by bioinformatic ecological analyses. Currently, short-read sequencing technologies dominate the market due to their high throughput, cost-effectiveness, and low error rates [11], making them well-suited for profiling the complexity of these environmental communities. In any case, this approach always relies on bioinformatic classification methods that are known to be prone to misclassification errors [12] [13], particularly when analysing complex environmental samples. Moreover, most benchmark studies published so far are heavily biased toward Homo sapiens-associated microbiota; although valuable in clinical research, they lack specificity for environmental settings.
To support research on AS and AGS microbial communities, we evaluated various classification strategies for short-read sequencing (150 bp), including read-, assembled contig- and MAG-based approaches. To explore different algorithmic approaches, this analysis employed four taxonomic classifiers, namely Kaiju [14], Kraken2 [15], RiboFrame [16] and kMetaShot [17], using multiple settings and databases. These classifiers were chosen for their proven effectiveness within their respective classification methodologies: Kaiju translates nucleotide sequences in all six possible reading frames and performs protein-level matching of the resulting amino acid sequences using the Burrows-Wheeler transform [14]; Kraken2 classifies sequences by analysing the frequency of distinctive k-mer patterns (sequence portions of length "k") [15]; RiboFrame extracts putative 16S reads from whole-genome sequencing data and applies k-mer-based Bayesian classification specifically to these reads using a dedicated 16S database [16]; kMetaShot is a k-mer-based classifier tailored for MAGs, utilizing a custom-built database incorporating reference coding sequences, 16S rRNA and tRNA sequences from NCBI [17]. The evaluation was conducted using a mock community designed to provide a simplified yet representative model of the complex microbial ecosystems found in AS and AGS systems. The mock was purposely generated in silico to control the exact relative abundance of each clade and to avoid kitome contaminants [18]. The comparison accounted for taxa that could not be classified because of database limitations and, where possible, also tested custom databases that ensured the presence of relevant AS- and AGS-associated clades. Additionally, we assessed the risk of misclassifying higher metazoans as bacteria (and vice versa) and evaluated their removal before classification using two widely used decontamination tools, Kraken2 and Bowtie2 [19] [12].
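The protein-level strategy described above for Kaiju starts from a six-frame translation of each read. The following is a minimal Python sketch of that first step only, under stated assumptions: it uses the standard genetic code, splits translations at stop codons, and keeps fragments of at least `min_len` residues. The function names and the `min_len` parameter are hypothetical illustrations (the default of 11 merely echoes the lowest m value tested in this study); this is not Kaiju's actual implementation, which additionally performs the Burrows-Wheeler-based database search.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_fragments(read, min_len=11):
    """Translate a read in all six frames and split at stop codons.

    min_len is a hypothetical cutoff used only to drop short fragments.
    """
    read = read.upper()
    fragments = []
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            protein = translate(strand[offset:])
            fragments.extend(f for f in protein.split("*") if len(f) >= min_len)
    return fragments


if __name__ == "__main__":
    # Toy read, not taken from the mock community.
    print(six_frame_fragments("ATGAAATAGGCCGCC", min_len=2))
```

Because matching happens on amino acid fragments, a synonymous nucleotide substitution in the read leaves the translated fragment unchanged, which is the kind of tolerance a protein-level search is expected to provide.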
Results

Mock processing stats

After BBDuk filtering, 46,315,875 out of 50,001,759 paired reads (92.6%) remained available for analysis. Kaiju classified between 90% (m = 11) and 80% (m = 42) of these sequences with either of its databases, with no variation depending on the E-value when the m parameter was set to 30 or 42. However, in every setting Kaiju labelled about 20% additional sequences as "cannot be assigned to a (non-viral) genus", which did not add significant insights. Kraken2, when using the nt_core database, exhibited a strong dependency on the confidence threshold: at 0.05 confidence it classified about 50% of the reads, whereas at the highest confidence threshold tested the proportion of classified reads dropped below 6%. With the SILVA database, Kraken2 classification rates were far lower, with less than 2% of reads classified even at the most lenient thresholds. Although it used the same SILVA database, RiboFrame classified between 3,000 (V3-V4 16S, confidence 0.9) and 70,000 (full-length 16S, confidence 0.8) paired reads across the tested settings. MetaBat2 consistently produced about 48 similar MAGs, regardless of the MEGAHIT assembly settings used. kMetaShot classified almost all MAGs (e.g., 41 out of 46) when no confidence threshold was applied. However, the number of classified MAGs decreased as the confidence threshold increased: with confidence set to 0.2, kMetaShot classified more than half of the MAGs for each setting, while with confidence 0.4 it classified approximately a third of them. Among the classifiers, RiboFrame was the least demanding in terms of RAM usage, requiring approximately 20 GB. In contrast, Kaiju and Kraken2 each required over 200 GB of RAM. The most memory-intensive approach was kMetaShot, which, when run in multithreaded mode on MAGs, consumed 24 GB per thread.

Comparison at genus-level classification

Fig. 1 shows the relative abundances of the genera actually featured in the mock across the various settings, together with the percentages of misclassifications. Supplementary Table 2 reports the relative abundances of the true mock genera across the settings. Notably, the only classifier that did not produce erroneous classifications at the genus level was kMetaShot on MAGs, regardless of the confidence levels and MEGAHIT settings. However, the same performance was not observed at the contig level, where many erroneous classifications and missed true genera were observed. Approximately 25% of the classifications from Kaiju and Kraken2 (using the nt core database) were erroneous, with Kaiju showing less dependence on the settings employed, while Kraken2 was strongly influenced by the confidence level. In fact, the percentage of misclassifications with Kraken2 increased at a confidence level of 0.99, indicating that false negative classifications (missed true genera) were more frequent than correct ones. Increasing the Kraken2 confidence level from 0.05 to 0.15 slightly reduced misclassification percentages, although fewer reads from Candidatus Accumulibacter were identified. It is noteworthy that Candidatus Competibacter was detected by Kraken2 at the lower confidence levels, although only in trace amounts. The true genus abundances inferred by Kaiju closely mirrored the actual mock proportions with both the nr euk and nr euk + databases, although a few clades were missing with nr euk. In particular, the ratios between the relative abundances of the four most abundant genera were successfully captured by Kaiju. Both Kraken2 and Kaiju performed better on reads than on contigs. Kraken2 completely missed the true genus abundances when using the SILVA database.
On the other hand, RiboFrame demonstrated the lowest percentage of misclassifications (after kMetaShot on MAGs) and captured most of the true mock abundances (after Kaiju) using the same SILVA database, although it overestimated the abundance of Flavobacterium. The performance of the classifiers was compared using the Hellinger ecological distance, visualized through the PCoA in Fig. 2. Kraken2 classifications using the SILVA database were excluded from this analysis, as their estimated profiles deviated so strongly from the mock community that they dominated the overall variability and obscured differences among the other samples. When Kraken2 was applied with the nt core database, its estimated profile improved but remained different from the mock, particularly when the confidence threshold was increased or when the analysis was performed at the contig level. The pipelines that most closely resembled the mock were kMetaShot on MAGs (especially with the MEGAHIT setting "metalarge"), RiboFrame on full 16S reads (with a confidence level of 0.8) and Kaiju (regardless of settings and database). As expected, RiboFrame performed better when applied to the full 16S rRNA gene than to a single 16S hypervariable region, although the overall classification results remained comparable. Overall, the classifications diverged further from the mock profile as classification confidence levels increased. These results were also confirmed when the Bray-Curtis dissimilarity index was applied (Supplementary Fig. 1). Most of the misclassifications in Kaiju were due to observations labelled as "cannot be assigned to a (non-viral) genus" by the software, summarized as "As generic virus" in Fig. 3. Excluding these, less than 4% of the reads were misclassified by Kaiju. Bradyrhizobium, Pseudomonas, Acinetobacter, Sphingomonas, Stenotrophomonas and Chlamydia were among the most abundant genera incorrectly inferred by Kaiju but absent in the mock.
The total amount of these genera misclassified by Kaiju was reduced to about 2% when the minimal query coverage threshold ("m") was set to 42 (Fig. 3 and Supplementary Fig. 2). Moreover, increasing the stringency of Kaiju did not result in any loss of true positive genus identifications. No significant differences were observed in the overall amount of misclassifications between Kaiju on nr euk and on nr euk +. Kraken2 with the SILVA database erroneously assigned many reads to Pseudomonas, while Kraken2 with the nt core database continued to misclassify reads as Mycobacterium, even at higher classification confidence levels. Additionally, the misclassifications of Kraken2 on the nt core database were significantly reduced when it was applied at the contig level, albeit at the expense of true positive identifications. On the other hand, kMetaShot applied at the contig level exhibited the highest frequency of misclassifications after Kraken2 on SILVA. In contrast, RiboFrame and kMetaShot on MAGs were the classifiers with the fewest misclassifications, with kMetaShot on MAGs showing no misclassified genera.

Comparison at species-level classifications

Fig. 4 shows the relative abundances of the species actually featured in the mock across the various settings, together with the percentages of misclassifications. Supplementary Table 3 reports the relative abundances of the true mock species across the settings. Notably, the species distribution estimated by Kaiju closely resembled that of the mock community with both the nr euk and nr euk + databases, achieving even greater precision than kMetaShot at this taxonomic level. However, Kaiju still underestimated the relative abundances of a few abundant clades, such as Tetrasphaera vanveenii, Thauera sinica and Delftia spp.
In contrast, Kraken2 exhibited substantial deviations from the true mock abundances: the lower confidence threshold increased sensitivity but led to almost 50% of misclassified species, while the higher threshold effectively reduced misclassifications but missed reads from Candidatus Accumulibacter and Candidatus Competibacter. Thauera spp., Novosphingobium spp. and Flavobacterium johnsoniae were among the most frequently misclassified taxa across all settings. When relative abundances were disregarded, kMetaShot at the MAG level proved to be the most precise method for taxonomic identification within the community (Fig. 5A). This result is particularly notable when compared to Kaiju on nr euk +, which reported nearly 1600 erroneous species, and Kraken2 (with the confidence threshold set at 0.99), which reported approximately 600 erroneous species (Fig. 5B). However, it is important to emphasize that most of Kaiju's misclassifications occurred at very low relative abundances (less than 0.1%), with the exception of Thauera sp. and Tetrasphaera sp., which had relative abundances of 1% and 1.5%, respectively. Notably, these misclassified species still belong to clades actually featured in the mock. Furthermore, Fig. 5 highlights the varying sensitivities of the classifiers in detecting the true mock species. Kaiju missed only 20 species, followed by Kraken2 with 157 missed species and, lastly, kMetaShot on MAGs, which exhibited the lowest sensitivity and missed nearly all of the true species. Kaiju, however, failed to detect phage T4 reads despite the phage being included in its database, whereas Kraken2 recognised phage T4 reads, although only at permissive confidence thresholds (lower than 0.85). The 15 taxa featured in the mock but not detected by any of the classifications were species belonging to the genera Halomonas, Novosphingobium, Thauera and Paramecium, although other species of the same genera were identified.
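The two kinds of comparison used in the genus- and species-level sections, a distance between abundance profiles and a count of erroneous versus missed taxa, can be sketched in a few lines of Python. This is an illustrative sketch, not the analysis code of the study: all function names are hypothetical, the Hellinger distance is written in the ecological convention (Euclidean distance between square-root-transformed relative abundances, ranging from 0 to √2; the probabilistic convention divides by √2), and the abundances in the example are toy values rather than results from the mock.

```python
import math


def relative_abundances(counts):
    """Convert raw counts per taxon into proportions that sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()} if total else {}


def hellinger(p, q):
    """Euclidean distance between square-root-transformed profiles."""
    taxa = set(p) | set(q)
    return math.sqrt(
        sum((math.sqrt(p.get(t, 0.0)) - math.sqrt(q.get(t, 0.0))) ** 2 for t in taxa)
    )


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two abundance profiles."""
    taxa = set(p) | set(q)
    shared = sum(min(p.get(t, 0.0), q.get(t, 0.0)) for t in taxa)
    total = sum(p.values()) + sum(q.values())
    return 1.0 - 2.0 * shared / total if total else 0.0


def detection_summary(true_taxa, predicted_taxa):
    """Count correct, erroneous (false positive) and missed (false negative) taxa."""
    true_taxa, predicted_taxa = set(true_taxa), set(predicted_taxa)
    return {
        "correct": len(true_taxa & predicted_taxa),
        "erroneous": len(predicted_taxa - true_taxa),
        "missed": len(true_taxa - predicted_taxa),
    }


if __name__ == "__main__":
    # Toy profiles for illustration only; not values from the study.
    mock = relative_abundances({"Zoogloea": 50, "Thauera": 30, "Flavobacterium": 20})
    estimate = relative_abundances({"Zoogloea": 45, "Thauera": 25, "Pseudomonas": 30})
    print(hellinger(mock, estimate), bray_curtis(mock, estimate))
    print(detection_summary(mock, estimate))
```

Separating the abundance-based distance from the presence/absence summary mirrors the distinction drawn above: a classifier can report many erroneous taxa at negligible abundance and still sit close to the mock in distance terms.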
Classification performances of phage T4 and lower metazoan

Kaiju successfully identified approximately half of the T4 phage reads when executed using the database containing only viral sequences. However, even under the most stringent settings, over 46 million sequences were misclassified as viruses within this focused database. The eukaryote-specific classifier, EukDetect, accurately identified 23 reads of Diploscapter spp. and did not report any misclassifications after applying its default filtering procedures. However, this high precision came at the cost of a substantial loss in sensitivity, as the majority of eukaryotic sequences remained unclassified. Notably, approximately 300 Novosphingobium aureum sequences were initially misclassified as the fungus Wolfiporia cocos by the first step of EukDetect, which relies on Bowtie2 alignment against the EukDetect database. Furthermore, running Kaiju on the custom database constructed exclusively from lower metazoan sequences led to excessive false positives. In fact, when Kaiju was used with this focused database, despite successfully identifying Paramecium and Diploscapter, it also erroneously assigned bacterial and human-derived reads to many other nematodes and rotifers. For instance, reads from nearly every bacterial clade included in this mock were misclassified as Steinernema, and a substantial number of Homo sapiens reads were mistakenly assigned to nematodes. Although applying high-stringency settings significantly reduced these false positives, Kaiju's precision on the lower metazoan database remained relatively low. On the other hand, Kaiju and Kraken2 with complete databases performed better in terms of overall sensitivity (Supplementary Fig. 3). In fact, Kaiju with nr euk + was able to identify Diploscapter and Homo, maintaining the overall proportions between the clades despite underestimating their relative abundance (Supplementary Fig. 3).
Moreover, using Kaiju with nr euk + avoided the misclassification of Diploscapter reads as bacteria (observed with nr euk), while only 137 reads from bacterial genera were incorrectly identified as Diploscapter with the most stringent settings. However, other eukaryotic misclassifications were also observed with Kaiju using the nr euk + database. For instance, a small fraction of Novosphingobium- and Propionivibrio-derived reads were misclassified as Trichinella (0.003%). Similarly, bacterial and Plasmodium reads were misidentified as fungi (Termitomyces 0.003%, Wolfiporia 0.001%). Kraken2 with nt core detected Homo, Diploscapter and phage T4, with only trace amounts of Diploscapter (0.009%) at the most permissive settings. Notably, phage reads were consistently identified even at a confidence threshold of 0.99, and the absence of other viral clades in the mock was accurately confirmed. However, Kraken2's high sensitivity came at the cost of increased noise, as it misclassified Novosphingobium and Dechloromonas spp. as Wolfiporia (0.02%) and Gallus gallus (0.1%), respectively, even under the most stringent settings.

Homo sapiens reads misclassifications as bacteria and decontamination test

Kraken2 on the nt core database correctly identified about half of the Homo sapiens reads when run at low confidence thresholds. Moreover, Kraken2 did not misclassify them as bacteria, recognizing at least the correct clade (e.g., Hominidae, Bilateria, etc.) or, at worst, misclassifying some as monkey-derived reads (e.g., Catarrhini spp.). In contrast, Kaiju, which performed well overall in the current benchmark, misclassified H. sapiens reads as bacteria (e.g., Enterococcus, Staphylococcus, Pseudomonas, Klebsiella pneumoniae, Acinetobacter baumannii and Escherichia coli) when used with the nr euk database.
Using nr euk +, which includes Homo sapiens sequences, allowed Kaiju to correctly identify a few Homo reads (less than 10%) and, more importantly, to avoid mistaking them for bacteria. However, Kaiju frequently misidentified H. sapiens reads as Plasmodium ovale with both of its databases. These misclassifications were consistent across the different Kaiju settings, although they were significantly reduced with the most stringent parameters ("E = 0.0001 and m = 42") and when using the nr euk + database. While the total number of H. sapiens reads misclassified by Kaiju was relatively low (10,543 read pairs with nr euk and only 75 with nr euk +, under the most stringent parameters), such errors could lead to incorrect assumptions regarding the presence of certain rare taxa in the community. To address this issue, various decontamination methods for H. sapiens reads, as well as for other eukaryotic DNA residuals likely to originate from real wastewater, were tested before microbial community classification (Fig. 6). Among the tested methods, Bowtie2 demonstrated the highest sensitivity in identifying H. sapiens reads while also reporting a relatively low number of false positives (i.e., microbial reads misclassified as Homo sapiens), particularly when used with end-to-end alignments, which captured about 5000 paired reads misidentified as Homo. Specifically, Bowtie2's false positives primarily consisted of misclassified Propionivibrio, Novosphingobium, Paramecium and Dechloromonas reads. Although slightly less effective, Kraken2 on the GRCh38 human database with a confidence threshold of 0.45 showed comparable performance. Kraken2 surpassed Bowtie2 in precision when the confidence threshold was increased. In fact, Kraken2 misclassified only around 200 reads, mainly from Novosphingobium and Dechloromonas, as H. sapiens at a confidence level of 0.99.
However, this improvement in precision came at the expense of a significant reduction in sensitivity, as only about one-third of the true H. sapiens reads (approximately 70,000 out of 240,000) were correctly identified. Moreover, when Kraken2 decontamination was performed using a broader eukaryotic database, the total number of misclassifications increased substantially, resulting in nearly equal numbers of false positives and true positives. Specifically, a large proportion of reads originating from Flavobacterium were misclassified as Ostrinia furnacalis (a hexapod, i.e. an insect), while many Novosphingobium-derived reads were incorrectly identified as belonging to the plant genus Elaeis, regardless of the confidence threshold employed.

Discussion

Current research efforts continue to investigate microbiomes using available sequencing technologies and bioinformatics workflows; however, many of these tools have been validated primarily for human-associated microbiota. Accordingly, this study aimed to systematically evaluate the advantages and limitations of commonly used taxonomic classification approaches following short-read DNA sequencing. A considerable proportion of sequencing reads remained unclassified, even though the genomes composing the mock community were sourced from public databases, highlighting intrinsic classification limitations. The proportion of both unclassified and misclassified reads is expected to increase in real samples, given the higher complexity of real microbial communities and the presence of bacteria not represented in current databases. Among the classifiers tested, Kaiju with either nr euk or nr euk + demonstrated the best performance, capturing the relative abundance ratios of the most prevalent genera and species. However, approximately 25% of its classifications were incorrect, with the majority labelled as "cannot be assigned to a (non-viral) genus".
Such classifications provide limited taxonomic resolution and are nearly as uninformative as the "unclassified" reads. While numerous misclassifications occurred with Kaiju, they were predominantly at very low relative abundances. Most of the species misclassified by Kaiju belonged to genera actually featured in the mock community, meaning that the misclassifications were taxonomically close to the expected assignment. The advantages conferred by Kaiju may stem from its in silico translation approach, which mitigates the impact of single nucleotide errors or mutations on the taxonomic classifications. Although protein databases lack non-coding genomic regions, Kaiju is expected to be relatively effective on bacterial and viral genomes, as they predominantly consist of coding sequences [14] [17]. kMetaShot ranked as the second-best classifier in terms of overall efficiency. However, it primarily identified only the most abundant taxa while maintaining a high degree of accuracy, meaning that it obtained high precision at the cost of sensitivity. Its lack of sensitivity may be attributed to its database construction methodology, as it was primarily tested on human-associated environments rather than environmental microbiomes [17]. It is noteworthy that kMetaShot was run on MAGs assembled from short reads; both its sensitivity and precision are therefore expected to increase significantly with long-read sequencing. RiboFrame's estimation of the mock community was almost as precise as Kaiju's, but it exhibited a few misclassifications and overestimated the relative abundance of Flavobacterium, possibly due to the higher copy number of the 16S rRNA gene in its genome compared to other bacteria such as Candidatus Accumulibacter [20]. Kraken2, when used with the SILVA database, produced unreliable results, while RiboFrame successfully utilized the same database with minimal noise.
Notably, RiboFrame had the lowest RAM requirements, confirming its suitability for short-read DNA sequencing analysis when high-performance computers are not available. Kraken2 with the nt core database depicted a community similar to the mock, but its performance was inferior to that of the other tested classifiers. This outcome was unexpected, given that Kraken2 is frequently reported as one of the top-performing classifiers in human and soil microbiome studies [21]. Nevertheless, the effectiveness of Kraken2 was still confirmed, as it provided unique insights, being the only classifier that successfully detected all the true genera. In detail, at a confidence threshold of 0.05 with the nt core database, Kraken2 exhibited over 25% misclassifications but still managed to identify all clades present in the mock, including the T4 phage, albeit with incorrect abundance estimations for Candidatus Accumulibacter, Zoogloea and Candidatus Competibacter (Fig. 1). The observed inaccuracies are likely attributable to the database rather than the classifier itself, as many microbes associated with AS and AGS systems lack reliable reference genomes. In fact, inspecting the Kraken2 nt core database highlights the under-representation of many Candidatus Accumulibacter and Candidatus Competibacter species. This limitation has already been reported for other official Kraken2 databases; for example, Calderón-Franco et al. found that many AGS-related taxa are poorly annotated in the Kraken2 standard database [22].

All classifiers performed poorly when applied to contigs, suggesting suboptimal assembly. Specifically, contig-based classifications resulted in significant underestimation of many clades, including Candidatus Accumulibacter and Candidatus Competibacter, while overestimating others such as Novosphingobium. Nevertheless, the contigs served as the basis for generating MAGs, which were classified with high accuracy using kMetaShot.
Such contrasting outcomes suggest that the information gained by binning contigs into MAGs outweighed the noise introduced by the contig assembly. The most accurate MEGAHIT assembly setting proved to be the "metalarge" mode, albeit with only a marginal improvement. As anticipated, lowering the confidence threshold increased the error rate with every classifier. However, the trade-off between reducing noise and losing valuable information was not favourable. For instance, at higher confidence thresholds Kraken2 began to miss key species, suggesting that the optimal range for its classification lies between 0.05 and 0.3 for AS- and AGS-related environments. Similar trends were observed for RiboFrame and kMetaShot. Conversely, Kaiju exhibited minimal changes when increasing stringency (Fig. 2). Thus, increasing the minimal alignment length threshold ("m") beyond 40 in Kaiju is suggested to further reduce its misclassifications without major losses in sensitivity. However, this may also result in a minimal loss of sensitivity for Rotifera, as more than 15,000 proteins known in this clade are shorter than 40 amino acids according to the current UniRef database [23]. The classifier EukDetect exhibited perfect precision but suffered from extremely low sensitivity, as only a few reads were recognised as Diploscapter. Similarly, Kaiju with the nr euk + database detected eukaryotic sequences with good accuracy, albeit missing many. The inclusion of eukaryotic sequences in the classification pipeline was beneficial for every tested classifier. For example, a Kaiju-specific database containing only viral or eukaryotic sequences enhanced the classifier's sensitivity but mostly increased its false positive rate, even classifying many bacteria as eukaryotes. In contrast, Kaiju on nr euk rarely classified Homo or Diploscapter sequences as bacterial or vice versa, and was even more precise when using nr euk +.
Similarly, Kraken2 classified nearly all human-derived reads into the correct broad clade when using its complete database. Conversely, a broad yet incomplete or unfocused database may result in a significant loss of information, as classifiers are more likely to assign reads to clades that are not actually present in the sampled environment. This limitation was observed when Kraken2 was used with the custom eukaryote database, leading to numerous false positives, such as misclassifications of bacteria as insects. It is important to emphasize that the list of eukaryotes included in this custom database should not be considered exhaustive of the possible eukaryotic contaminants in wastewater, but rather a benchmark for potential misclassifications of microbial sequences. On the other hand, Kraken2 demonstrated superior accuracy in distinguishing human reads from bacterial sequences when using only the GRCh38 database at maximum confidence. Although Bowtie2 achieved significantly greater sensitivity in identifying Homo reads in our simulations, it also mistook more microbial reads for human ones than Kraken2 used at 0.99 confidence. Decontamination prior to classification would further reduce false positive classifications, as Homo sapiens reads are often mistaken for Plasmodium ovale in our simulated scenario. The likelihood of Homo sapiens DNA misclassification has already been reported in the literature; for example, Marcelino, Holmes and Sorrell highlighted the illogical inference of reptiles from human gut DNA samples [13]. However, given the low misclassification rate of Homo sapiens reads with Kaiju (nr euk + database, stringent settings), decontamination should be carefully considered to avoid losing valuable microbial reads that are falsely flagged as Homo.
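The trade-off discussed above (a more sensitive human-read detector that also flags more microbial reads, versus a stricter one that misses some human reads) can be made concrete with a minimal Python sketch. The labels below are toy values invented purely for illustration, not results from this benchmark, and the function name `evaluate_human_detection` is a hypothetical helper rather than part of any of the tested programs.

```python
# Minimal sketch, assuming per-read truth labels are known (as in a simulated
# mock). All values below are toy data, not results from this study.

def evaluate_human_detection(true_is_human, flagged_as_human):
    """Count true/false positives and derive sensitivity and precision."""
    if len(true_is_human) != len(flagged_as_human):
        raise ValueError("label lists must have the same length")
    tp = fp = fn = 0
    for truth, flagged in zip(true_is_human, flagged_as_human):
        if truth and flagged:
            tp += 1          # human read correctly flagged ("Homo as Homo")
        elif flagged:
            fp += 1          # microbial read wrongly flagged ("Microbes as Homo")
        elif truth:
            fn += 1          # human read that escaped decontamination
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {
        "true_positives": tp,
        "false_positives": fp,
        "sensitivity": sensitivity,
        "precision": precision,
    }


# Toy scenario: 4 human reads followed by 6 microbial reads.
truth = [True] * 4 + [False] * 6
sensitive_tool = [True] * 4 + [True, True] + [False] * 4   # finds all, over-flags
strict_tool = [True, True, False, False] + [False] * 6     # misses some, no over-flagging

print("sensitive:", evaluate_human_detection(truth, sensitive_tool))
print("strict:   ", evaluate_human_detection(truth, strict_tool))
```

In a real run, the number of microbial reads removed by mistake (the false positives) is the quantity to weigh against the residual human reads, since bacterial reads are usually far more numerous than contaminant reads.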
Consequently, the optimal strategy may depend on sequencing depth (as bacterial reads are typically more abundant than animal- or plant-derived contaminants) and on the nature of the influent feeding the reactor (i.e. real or synthetic wastewater). In real wastewater influent scenarios, particularly those originating from domestic sources, in silico decontamination of human reads using Kraken2 with a focused database at a high confidence threshold may be a viable strategy. Overall, the results highlighted the risks of placing blind trust in classification outputs, particularly when interpreting low-abundance taxa. For instance, rare Dechloromonas reads were erroneously classified as Gallus gallus despite the application of high stringency thresholds, and fungal taxa were inferred despite their absence from the simulated community. While the former misclassification might be reasonably disregarded in practical scenarios due to its implausibility, the latter could misleadingly suggest the presence of fungi in the reactor. Due to the intentionally simplified design of the simulated mock community, it is not possible to define abundance thresholds from this benchmark. Nonetheless, the application of filtering thresholds previously proposed in the literature, such as 0.005% at the species level [21, 24] (calculated including unclassified reads in the total), is still suggested.

Methods

Mock generation

Reference genomes of 14 bacterial species frequently observed in AGS and activated sludge microbial communities were downloaded from NCBI RefSeq using NCBI Datasets v16.22.0. Additionally, genomes of Candidatus Moranbacteria and Solirubrobacterales bacterium 67-14 [25, 26], which have also been reported in AGS and activated sludge studies, were retrieved from GenBank, as these genera lack official reference genomes. To incorporate microbial eukaryotes and bacteriophages, the reference genomes of Diploscapter spp., Paramecium spp. and a T4 bacteriophage species were also included.
Notably, Paramecium spp. were selected as they are the only ciliates with reference genomes available in the NCBI RefSeq database, whereas members of the Vorticellaceae family currently lack genomic data in both RefSeq and GenBank. Furthermore, the reference genome of Homo sapiens was downloaded to account for potential traces of eukaryotic DNA originating from reactor influents. In total, genomes from 20 taxa (16 bacteria, 3 eukaryotes, and 1 virus) were collected. The full list of selected genera is provided in Table 1.

Table 1. List of genera featured in the mock. The column "Synonym" indicates taxon names as listed in the NCBI database when they differ from those in other databases (e.g. SILVA). The "Read Counts" column presents the raw number of paired reads assigned to each clade in the mock dataset, while "Read Percentages" represents their relative abundance.

| Domain | Genus | Synonym | Read Counts | Read Percentages (%) |
|---|---|---|---|---|
| Bacteria | Candidatus Accumulibacter | | 7500159 | 15 |
| Bacteria | Candidatus Competibacter | | 7499920 | 15 |
| Bacteria | Thauera | | 6000210 | 12 |
| Bacteria | Flavobacterium | | 4000000 | 8 |
| Bacteria | Candidatus Moranbacteria | Candidatus Moraniibacteriota | 3999999 | 8 |
| Bacteria | Dechloromonas | | 2500038 | 5 |
| Bacteria | Nitrosomonas | | 2500189 | 5 |
| Bacteria | Zoogloea | | 1999647 | 4 |
| Bacteria | Propionivibrio | Candidatus Propionivibrio (at species level) | 1999980 | 4 |
| Bacteria | Novosphingobium | | 2000782 | 4 |
| Bacteria | Tetrasphaera | Nostocoides | 1999944 | 4 |
| Bacteria | Azoarcus | | 1999998 | 4 |
| Bacteria | Nitrobacter | | 1999951 | 4 |
| Bacteria | Delftia | | 1999998 | 4 |
| Bacteria | 67-14 | Solirubrobacterales bacterium 67-14 | 500004 | 1 |
| Bacteria | Halomonas | | 500688 | 1 |
| none | phage T4 | Tequatrovirus | 250000 | 0.5 |
| Eukaryota | Homo | | 250275 | 0.5 |
| Eukaryota | Diploscapter | | 249997 | 0.5 |
| Eukaryota | Paramecium | | 249980 | 0.5 |

Simulated untargeted sequencing of these genomes was performed using InSilicoSeq (ISS) v2.01 [27], emulating sequencing via Illumina NovaSeq. This resulted in 150 bp paired-end reads at a total depth of 50 million paired reads.
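As a rough check of this design, the following Python sketch converts a few of the Table 1 relative abundances into the paired-read counts they imply at a depth of 50 million and compares them with the realized counts reported in the table. Only a subset of rows is shown, and the names `TABLE1_SUBSET` and `expected_pairs` are illustrative assumptions rather than part of the original pipeline.

```python
# Illustrative sketch: expected vs. realized paired-read counts for a few
# Table 1 genera. Values are copied from Table 1; helper names are hypothetical.

TOTAL_PAIRS = 50_000_000  # total simulated depth, in paired reads

# genus -> (target percentage, realized paired-read count from Table 1)
TABLE1_SUBSET = {
    "Candidatus Accumulibacter": (15.0, 7_500_159),
    "Thauera": (12.0, 6_000_210),
    "Zoogloea": (4.0, 1_999_647),
    "Homo": (0.5, 250_275),
}


def expected_pairs(percentage: float, total: int = TOTAL_PAIRS) -> int:
    """Number of paired reads implied by a target relative abundance (%)."""
    return round(total * percentage / 100)


for genus, (pct, realized) in TABLE1_SUBSET.items():
    target = expected_pairs(pct)
    # Relative deviation of the realized count from the target, in percent.
    deviation = (realized - target) / target * 100
    print(f"{genus}: target={target}, realized={realized}, "
          f"deviation={deviation:+.4f}%")
```

The small deviations between target and realized counts are consistent with the simulator drawing reads stochastically around the requested abundances.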
The seed 1994 was employed to ensure the full reproducibility of the results (see Data availability). The sequencing simulation was designed to generate precise relative abundances for each taxon, as detailed in Table 1. Notably, the simulated mock community comprises a few predominant bacterial taxa, with other taxa present at lower abundances, thereby reflecting realistic microbial community structures. In particular, Candidatus Accumulibacter and Candidatus Competibacter were the most abundant bacteria, leading to an abundance distribution resembling AGS communities more than AS communities [2]. For the sake of readability, the genera featured in the mock will be referred to as "true genera" in this paper.

Processing and classifying the mock reads

The mock reads were filtered using BBDuk (a module of the BBTools suite, version 39.06) [28] to remove reads derived from Illumina adapters or phiX, very low-complexity sequences with an entropy value below 0.01 ("entropy = 0.01"), 3' end regions with a Q-score lower than 20 ("qtrim = r", "trimq = 20") and reads shorter than 100 bp ("minlen = 100"), while taking into account the paired-end nature of the sequencing ("tpe", "tpo"). This pre-processing step was intentionally disregarded when comparing the estimated abundances with the known original ones, in order to incorporate actual sequencing biases into this benchmark. The filtered reads were classified as such or after being assembled into contigs or metagenome-assembled genomes (MAGs). Contigs were assembled using MEGAHIT v1.2.9 [29] under three different settings: "default", "meta-large" and "custom" (the latter employing a k-mer list of 35, 57, 79, 99, as used in the kMetaShot study [17]). MAGs were subsequently reconstructed from the contigs for each setting using MetaBAT v2.17 [30].
The MAG assembly followed the same settings as in the kMetaShot study [17] to ensure full compatibility with this classifier, as the MAG identification was tested exclusively with kMetaShot. The classification was carried out with the widely used Kraken2 v2.1.2 [15] at various confidence levels, Kaiju v1.10 [14] with different E-value and minimal coverage thresholds, RiboFrame v1.0 [16] with different confidence thresholds applied to both the full-length 16S rDNA and its V3-V4 hypervariable region featured among the reads, and kMetaShot v1.0 [17] with multiple confidence levels. Bracken 2.7.0 [31] was used to re-estimate the abundances of the taxa identified by Kraken2 according to their genome length and the sequenced read length. In particular, classification was conducted at the read level with RiboFrame, Kaiju, and Kraken2, at the contig level with Kaiju, Kraken2, and kMetaShot, and at the MAG level with kMetaShot. The contig-level analysis with kMetaShot was conducted for each MEGAHIT setting described above, whereas classifications with Kaiju and Kraken2 (on the "nt core" database) were performed exclusively on contigs generated with the MEGAHIT "metalarge" option, to avoid unnecessarily convoluted comparisons between the various combinations of settings. Moreover, confidence thresholds were not used when classifying at the contig level with kMetaShot, as almost every related confidence score was near zero. Kraken2 was used with both the "nt core" (built on December 28, 2024) and SILVA 138 official databases. Kaiju was tested with the "nr euk" and "nr euk plus" (referred to as "nr euk +") databases. The nr euk database, built in October 2023, is the most recent official distribution including bacteria, archaea, viruses, protozoa and fungi.
In contrast, nr euk + is a customized version of this database, built with the most recent NCBI nr available (April 2024) and expanded to incorporate nr sequences from Platyhelminthes, Nematoda, Amoeba, Rotifera, Tardigrada, and Homo sapiens. RiboFrame relied on the RDP classifier retrained on SILVA SSU 138. kMetaShot employed its own database, downloaded in February 2025. Clades without an official genus name in NCBI (e.g. Candidatus Moranbacteria) were obtained from the species classifications and added to the genus-level outputs to reduce database biases in the genus-level comparison. A comprehensive list of all program, parameter, and database combinations used in this analysis is provided in Supplementary table 1.

Comparison between classifiers outcomes

The estimated microbial abundances in the mock datasets were compared across different settings using R v4.3, with the packages vegan v2.6.4 [32] and ecodist v2.1.3 [33]. Data visualization was performed using ggplot2 v3.4.4, ggvenn v0.1.10, and ggh4x v0.2.7. Synonymies across the employed databases were manually resolved, at least for the known genera included in the mock and the most abundant misclassifications, through careful searches in the List of Prokaryotic names with Standing in Nomenclature (LPSN) database [34]. Importantly, the unclassified reads were not included in the computation of percent abundances; hence, the analysis focused on the classifier-specific classifications. Principal Coordinate Analysis (PCoA) was conducted using the Hellinger distance, i.e. the Euclidean distance applied to Hellinger-transformed abundances, to account for the sparse and compositional nature of the data [35]. Additionally, the Bray-Curtis dissimilarity index, applied to proportional data, was used as an alternative ecological measure to ensure that the PCoA-related conclusions were not influenced by the choice of ecological distance.
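For readers less familiar with these two measures, the minimal Python sketch below computes both for a pair of abundance profiles. It follows the definition given above (Euclidean distance between square-root-transformed proportions, without the optional 1/√2 scaling used in some texts) and applies Bray-Curtis to proportions. The analysis itself was run in R with vegan and ecodist; the function names and the toy profiles here are illustrative assumptions only.

```python
import math

# Minimal sketch of the two dissimilarities described in the text.
# Function names and toy profiles are hypothetical, for illustration only.


def to_proportions(counts):
    """Convert raw counts to relative abundances summing to 1."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("profile must contain at least one read")
    return [c / total for c in counts]


def hellinger_distance(counts_a, counts_b):
    """Euclidean distance between Hellinger-transformed profiles.

    The Hellinger transformation takes the square root of each proportion.
    With this definition the distance ranges from 0 to sqrt(2).
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("profiles must list the same taxa in the same order")
    pa, pb = to_proportions(counts_a), to_proportions(counts_b)
    return math.sqrt(sum((math.sqrt(x) - math.sqrt(y)) ** 2
                         for x, y in zip(pa, pb)))


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity computed on proportional data (0 to 1)."""
    if len(counts_a) != len(counts_b):
        raise ValueError("profiles must list the same taxa in the same order")
    pa, pb = to_proportions(counts_a), to_proportions(counts_b)
    numerator = sum(abs(x - y) for x, y in zip(pa, pb))
    denominator = sum(x + y for x, y in zip(pa, pb))
    return numerator / denominator


# Toy profiles over the same three genera (last one is a misclassified genus).
mock_profile = [15, 15, 0]
classifier_profile = [12, 14, 4]
print("Hellinger:  ", hellinger_distance(mock_profile, classifier_profile))
print("Bray-Curtis:", bray_curtis(mock_profile, classifier_profile))
```

A pairwise matrix of such distances across all classifier settings is what an ordination such as PCoA would take as input; the square-root step reduces the weight of the most abundant taxa relative to a plain Euclidean distance on proportions.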
The most abundant misclassifications for each setting were identified by computing the average abundances of taxa that were inferred by the classifiers but not actually present in the mock. All the analyses were primarily conducted at the genus level across all described settings, with additional species-level insights obtained by comparing Kaiju outputs at the read level (using both databases with settings E = 0.00001 and m = 42), Kraken2 at the read level (using the nt core database with confidence thresholds of 0.15 and 0.99) and kMetaShot at the MAG level (after contig assembly through MEGAHIT with the "metalarge" option). These programs and settings were specifically chosen for the species-level comparison as they are theoretically capable of providing such taxonomic detail and due to their generally accurate performance at the genus level.

Focus on non-prokaryotes derived reads

In addition to the listed software and parameter combinations used for classifying the bulk community, additional analyses were conducted to specifically assess potential misclassifications of non-prokaryotic reads. Read-level classification was performed using Kaiju v1.10 with the pre-built viral sequence database from RefSeq to further investigate the false negative classifications of this clade observed with the full database. Additionally, Kaiju was executed with a custom database constructed by selecting only the sequences of metazoans commonly found in activated sludge (i.e. Rotifers, Platyhelminths, Nematodes, Amoebae and Tardigrades) from the UniRef100 protein database. Furthermore, EukDetect v1.3 [36] was applied to unfiltered reads using its default database, EukDetect database v9, which has included lower metazoans since recent releases. Finally, additional attention was paid to misclassifications of Homo sapiens reads as bacteria.
The identification of Homo sapiens reads (as an optional decontamination step prior to the actual microbial classification) was performed with Kraken2, using both the GRCh38 reference genome [37] and a custom database at various confidence levels, and with Bowtie2 [19] in paired-end mode with the "very-sensitive" option, using both local and end-to-end alignment. The custom database used in Kraken2 was constructed from the reference genomes of various higher eukaryotes whose residual DNA fragments are likely to be present in the wastewater feeding AGS and AS reactors, including Hexapoda, Annelida, Chlorophyta, plants (Kraken2 reference sequences), Homo sapiens and Mus musculus.

Declarations

Data availability

The simulated mock community raw FASTQ files are publicly available on NCBI SRA under the accession code PRJNA1252002. The resulting counts for each classifier, the R data containing the feature table ready for the analysis and the scripts are available at https://github.com/LeandroD94/Papers/tree/main/2025_Benchmark_DNAseq_classifiers_AGS_and_AS.

Author contributions

D.L.: design, analysis, writing. R.M.: supervision, review and editing.

References

1. Robles, Á. et al. New frontiers from removal to recycling of nitrogen and phosphorus from wastewater in the Circular Economy. Bioresour. Technol. 300, 122673. https://doi.org/10.1016/j.biortech.2019.122673 (2020).
2. Campo, R. et al. Efficient carbon, nitrogen and phosphorus removal from low C/N real domestic wastewater with aerobic granular sludge. Bioresour. Technol. 305, 122961. https://doi.org/10.1016/j.biortech.2020.122961 (2020).
3. Zhang, Y. et al. A review of the phosphorus removal of polyphosphate-accumulating organisms in natural and engineered systems. Sci. Total Environ. 912, 169103. https://doi.org/10.1016/j.scitotenv.2023.169103 (2024).
4. Winkler, M. K. H. et al. An integrative review of granular sludge for the biological removal of nutrients and recalcitrant organic matter from wastewater. Chem. Eng. J. 336, 489–502. https://doi.org/10.1016/j.cej.2017.12.026 (2018).
5. Su, J. F., Li, G. Q., Huang, T. L. & Xue, L. The mixotrophic denitrification characteristics of Zoogloea sp. L2 accelerated by the redox mediator of 2-hydroxy-1,4-naphthoquinone. Bioresour. Technol. 311, 123533. https://doi.org/10.1016/j.biortech.2020.123533 (2020).
6. Zhang, M., Li, A., Yao, Q., Xiao, B. & Zhu, H. Pseudomonas oligotrophica sp. nov., a Novel Denitrifying Bacterium Possessing Nitrogen Removal Capability Under Low Carbon–Nitrogen Ratio Condition. Front. Microbiol. 13. https://doi.org/10.3389/fmicb.2022.882890 (2022).
7. Ye, J. et al. Denitrifying communities enriched with mixed nitrogen oxides preferentially reduce N2O under conditions of electron competition in wastewater. Chem. Eng. J. 498, 155292. https://doi.org/10.1016/j.cej.2024.155292 (2024).
8. Wilén, B. M., Liébana, R., Persson, F., Modin, O. & Hermansson, M. The mechanisms of granulation of activated sludge in wastewater treatment, its optimization, and impact on effluent quality. Appl. Microbiol. Biotechnol. 102, 5005–5020. https://doi.org/10.1007/s00253-018-8990-9 (2018).
9. Ekholm, J. et al. Microbiome structure and function in parallel full-scale aerobic granular sludge and activated sludge processes. Appl. Microbiol. Biotechnol. 108. https://doi.org/10.1007/s00253-024-13165-8 (2024).
10. Falcioni, S. et al. in Resource Recovery from Wastewater Treatment (eds Giorgio Mannina, Alida Cosenza & Antonio Mineo) 140–146 (Springer Nature Switzerland).
11. Adewale, B. A. Will long-read sequencing technologies replace short-read sequencing technologies in the next 10 years? Afr. J. Lab. Med. 9. https://doi.org/10.4102/ajlm.v9i1.1340 (2020).
12. Bush, S. J., Connor, T. R., Peto, T. E. A., Crook, D. W. & Walker, A. S. Evaluation of methods for detecting human reads in microbial sequencing datasets. Microb. Genomics 6. https://doi.org/10.1099/mgen.0.000393 (2020).
13. Chorlton, S. D. Ten common issues with reference sequence databases and how to mitigate them. Front. Bioinf. 4, 1278228. https://doi.org/10.3389/fbinf.2024.1278228 (2024).
14. Menzel, P., Ng, K. L. & Krogh, A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun. 7, 11257. https://doi.org/10.1038/ncomms11257 (2016).
15. Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 257. https://doi.org/10.1186/s13059-019-1891-0 (2019).
16. Ramazzotti, M., Berná, L., Donati, C. & Cavalieri, D. riboFrame: An Improved Method for Microbial Taxonomy Profiling from Non-Targeted Metagenomics. Front. Genet. 6, 329. https://doi.org/10.3389/fgene.2015.00329 (2015).
17. Defazio, G., Tangaro, M. A., Pesole, G. & Fosso, B. kMetaShot: a fast and reliable taxonomy classifier for metagenome-assembled genomes. Brief. Bioinform. 26. https://doi.org/10.1093/bib/bbae680 (2025).
18. Di Gloria, L. et al. Experimental tests challenge the evidence of a healthy human blood microbiome. FEBS J. 292, 796–808. https://doi.org/10.1111/febs.17362 (2025).
19. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359. https://doi.org/10.1038/nmeth.1923 (2012).
20. Dueholm, M. K. D. et al. MiDAS 5: Global diversity of bacteria and archaea in anaerobic digesters. Nat. Commun. 15, 5361. https://doi.org/10.1038/s41467-024-49641-y (2024).
21. Edwin, N. R., Fitzpatrick, A. H., Brennan, F., Abram, F. & O’Sullivan, O. An in-depth evaluation of metagenomic classifiers for soil microbiomes. Environ. Microbiome 19, 19. https://doi.org/10.1186/s40793-024-00561-w (2024).
22. Calderón-Franco, D. et al. Metagenomic profiling and transfer dynamics of antibiotic resistance determinants in a full-scale granular sludge wastewater treatment plant. Water Res. 219, 118571. https://doi.org/10.1016/j.watres.2022.118571 (2022).
23. The UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2025. Nucleic Acids Res. 53, D609–D617. https://doi.org/10.1093/nar/gkae1010 (2025).
24. Amos, G. C. A. et al. Developing standards for the microbiome field. Microbiome 8, 98. https://doi.org/10.1186/s40168-020-00856-3 (2020).
25. Gu, Y., Li, B., Zhong, X., Liu, C. & Ma, B. Bacterial Community Composition and Function in a Tropical Municipal Wastewater Treatment Plant. 14, 1537 (2022).
26. Xin, Z., Yang, L. & Yang, L. Divergences of granules and flocs microbial communities and contributions to nitrogen removal under varied carbon to nitrogen ratios. Bioresour. Technol. 425, 132226. https://doi.org/10.1016/j.biortech.2025.132226 (2025).
27. Gourlé, H., Karlsson-Lindsjö, O., Hayer, J. & Bongcam-Rudloff, E. Simulating Illumina metagenomic data with InSilicoSeq. Bioinformatics 35, 521–522. https://doi.org/10.1093/bioinformatics/bty630 (2018).
28. Bushnell, B., Rood, J. & Singer, E. BBMerge – Accurate paired shotgun read merging via overlap. PLoS ONE 12, e0185056. https://doi.org/10.1371/journal.pone.0185056 (2017).
29. Li, D., Liu, C. M., Luo, R., Sadakane, K. & Lam, T. W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31, 1674–1676. https://doi.org/10.1093/bioinformatics/btv033 (2015).
30. Kang, D. D. et al. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, e7359. https://doi.org/10.7717/peerj.7359 (2019).
31. Lu, J., Breitwieser, F. P., Thielen, P. & Salzberg, S. L. Bracken: Estimating species abundance in metagenomics data. bioRxiv 051813. https://doi.org/10.1101/051813 (2016).
32. Vegan Community Ecology Package (2017).
33. Goslee, S. C. & Urban, D. L. The ecodist Package for Dissimilarity-based Analysis of Ecological Data. J. Stat. Softw. 22, 1–19. https://doi.org/10.18637/jss.v022.i07 (2007).
34. Parte, A. C., Sardà Carbasse, J., Meier-Kolthoff, J. P., Reimer, L. C. & Göker, M. List of Prokaryotic names with Standing in Nomenclature (LPSN) moves to the DSMZ. Int. J. Syst. Evol. Microbiol. 70, 5607–5612. https://doi.org/10.1099/ijsem.0.004332 (2020).
35. Legendre, P. & Legendre, L. Chapter 7 – Ecological resemblance. Developments in Environmental Modelling 24, 265–335 (2012).
36. Lind, A. L. & Pollard, K. S. Accurate and sensitive detection of microbial eukaryotes from whole metagenome shotgun sequencing. Microbiome 9, 58. https://doi.org/10.1186/s40168-021-01015-y (2021).
37. Schneider, V. A. et al.
Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. bioRxiv 072116. https://doi.org/10.1101/072116 (2016).

Additional Declarations

No competing interests reported.

Supplementary Files

supplementarytable1.xlsx, supplementarytable2.csv, supplementarytable3.csv, supplementaryfigure2.png, supplementaryfigure1.png, supplementaryfigure3.png, Supplementarylegends.docx

Editorial decision: Revision requested 03 Jun, 2025. Reviews received at journal 31 May, 2025. Reviews received at journal 18 May, 2025. Reviewers agreed at journal 08 May, 2025. Reviewers agreed at journal 07 May, 2025. Reviewers invited by journal 05 May, 2025. Editor assigned by journal 24 Apr, 2025. Editor invited by journal 21 Apr, 2025. Submission checks completed at journal 19 Apr, 2025. First submitted to journal 19 Apr, 2025.
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-6485360","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":453429230,"identity":"ff6c1b5c-084e-4aab-b791-09e06713263d","order_by":0,"name":"Leandro Gloria","email":"","orcid":"","institution":"University of Florence","correspondingAuthor":false,"prefix":"","firstName":"Leandro","middleName":"","lastName":"Gloria","suffix":""},{"id":453429231,"identity":"e6c38142-59ac-4ce1-a525-040e3fd9b1fa","order_by":1,"name":"Matteo Ramazzotti","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABBUlEQVRIiWNgGAWjYEgANvYGBoYEGwkGNpxKmBkbkHgGEmw8B4Ba0oBacOpB18IgkQCk00D2Ydeg237++IMPFQz5/NMOP/74o+JPHZ/kG7MHDxIsGPjkG7BqMTuTzNg44wyD5YzbaQYGEmeADpPOMTdISMDtMLMDyYzNvG0MBgy3EwwSDNvAWswkEn/g0XL+MWPzX6AW+dvpHw4kgrRInjGTwGvLDaAtjEAtBrdzDBsOgrRI8BDS8thwZs8ZCQPD2znFjA1njCXbeNLKQFp42NgScDgs8cGHHxU2BnK30zcDQ0yOX7798DbJHwl1cvLNB7BbAwESmEI8+NSPglEwCkbBKMAPAE/iUc3rDwKRAAAAAElFTkSuQmCC","orcid":"","institution":"University of Florence","correspondingAuthor":true,"prefix":"","firstName":"Matteo","middleName":"","lastName":"Ramazzotti","suffix":""}],"badges":[],"createdAt":"2025-04-19 
14:38:24","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-6485360/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-6485360/v1","draftVersion":[],"editorialEvents":[{"content":"https://doi.org/10.1038/s41598-025-07734-8","type":"published","date":"2025-07-05T15:58:41+00:00"}],"editorialNote":"","failedWorkflow":false,"files":[{"id":82259634,"identity":"867a50bc-bf93-419a-8346-a69ecd433622","added_by":"auto","created_at":"2025-05-08 11:49:16","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":225509,"visible":true,"origin":"","legend":"\u003cp\u003eBar plots depicting the relative abundances of the genera present in the mock community as estimated by various programs and parameter settings. The column “T” displays the true abundances of the clades in the mock. Genera inferred but not actually included in the mock are categorized as “misclassified”. The x-axis represents the classification types: “E” denotes the E-value threshold, “m” indicates the coverage threshold, and “c” represents the confidence level of the classification, depending on the classifier available options. The prefix “contigs” specifies classifications based on contigs rather than individual reads in case of Kaiju and Kraken2. The prefixes “default”, “metalarge” and “custom” refer to the different MEGAHIT assembly settings.\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-6485360/v1/fd2c819cb73b6608c3f01cbb.png"},{"id":82259631,"identity":"d49a2b90-f000-4b80-a048-36548b7e34d9","added_by":"auto","created_at":"2025-05-08 11:49:16","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":168814,"visible":true,"origin":"","legend":"\u003cp\u003ePCoA plot illustrating the similarity between classification profiles based on Hellinger distance. 
Colors indicate the program, database, and classification level (read level by default, with additional specifications for contigs or MAGs level classifications where applicable). Labels on each point denote the specific settings used for the corresponding classification. In detail: “E” denotes the E-value threshold, “m” indicates the coverage threshold, and “c” represents the confidence level of the classification, depending on the classifier available options. The prefix “contigs” specifies classifications based on contigs rather than individual reads in case of Kaiju and Kraken2. The prefixes “default”, “large” (metalarge) and “custom” refer to the different MEGAHIT assembly settings.\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-6485360/v1/49d0a13e164180b0a2f9de00.png"},{"id":82260755,"identity":"ceb5f8e4-6504-48d1-9bec-1b65c4e8f4c0","added_by":"auto","created_at":"2025-05-08 11:57:16","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":225797,"visible":true,"origin":"","legend":"\u003cp\u003eBar plots illustrating the relative abundances of the most abundant misclassified genera across classifications obtained using various programs and parameter settings. The displayed relative abundances account for the correct classification counts (not shown in this plot), thereby representing the total extent of misclassifications. The x-axis represents the classification types: “E” denotes the E-value threshold, “m” indicates the coverage threshold, and “c” represents the confidence level of the classification, depending on the classifier available options. The prefix “contigs” specifies classifications based on contigs rather than individual reads in case of Kaiju and Kraken2. 
The prefixes “default”, “metalarge” and “custom” refer to the different MEGAHIT assembly settings.\u003c/p\u003e","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-6485360/v1/d373f23761c0fcd40b9cf7b1.png"},{"id":82259639,"identity":"354ccf34-5a5d-4395-b3be-9234fa8880a2","added_by":"auto","created_at":"2025-05-08 11:49:16","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":138864,"visible":true,"origin":"","legend":"\u003cp\u003eBar plots depicting the relative abundances of the species present in the mock community as estimated by various programs and parameter settings. The column “True” displays the correct abundances of the clades in the mock. The less abundant true species are clustered as a unique observation defined “Others”. Species inferred but not actually included in the mock are categorized as “misclassified”. The x-axis represents the classification types: “E” denotes the E-value threshold, “m” indicates the coverage threshold and “c” represents the confidence level of the classification, depending on the classifier available options. The prefix “contigs” specifies classifications based on contigs rather than individual reads in case of Kaiju and Kraken2. 
The prefixes “default”, “metalarge” and “custom” refer to the different MEGAHIT assembly settings.\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-6485360/v1/83066725fa5aa5563adfe1f2.png"},{"id":82259638,"identity":"bf80c848-ef2f-4070-a1df-698e9d84f36e","added_by":"auto","created_at":"2025-05-08 11:49:16","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":124068,"visible":true,"origin":"","legend":"\u003cp\u003eVenn diagrams illustrating the species present in the mock dataset (“True” observation group) and those identified by Kaiju (E=0.00001 and m=42, nr euk + database), Kraken2 (confidence level = 0.99, nt core database), and kMetaShot (confidence = 0.2, executed on MAGs assembled from contigs generated by MEGAHIT with the “metalarge” option). Panel A highlights the comparison between the mock, Kaiju, and kMetaShot observations, while Panel B focuses on the comparison between the mock, Kaiju, and Kraken2 observations.\u003c/p\u003e","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-6485360/v1/10eae9d2982957c9349ec713.png"},{"id":82260756,"identity":"dda27bab-3684-48cf-9916-88df1e1e8a21","added_by":"auto","created_at":"2025-05-08 11:57:16","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":193603,"visible":true,"origin":"","legend":"\u003cp\u003eBar plots illustrating the number of false positives (“Microbes as Homo”) and true positives (“Homo as Homo”) in the identification of \u003cem\u003eHomo sapiens\u003c/em\u003e reads by various programs and parameter settings. The observation labelled “Total misclassific” represents the erroneous classification of microbial reads into any eukaryotic clade when the employed database includes taxa beyond \u003cem\u003eHomo sapiens\u003c/em\u003ealone. 
The row labelled “Mock” indicates the actual number of \u003cem\u003eH. sapiens\u003c/em\u003e paired reads present in the mock dataset. The program names, employed databases, and settings are displayed on the left side of the plot, while the X-axis represents the number of paired reads for each observation. The X-axis is magnified at lower values to improve readability.\u003c/p\u003e","description":"","filename":"floatimage6.png","url":"https://assets-eu.researchsquare.com/files/rs-6485360/v1/4dfa2d7068c5cb6833973a25.png"},{"id":86180043,"identity":"2188cf2e-6a6c-45d2-aad5-e16f2665a8e2","added_by":"auto","created_at":"2025-07-07 16:21:03","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1768205,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6485360/v1/ccf637db-cd44-4315-8c08-4eea1bd1da0c.pdf"},{"id":82260752,"identity":"d158ac66-9fa7-4e8e-b9d9-c2038bc805dd","added_by":"auto","created_at":"2025-05-08 11:57:16","extension":"xlsx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":12682,"visible":true,"origin":"","legend":"","description":"","filename":"supplementarytable1.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-6485360/v1/3bb7e404d0d1e22b0205996a.xlsx"},{"id":82260751,"identity":"e428aa36-4a0e-4fc4-ae11-34ea23bf672a","added_by":"auto","created_at":"2025-05-08 11:57:16","extension":"csv","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":5736,"visible":true,"origin":"","legend":"","description":"","filename":"supplementarytable2.csv","url":"https://assets-eu.researchsquare.com/files/rs-6485360/v1/481ff1948d1ee427972b8d83.csv"},{"id":82260753,"identity":"06f5d37f-d426-4c5e-af9f-10ad3386e9a3","added_by":"auto","created_at":"2025-05-08 
11:57:16","extension":"csv","order_by":3,"title":"","display":"","copyAsset":false,"role":"supplement","size":1681,"visible":true,"origin":"","legend":"","description":"","filename":"supplementarytable3.csv","url":"https://assets-eu.researchsquare.com/files/rs-6485360/v1/caf98e79849c239e9e588041.csv"},{"id":82259643,"identity":"51252113-0523-4afd-bc81-919d0dd94da0","added_by":"auto","created_at":"2025-05-08 11:49:16","extension":"png","order_by":4,"title":"","display":"","copyAsset":false,"role":"supplement","size":140139,"visible":true,"origin":"","legend":"","description":"","filename":"supplementaryfigure2.png","url":"https://assets-eu.researchsquare.com/files/rs-6485360/v1/829b3894dfd76755eb4c8f0d.png"},{"id":82260758,"identity":"1c7dfb62-142c-43af-8ef0-854614163942","added_by":"auto","created_at":"2025-05-08 11:57:16","extension":"png","order_by":5,"title":"","display":"","copyAsset":false,"role":"supplement","size":163124,"visible":true,"origin":"","legend":"","description":"","filename":"supplementaryfigure1.png","url":"https://assets-eu.researchsquare.com/files/rs-6485360/v1/047ef99f13d679943307c98e.png"},{"id":82259658,"identity":"abc65d33-1d23-4c67-b5b3-c1f0c5e6eb33","added_by":"auto","created_at":"2025-05-08 11:49:17","extension":"png","order_by":6,"title":"","display":"","copyAsset":false,"role":"supplement","size":141883,"visible":true,"origin":"","legend":"","description":"","filename":"supplementaryfigure3.png","url":"https://assets-eu.researchsquare.com/files/rs-6485360/v1/0696370af7673d8cc3c28d5e.png"},{"id":82259644,"identity":"9f31f19b-9751-411f-9240-22ce8ebbda23","added_by":"auto","created_at":"2025-05-08 11:49:16","extension":"docx","order_by":7,"title":"","display":"","copyAsset":false,"role":"supplement","size":12734,"visible":true,"origin":"","legend":"","description":"","filename":"Supplementarylegends.docx","url":"https://assets-eu.researchsquare.com/files/rs-6485360/v1/e09ac4c78efadc1c6245beee.docx"}],"financialInterests":"No competing 
interests reported.","formattedTitle":"Testing the limits of short-reads metagenomic classifications programs in waste water treating microbial communities","fulltext":[{"header":"Introduction","content":"\u003cp\u003eThe benefits derived from the industrial revolutions have fundamentally shaped modern lifestyles, enabling advancements nowadays as fundamental as the ease of reading this very paper. However, a significant downside of industrialization is the increased demand for the removal of carbon (C), nitrogen (N) and phosphorus (P) from civil and industrial wastewaters [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. To counterbalance the excessive production of pollutants by industrialized societies, environmental engineering has developed both artificial and biological strategies to restore ecological equilibrium. Among these, the activated sludge (AS) system and its technological evolution, the aerobic granular sludge (AGS) system, are two biological wastewater treatment methods that accelerate processes that would naturally occur over longer timescales [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. Both AS and AGS rely on the collective metabolic activities of complex microbial communities, primarily composed of prokaryotes. Bacteria such as \u003cem\u003eCandidatus Accumulibacter\u003c/em\u003e and \u003cem\u003eCandidatus Competibacter\u003c/em\u003e, the most studied phosphate-accumulating organisms (PAOs) and glycogen-accumulating organisms (GAOs), have long been considered the primary components of AGS systems [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. However, metagenomic insights have revealed that other PAOs may be better adapted to specific conditions. 
For instance, \u003cem\u003eTetrasphaera\u003c/em\u003e relies on a broad metabolic repertoire, allowing it to thrive in environments with low concentrations of readily biodegradable carbon [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eOther bacterial genera frequently identified in metagenomic surveys include \u003cem\u003eZoogloea\u003c/em\u003e, \u003cem\u003ePseudomonas\u003c/em\u003e, \u003cem\u003eThauera\u003c/em\u003e, and \u003cem\u003eFlavobacterium\u003c/em\u003e [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]. These bacteria are essential for secreting the polysaccharide matrices that embed PAO and GAO populations [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e] and harbor strains capable of denitrification [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e] [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e] [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e]. Furthermore, the AGS granular biomass enables the co-existence of nitrifying bacteria such as \u003cem\u003eNitrosomonas\u003c/em\u003e and denitrifiers in different layers of the same granule [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. 
Beyond bacteria, other organisms such as viruses, protozoa, and lower metazoans (e.g., nematodes and rotifers) play crucial roles in these microbial communities, acting as bacterivores or horizontal gene transfer vectors (viruses), or contributing to biomass structuring through their movements and secretions (animals) [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eGiven this complexity, a comprehensive understanding of AGS microbial communities is essential for optimizing reactor performance, accelerating maturation through targeted microbial augmentation, improving depuration efficiency, and even recovering valuable resources from waste streams [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e] [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]. For instance, various studies aim to enhance the production of poly-hydroxy-alkanoates (PHA), useful for bioplastics production, by adjusting reactor conditions to selectively enrich specific PHA-producing bacterial genera (e.g., \u003cem\u003eCandidatus\u003c/em\u003e Accumulibacter, \u003cem\u003eThauera\u003c/em\u003e and \u003cem\u003eAzoarcus\u003c/em\u003e) [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e] [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e]. However, such optimization first requires characterizing the microbial communities involved, followed by a deeper exploration of their metabolic interactions. Total DNA sequencing is a widely used approach for comprehensive community characterization, involving the sequencing of bulk DNA extracted from reactor biomass followed by bioinformatic ecological analyses. 
Currently, short-read sequencing technologies dominate the market due to their high throughput, cost-effectiveness, and low error rates [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e], making them well-suited for accurately profiling the complexity of these environmental communities.\u003c/p\u003e \u003cp\u003eIn any case, this approach always relies on bioinformatics classification methods that are known to be prone to misclassification errors [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e] [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e], particularly when analysing complex environmental samples. Moreover, most benchmark studies published so far are strongly biased toward human-associated microbiota; although valuable in clinical research, they lack specificity for environmental settings. To support research on AS and AGS microbial communities, we evaluated various classification strategies for short-read sequencing (150 bp), including read-, contig- and MAG-based approaches. To explore different algorithmic approaches, this analysis employed four taxonomic classifiers, namely Kaiju [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e], Kraken2 [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e], RiboFrame [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e] and kMetaShot [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e], using multiple settings and databases. 
These classifiers were chosen for their proven effectiveness in their respective classification methodologies:\u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003eKaiju translates nucleotide sequences into amino acid sequences in all six possible reading frames and performs protein-level matching using the Burrows-Wheeler transform [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e];\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eKraken2 classifies sequences by analysing the frequency of distinctive k-mer patterns (sequence portions of length \"k\") [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e];\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003eRiboFrame extracts putative 16S reads from whole-genome sequencing data and applies k-mer-based Bayesian classification specifically to these reads using a dedicated 16S database [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e];\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003ekMetaShot is a k-mer-based classifier tailored for MAGs, utilizing a custom-built database incorporating reference coding sequences, 16S rRNA and tRNA sequences from NCBI [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e].\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003cp\u003eThe evaluation was conducted using a mock community, designed to provide a simplified yet representative model of the complex microbial ecosystems found in AS and AGS systems. The mock was purposely generated in silico to control the exact clade relative abundances and avoid kitome contaminants [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]. The comparison accounted for taxa that could not be classified because of database limitations and, where possible, also tested custom databases that ensured the presence of relevant AS- and AGS-associated clades. 
Additionally, we assessed the risk of misclassifying higher metazoans as bacteria (and vice versa) and evaluated their removal before classification using two widely used decontamination tools, Kraken2 and Bowtie2 [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e] [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e].\u003c/p\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eMock processing stats\u003c/h2\u003e \u003cp\u003eAfter BBDuk filtering, 46,315,875 out of 50,001,759 paired reads (92.6%) remained available for analysis. Kaiju classified between 80% (m\u0026thinsp;=\u0026thinsp;42) and 90% (m\u0026thinsp;=\u0026thinsp;11) of these sequences using either of its databases, with no variation depending on the E-value when the m parameter was set to 30 or 42. However, about 20% additional sequences were classified as \"cannot be assigned to a (non-viral) genus\" by Kaiju in every setting, which did not add significant insights. Kraken2, when using the nt_core database, exhibited a strong dependency on confidence thresholds: at 0.05 confidence, it classified about 50% of the reads, whereas at the highest confidence threshold tested, the classified read proportion dropped to below 6%. Using Kraken2 with the SILVA database significantly reduced classification rates, with less than 2% of reads classified even at the most lenient thresholds. Despite using the same SILVA database, RiboFrame classified between 3,000 (V3-V4 16S, confidence 0.9) and 70,000 (full-length 16S, confidence 0.8) paired reads across the tested settings. MetaBat2 consistently produced about 48 similar MAGs, regardless of the MEGAHIT assembly settings used. kMetaShot classified almost all MAGs (e.g., 41 out of 46) when no confidence threshold was applied. 
However, classification decreased as the confidence threshold increased: with confidence set to 0.2 kMetaShot classified more than half of the MAGs for each setting while with confidence 0.4 it classified approximately a third of the MAGs. Among the classifiers, RiboFrame was the least demanding in terms of RAM usage, requiring approximately 20 GB. In contrast, Kaiju and Kraken2 each required over 200 GB of RAM. The most memory-intensive approach was kMetaShot, which, when run in a multithreaded mode on MAGs, consumed 24 GB per thread.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eComparison at genus-level classification\u003c/h3\u003e\n\u003cp\u003eThe Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e below represents the relative abundances of the genera actually featured in the mock among the various settings, while representing the percentages of misclassifications. The \u003cb\u003esupplementary table \u003cspan refid=\"MOESM2\" class=\"InternalRef\"\u003e2\u003c/span\u003e\u003c/b\u003e reports the relative abundances of the mock true genera across the settings.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eNotably, the only classifier that did not produce erroneous classifications at the genus level was kMetaShot on MAGs, regardless of the confidence levels and MEGAHIT settings. However, the same performance was not observed at the contig level, where many erroneous classifications and missed true genera were observed. Approximately 25% of the classifications from Kaiju and Kraken2 (using the nt core database) were erroneous, with Kaiju showing less dependence on the settings employed, while Kraken2 was strongly influenced by the confidence level. In fact, the percentage of misclassifications with Kraken2 increased at a confidence level of 0.99, indicating that false negative classifications (missed true genera) were more frequent than correct ones. 
Increasing the Kraken2 confidence level from 0.05 to 0.15 slightly reduced misclassification percentages, although fewer reads from \u003cem\u003eCandidatus\u003c/em\u003e Accumulibacter were identified. It is noteworthy that \u003cem\u003eCandidatus\u003c/em\u003e Competibacter was detected by Kraken2 at the lower confidence levels although just as traces. The true genus abundances inferred by Kaiju closely mirrored the actual mock proportions with both nr euk and nr euk\u0026thinsp;+\u0026thinsp;databases, although a few clades were missing with nr euk. In particular, the ratio between the relative abundances of the four most abundant genera were successfully captured by Kaiju. Both Kraken2 and Kaiju performed better on reads than on contigs. Kraken2 completely missed the true genus abundances when using the SILVA database. On the other hand, RiboFrame demonstrated the lowest percentage of misclassifications (after kMetaShot on MAGs) and captured most of the mock true abundances (after Kaiju) using the same SILVA database, although overestimating the abundance of \u003cem\u003eFlavobacterium\u003c/em\u003e.\u003c/p\u003e \u003cp\u003eThe performance of the classifiers was compared using the Hellinger ecological distance, visualized through the PCoA in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e2\u003c/span\u003e. Kraken2 classifications using the SILVA database were excluded from the analysis, as their estimated profiles exhibited the greatest deviation from the mock community which dominated the overall variability while obscuring differences among the other samples. When Kraken2 was applied with the \u003cem\u003ent core\u003c/em\u003e database, its estimated profile improved but remained different from the mock, particularly when the confidence threshold was increased or when analysis was performed at contigs level. 
The pipelines that most closely resembled the mock were kMetaShot on MAGs (especially with the MEGAHIT setting \"metalarge\"), RiboFrame on full 16S reads (with a confidence level of 0.8) and Kaiju (regardless of settings and database). As expected, RiboFrame exhibited superior performance when applied to the full 16S rRNA gene compared to a single 16S hypervariable region, although the overall classification results remained comparable. Overall, the classifications exhibited greater divergence from the mock profile as classification confidence levels increased. These results were also confirmed when the Bray-Curtis dissimilarity index was applied (\u003cb\u003eSupplementary Fig.\u0026nbsp;1\u003c/b\u003e).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eMost of the misclassifications in Kaiju were due to observations labelled as \"cannot be assigned to a (non-viral) genus\" by the software, summarized as \"As generic virus\" in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e. Excluding these, less than 4% of the reads were misclassified by Kaiju. \u003cem\u003eBradyrhizobium, Pseudomonas, Acinetobacter, Sphingomonas, Stenotrophomonas\u003c/em\u003e and \u003cem\u003eChlamydia\u003c/em\u003e were among the most abundant genera incorrectly inferred by Kaiju but absent in the mock. The total amount of these misclassified genera by Kaiju, reduced to about 2% when the minimal query coverage threshold (\"m\") was set to 42 (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e and \u003cb\u003eSupplementary Fig.\u0026nbsp;2\u003c/b\u003e). Moreover, increasing the stringency of Kaiju did not result in any loss of genera true positive identifications. No significant differences were observed between the overall amount of misclassifications of Kaiju on nr euk and nr euk +. 
Kraken2 when applied with the SILVA database erroneously assigned many reads to \u003cem\u003ePseudomonas\u003c/em\u003e, while Kraken2 with the nt core database continued to misclassify reads as \u003cem\u003eMycobacterium\u003c/em\u003e, even at higher classification confidence levels. Additionally, the misclassifications of Kraken2 on the nt core database were significantly reduced when applied at contigs level, albeit this improvement came at the expense of true positive identifications. On the other hand, kMetaShot applied at the contigs level exhibited the highest frequency of misclassifications after Kraken2 on SILVA. In contrast, RiboFrame and kMetaShot were the classifiers with the fewest misclassifications, with kMetaShot on MAGs showing no misclassified genera.\u003c/p\u003e\n\u003ch3\u003eComparison at species-level classifications\u003c/h3\u003e\n\u003cp\u003eThe Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e4\u003c/span\u003e below represents the relative abundances of the species actually featured in the mock among the various settings, while representing the percentages of misclassifications. The \u003cb\u003esupplementary table \u003cspan refid=\"MOESM3\" class=\"InternalRef\"\u003e3\u003c/span\u003e\u003c/b\u003e reports the relative abundances of the mock true species across the settings.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eNotably, the species distribution estimated by Kaiju closely resembled that of the mock community with both nr euk and nr euk\u0026thinsp;+\u0026thinsp;databases, achieving even greater precision than kMetaShot at this taxonomic level. However, Kaiju still underestimated the relative abundances of few abundant clades, such as \u003cem\u003eTetrasphaera vanveenii, Thauera sinica\u003c/em\u003e and \u003cem\u003eDelftia\u003c/em\u003e spp. 
In contrast, Kraken2 exhibited substantial deviations from the true mock abundances, with the lower confidence threshold increasing sensitivity but leading to almost 50% misclassified species, while the higher threshold effectively reduced misclassifications but missed reads from \u003cem\u003eCandidatus\u003c/em\u003e Accumulibacter and \u003cem\u003eCandidatus\u003c/em\u003e Competibacter.\u003c/p\u003e \u003cp\u003e \u003cem\u003eThauera\u003c/em\u003e spp., \u003cem\u003eNovosphingobium\u003c/em\u003e spp., and \u003cem\u003eFlavobacterium johnsoniae\u003c/em\u003e were among the most frequently misclassified taxa across all settings. When disregarding relative abundances, kMetaShot at the MAG level proved to be the most precise method for taxonomic identification within the community (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e5\u003c/span\u003eA). This result is particularly notable when compared to Kaiju on nr euk +, which reported nearly 1600 erroneous species, and Kraken2 (with the confidence threshold set at 0.99), which reported approximately 600 erroneous species (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e5\u003c/span\u003eB). However, it is important to emphasize that most of Kaiju\u0026rsquo;s misclassifications occurred at very low relative abundances (less than 0.1%), with the exception of \u003cem\u003eThauera\u003c/em\u003e sp. and \u003cem\u003eTetrasphaera\u003c/em\u003e sp., with relative abundances of 1% and 1.5%, respectively. Notably, these misclassified species still belong to clades actually featured in the mock.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFurthermore, Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e5\u003c/span\u003e highlights the varying sensitivities of the classifiers in detecting the true mock species. Kaiju missed only 20 species, followed by Kraken2 with 157 missed species, and lastly, kMetaShot on MAGs. 
In particular, kMetaShot on MAGs exhibited the lowest sensitivity, missing nearly all of the true species. Kaiju, however, failed to detect phage T4 reads even though the phage was included in its database, whereas Kraken2 recognised phage T4 reads, although only at permissive confidence thresholds (lower than 0.85). The 15 taxa featured in the mock but not detected by any of the classifiers were species belonging to the \u003cem\u003eHalomonas\u003c/em\u003e, \u003cem\u003eNovosphingobium, Thauera\u003c/em\u003e and \u003cem\u003eParamecium\u003c/em\u003e genera, although other species of the same genera were identified.\u003c/p\u003e\n\u003ch3\u003eClassification performance on phage T4 and lower metazoans\u003c/h3\u003e\n\u003cp\u003eKaiju successfully identified approximately half of the T4 phage reads when executed using the database containing only viral sequences. However, even under the most stringent settings, over 46\u0026nbsp;million sequences were misclassified as viruses within this focused database. The eukaryote-specific classifier, EukDetect, accurately identified 23 reads of \u003cem\u003eDiploscapter\u003c/em\u003e spp. and did not report any misclassifications after applying its default filtering procedures. However, this high precision came at the cost of a substantial loss in sensitivity, as the majority of eukaryotic sequences remained unclassified. Notably, approximately 300 \u003cem\u003eNovosphingobium aureum\u003c/em\u003e sequences were initially misclassified as the fungus \u003cem\u003eWolfiporia cocos\u003c/em\u003e by the first step of EukDetect, which relies on Bowtie2 alignment against the EukDetect database. Furthermore, running Kaiju on the custom database constructed exclusively with lower metazoan sequences led to excessive false positives. 
In fact, when Kaiju was used with such a focused database, despite successfully identifying \u003cem\u003eParamecium\u003c/em\u003e and \u003cem\u003eDiploscapter\u003c/em\u003e, it also erroneously assigned bacterial and human-derived reads to many other nematodes and rotifers. For instance, reads from nearly every bacterial clade included in this mock were misclassified as \u003cem\u003eSteinernema\u003c/em\u003e, and a substantial number of \u003cem\u003eHomo sapiens\u003c/em\u003e reads were mistakenly assigned to nematodes. Although applying high-stringency settings significantly reduced these false positives, Kaiju's precision on the lower metazoan database remained relatively low. On the other hand, Kaiju and Kraken2 with complete databases performed better in terms of overall sensitivity (\u003cb\u003esupplementary Fig.\u0026nbsp;3\u003c/b\u003e). In fact, Kaiju with nr euk\u0026thinsp;+\u0026thinsp;was able to identify \u003cem\u003eDiploscapter\u003c/em\u003e and \u003cem\u003eHomo\u003c/em\u003e, maintaining the overall proportions between the clades despite underestimating their relative abundance (\u003cb\u003esupplementary Fig.\u0026nbsp;3\u003c/b\u003e). Moreover, using Kaiju with nr euk\u0026thinsp;+\u0026thinsp;avoided the misclassification of \u003cem\u003eDiploscapter\u003c/em\u003e reads as bacteria (observed with nr euk), while conversely only 137 reads of bacterial genera were incorrectly identified as \u003cem\u003eDiploscapter\u003c/em\u003e with the most stringent settings. However, other eukaryotic misclassifications were also observed with Kaiju using the nr euk\u0026thinsp;+\u0026thinsp;database. For instance, a small fraction of \u003cem\u003eNovosphingobium\u003c/em\u003e and \u003cem\u003ePropionivibrio\u003c/em\u003e-derived reads were misclassified as \u003cem\u003eTrichinella\u003c/em\u003e (0.003%). 
Similarly, bacterial and \u003cem\u003ePlasmodium\u003c/em\u003e reads were misidentified as fungi (\u003cem\u003eTermitomyces\u003c/em\u003e 0.003%, \u003cem\u003eWolfiporia\u003c/em\u003e 0.001%). Kraken2 with nt core detected \u003cem\u003eHomo\u003c/em\u003e, \u003cem\u003eDiploscapter\u003c/em\u003e, and phage T4, with only trace amounts of \u003cem\u003eDiploscapter\u003c/em\u003e (0.009%) at the most permissive settings. Notably, phage reads were consistently identified even at a confidence threshold of 0.99, and Kraken2 correctly reported no other viral clades in the mock. However, Kraken2\u0026rsquo;s high sensitivity came at the cost of increased noise, as it misclassified \u003cem\u003eNovosphingobium\u003c/em\u003e and \u003cem\u003eDechloromonas\u003c/em\u003e spp. as \u003cem\u003eWolfiporia\u003c/em\u003e (0.02%) and \u003cem\u003eGallus gallus\u003c/em\u003e (0.1%), respectively, even under the most stringent settings.\u003c/p\u003e \u003cp\u003e \u003cspan type=\"BoldItalicUnderline\" class=\"BoldItalicUnderline\" name=\"Emphasis\"\u003eHomo sapiens\u003c/span\u003e \u003cspan type=\"BoldUnderline\" class=\"BoldUnderline\" name=\"Emphasis\"\u003eread misclassification as bacteria and decontamination test\u003c/span\u003e\u003c/p\u003e \u003cp\u003eKraken2 with the nt core database correctly identified about half of the \u003cem\u003eHomo sapiens\u003c/em\u003e reads when run at low confidence thresholds. Moreover, Kraken2 did not misclassify them as bacteria, assigning them at least to a correct higher clade (e.g., \u003cem\u003eHominidae\u003c/em\u003e, \u003cem\u003eBilateria\u003c/em\u003e, etc.) or, at worst, misclassifying some as monkey-derived reads (e.g., \u003cem\u003eCatarrhini\u003c/em\u003e spp.). In contrast, Kaiju, which performed well overall in the current benchmark, misclassified \u003cem\u003eH. sapiens\u003c/em\u003e reads as bacteria (e.g. 
\u003cem\u003eEnterococcus\u003c/em\u003e, \u003cem\u003eStaphylococcus\u003c/em\u003e, \u003cem\u003ePseudomonas\u003c/em\u003e, \u003cem\u003eKlebsiella pneumoniae\u003c/em\u003e, \u003cem\u003eAcinetobacter baumannii\u003c/em\u003e and \u003cem\u003eEscherichia coli\u003c/em\u003e) when used with the nr euk database. Using nr euk +, which includes \u003cem\u003eHomo sapiens\u003c/em\u003e sequences, allowed Kaiju to correctly identify only a few \u003cem\u003eHomo\u003c/em\u003e reads (less than 10%) but, more importantly, to avoid mistaking them for bacteria. However, Kaiju frequently misidentified \u003cem\u003eH. sapiens\u003c/em\u003e reads as \u003cem\u003ePlasmodium ovale\u003c/em\u003e with both of its databases. These misclassifications were consistent across the different Kaiju settings, although they were significantly reduced with the most stringent parameters (\"E\u0026thinsp;=\u0026thinsp;0.0001 and m\u0026thinsp;=\u0026thinsp;42\") and when using the nr euk\u0026thinsp;+\u0026thinsp;database. While the total number of \u003cem\u003eH. sapiens\u003c/em\u003e reads misclassified by Kaiju was relatively low (10,543 read pairs with nr euk and only 75 with nr euk +, under the most stringent parameters), such errors could lead to incorrect assumptions regarding the presence of certain rare taxa in the community. To address this issue, various decontamination methods for \u003cem\u003eH. sapiens\u003c/em\u003e reads, as well as other likely eukaryotic DNA residuals originating from real wastewater, were tested before microbial community classification (Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e6\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eAmong the tested methods, Bowtie2 demonstrated the highest sensitivity in identifying \u003cem\u003eH. 
sapiens\u003c/em\u003e reads while also reporting a relatively low number of false positives (i.e., microbial reads misclassified as \u003cem\u003eHomo sapiens\u003c/em\u003e), particularly when used with end-to-end alignments, which misidentified about 5,000 paired reads as \u003cem\u003eHomo\u003c/em\u003e. Specifically, Bowtie2's false positives primarily consisted of misclassified \u003cem\u003ePropionivibrio, Novosphingobium, Paramecium\u003c/em\u003e and \u003cem\u003eDechloromonas\u003c/em\u003e reads. Although slightly less effective, Kraken2 on the GRCh38 human database with a confidence threshold of 0.45 showed comparable performance. Kraken2 surpassed Bowtie2 in precision when the confidence threshold was increased. In fact, Kraken2 misclassified only around 200 reads, mainly from \u003cem\u003eNovosphingobium\u003c/em\u003e and \u003cem\u003eDechloromonas\u003c/em\u003e, as \u003cem\u003eH. sapiens\u003c/em\u003e at a confidence level of 0.99. However, this improvement in precision came at the expense of a significant reduction in sensitivity, as only about one-third of the true \u003cem\u003eH. sapiens\u003c/em\u003e reads (approximately 70,000 out of 240,000) were correctly identified. Moreover, when Kraken2 decontamination was performed using a broader eukaryotic database, the total number of misclassifications increased substantially, resulting in nearly equal numbers of false positives and true positives. Specifically, a large proportion of reads originating from \u003cem\u003eFlavobacterium\u003c/em\u003e were misclassified as \u003cem\u003eOstrinia furnacalis\u003c/em\u003e (a hexapod, i.e. 
an insect), while many \u003cem\u003eNovosphingobium\u003c/em\u003e-derived reads were incorrectly identified as belonging to the plant genus \u003cem\u003eElaeis\u003c/em\u003e, regardless of the confidence threshold employed.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eCurrent research efforts continue to investigate microbiomes using available sequencing technologies and bioinformatics workflows; however, many of these tools have been validated primarily for human-associated microbiota. Accordingly, this study aimed to systematically evaluate the advantages and limitations of commonly used taxonomic classification approaches following short-read DNA sequencing.\u003c/p\u003e \u003cp\u003eA considerable proportion of sequencing reads remained unclassified, even though the genomes composing the mock community were sourced from public databases, highlighting intrinsic classification limitations. The proportion of both unclassified and misclassified reads is expected to increase in real samples, given the higher complexity of real microbial communities and the presence of bacteria not represented in current databases.\u003c/p\u003e \u003cp\u003eAmong the classifiers tested, Kaiju with either nr euk or nr euk\u0026thinsp;+\u0026thinsp;demonstrated the best performance, capturing the relative abundance ratios of the most prevalent genera and species. However, approximately 25% of its classifications were incorrect, with the majority labelled as \"cannot be assigned to a (non-viral) genus\". Such classifications provide limited taxonomic resolution and are nearly as uninformative as the \"unclassified\" reads. While numerous misclassifications occurred with Kaiju, they were predominantly at very low relative abundances. Most of the species misclassified by Kaiju belonged to genera actually featured in the mock community, meaning that the misclassifications were taxonomically close to the expected assignment. 
The advantages conferred by Kaiju may stem from its in silico translation approach, which mitigates the impact of single-nucleotide errors or mutations on the taxonomic classifications. Although protein databases lack non-coding genomic regions, Kaiju is expected to be relatively effective on bacterial and viral genomes as they predominantly consist of coding sequences [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e] [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e].\u003c/p\u003e \u003cp\u003ekMetaShot ranked as the second-best classifier in terms of overall efficiency. However, it primarily identified only the most abundant taxa while maintaining a high degree of accuracy, meaning that it obtained high precision at the cost of sensitivity. Its lack of sensitivity may be attributed to its database construction methodology, as it was primarily tested on human-associated environments rather than environmental microbiomes [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e]. It is noteworthy that kMetaShot was run on MAGs assembled from short reads; hence, both its sensitivity and precision are expected to increase significantly with long-read sequencing.\u003c/p\u003e \u003cp\u003eRiboFrame\u0026rsquo;s estimation of the mock community was almost as precise as Kaiju\u0026rsquo;s, but it exhibited a few misclassifications and overestimated the relative abundance of \u003cem\u003eFlavobacterium\u003c/em\u003e, possibly owing to the higher copy number of the 16S rRNA gene in its genome compared to other bacteria such as \u003cem\u003eCandidatus\u003c/em\u003e Accumulibacter [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]. Kraken2, when used with the SILVA database, produced unreliable results, while RiboFrame successfully utilized the same database with minimal noise. 
Notably, RiboFrame had the lowest RAM requirements, confirming its suitability for short-read DNA sequencing analysis when high-performance computers are not available.\u003c/p\u003e \u003cp\u003eKraken2 run on the nt core database depicted a community similar to the mock, but its performance was inferior to that of the other tested classifiers. This outcome was unexpected, given that Kraken2 is frequently reported as one of the top-performing classifiers in human and soil microbiome studies [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]. Nevertheless, Kraken2's effectiveness was still confirmed, as it provided unique insights, being the only classifier that successfully detected all the true genera. In detail, at a confidence threshold of 0.05 with the nt core database, Kraken2 exhibited over 25% misclassifications but still managed to identify all clades present in the mock, including the T4 phage, albeit with incorrect abundance estimations for \u003cem\u003eCandidatus\u003c/em\u003e Accumulibacter, \u003cem\u003eZoogloea\u003c/em\u003e, and \u003cem\u003eCandidatus\u003c/em\u003e Competibacter (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). The observed inaccuracies are likely attributable to the database rather than the classifier itself, as many microbes associated with AS and AGS systems lack reliable reference genomes. In fact, inspecting the Kraken2 nt core database highlights the under-representation of many \u003cem\u003eCandidatus\u003c/em\u003e Accumulibacter and \u003cem\u003eCandidatus\u003c/em\u003e Competibacter species. This limitation has already been reported for other official Kraken2 databases; for example, Calder\u0026oacute;n-Franco et al. 
found that many AGS-related taxa are poorly annotated in the Kraken2 standard database [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eAll classifiers performed poorly when applied to contigs, suggesting suboptimal assembly. Specifically, contig-based classifications resulted in significant underestimation of many clades, including \u003cem\u003eCandidatus\u003c/em\u003e Accumulibacter and \u003cem\u003eCandidatus\u003c/em\u003e Competibacter, while overestimating others such as \u003cem\u003eNovosphingobium\u003c/em\u003e. Nevertheless, the contigs served as the basis for generating MAGs, which were classified with high accuracy using kMetaShot. Such contrasting outcomes suggest that the information gained by assembling MAGs outweighed the noise introduced by contig assembly. The most accurate MEGAHIT assembly setting proved to be the \"meta-large\" mode, albeit with a marginal improvement.\u003c/p\u003e \u003cp\u003eAs anticipated, lowering the confidence threshold increased the error rate with every classifier. However, raising the threshold did not offer a favourable trade-off between reducing noise and losing valuable information. For instance, at higher confidence thresholds, Kraken2 began to miss key species, suggesting that the optimal range for its classification lies between 0.05 and 0.3 for AS- and AGS-related environments. Similar trends were observed for RiboFrame and kMetaShot. Conversely, Kaiju exhibited minimal changes when increasing stringency (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e2\u003c/span\u003e). Thus, increasing the minimal alignment length threshold (\"m\") beyond 40 in Kaiju is suggested to further reduce its misclassifications without major losses in sensitivity. 
However, this may also result in a minimal loss of sensitivity for \u003cem\u003eRotifera\u003c/em\u003e, as more than 15,000 proteins known in this clade are shorter than 40 amino acids according to the current UniRef database [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThe classifier EukDetect2 exhibited perfect precision but suffered from extremely low sensitivity, as only a few reads were recognised as \u003cem\u003eDiploscapter\u003c/em\u003e. Similarly, Kaiju, when used with the nr euk\u0026thinsp;+\u0026thinsp;database, detected eukaryotic sequences with good accuracy, albeit missing many. The inclusion of eukaryotic sequences in the classification pipeline was beneficial for every tested classifier. For example, a Kaiju-specific database containing only viral or eukaryotic sequences enhanced the classifier's sensitivity but mostly increased its false positive ratios, even classifying many bacteria as eukaryotes. In contrast, Kaiju on nr euk rarely classified \u003cem\u003eHomo\u003c/em\u003e or \u003cem\u003eDiploscapter\u003c/em\u003e sequences as bacterial or vice versa, and was even more precise when using the nr euk + database. Similarly, Kraken2 assigned nearly all human-derived reads to the correct broad clade when using its complete database. Conversely, a comprehensive yet incomplete or unfocused database may result in a significant loss of information, as classifiers are more likely to assign reads to clades that are not actually present in the sampled environment. Such a limitation was observed when Kraken2 was used with the custom eukaryote database, leading to numerous false positives, such as misclassifications of bacteria as insects. 
It is important to emphasize that the list of eukaryotes included in this custom database should not be considered exhaustive of the possible eukaryotic contaminants in wastewater, but rather as a benchmark for potential misclassifications of microbial sequences.\u003c/p\u003e \u003cp\u003eOn the other hand, Kraken2 demonstrated superior accuracy in distinguishing human reads from bacterial sequences when using only the GRCh38 database at maximum confidence. Although Bowtie2 achieved significantly greater sensitivity in identifying \u003cem\u003eHomo\u003c/em\u003e reads in our simulations, it also mistook more microbial reads for human than Kraken2 used with a confidence of 0.99. Decontamination prior to classification would further reduce false positive classifications, as \u003cem\u003eHomo sapiens\u003c/em\u003e-derived reads were often mistaken for \u003cem\u003ePlasmodium ovale\u003c/em\u003e in our simulated scenario. The likelihood of \u003cem\u003eHomo sapiens\u003c/em\u003e DNA misclassification has already been reported in the literature; for example, Marcelino, Holmes and Sorrell highlighted the illogical inference of reptiles from human gut DNA samples [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]. However, given the low misclassification rate of \u003cem\u003eHomo sapiens\u003c/em\u003e reads with Kaiju (\u003cem\u003enr euk\u0026thinsp;+\u003c/em\u003e\u0026thinsp;database, stringent settings), decontamination should be carefully considered to avoid losing valuable microbial reads to rare false-positive \u003cem\u003eHomo\u003c/em\u003e assignments. Consequently, the optimal strategy may depend on sequencing depth (as bacterial reads are typically more abundant than animal- or plant-derived contaminants) and the nature of the influent feeding the reactor (i.e. real or synthetic wastewater). 
In real wastewater influent scenarios, particularly those originating from domestic sources, in silico decontamination of human reads using Kraken2 with a focused database at a high confidence threshold may be a viable strategy.\u003c/p\u003e \u003cp\u003eOverall, the results highlighted the risks of placing blind trust in classification outputs, particularly when interpreting low-abundance taxa. For instance, rare \u003cem\u003eDechloromonas\u003c/em\u003e reads were erroneously classified as \u003cem\u003eGallus gallus\u003c/em\u003e despite the application of high stringency thresholds, and fungal taxa were inferred despite their absence from the simulated community. While the former misclassification might be reasonably disregarded in practical scenarios due to its implausibility, the latter could misleadingly suggest the presence of fungi in the reactor. Due to the intentionally simplified design of the simulated mock community, it is not possible to define abundance thresholds. Nonetheless, the application of filtering thresholds previously proposed in the literature, such as 0.005% at species level [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e, \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e] (calculated including unclassified reads in the total) is still suggested.\u003c/p\u003e "},{"header":"Methods","content":"\u003cdiv id=\"Sec9\" class=\"Section3\"\u003e \u003ch2\u003eMock generation\u003c/h2\u003e \u003cp\u003eReference genomes of 14 bacterial species frequently observed in AGS and activated sludge microbial communities were downloaded from NCBI RefSeq using NCBI Datasets v16.22.0. 
Additionally, genomes of \u003cem\u003eCandidatus Moranbacteria\u003c/em\u003e and \u003cem\u003eSolirubrobacterales bacterium 67\u0026thinsp;\u0026minus;\u0026thinsp;14\u003c/em\u003e [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e] [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e], which have also been reported in AGS and activated sludge studies, were retrieved from GenBank, as these clades lack official reference genomes. To incorporate microbial eukaryotes and bacteriophages, the reference genomes of \u003cem\u003eDiploscapter\u003c/em\u003e spp., \u003cem\u003eParamecium\u003c/em\u003e spp. and a \u003cem\u003eT4 bacteriophage\u003c/em\u003e species were also included. Notably, \u003cem\u003eParamecium\u003c/em\u003e spp. were selected because they are the only ciliates with reference genomes available in the NCBI RefSeq database, whereas members of the Vorticellaceae family currently lack genomic data in both RefSeq and GenBank. Furthermore, the reference genome of \u003cem\u003eHomo sapiens\u003c/em\u003e was downloaded to account for potential traces of eukaryotic DNA originating from reactor influents. In total, genomes from 20 taxa (16 bacteria, 3 eukaryotes, and 1 virus) were collected. The full list of selected genera is provided in Table \u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eList of genera featured in the mock. The column \"Synonym\" indicates taxon names as listed in the NCBI database when they differ from those in other databases (e.g. SILVA). 
The \"Read Counts\" column presents the raw number of paired reads assigned to each clade in the mock dataset, while \"Read Percentages\" represents their relative abundance.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDomain\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGenus\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSynonym\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eReads Counts\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eReads Percentages\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBacteria\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003eCandidatus\u003c/em\u003e Accumulibacter\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e7500159\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e15\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBacteria\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003eCandidatus\u003c/em\u003e Competibacter\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e7499920\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e15\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBacteria\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003eThauera\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e6000210\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e12\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBacteria\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003eFlavobacterium\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e4000000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e8\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBacteria\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003eCandidatus\u003c/em\u003e Moranbacteria\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cem\u003eCandidatus\u003c/em\u003e Moraniibacteriota\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e3999999\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c5\"\u003e \u003cp\u003e8\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBacteria\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003eDechloromonas\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2500038\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBacteria\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003eNitrosomonas\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2500189\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBacteria\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003eZoogloea\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1999647\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e4\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBacteria\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003ePropionivibrio\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cem\u003eCandidatus\u003c/em\u003e 
Propionivibrio \u003c/p\u003e \u003cp\u003e(at species level)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1999980\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e4\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBacteria\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003eNovosphingobium\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2000782\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e4\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBacteria\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003eTetrasphaera\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cem\u003eNostocoides\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1999944\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e4\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBacteria\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003eAzoarcus\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1999998\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e4\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e 
\u003cp\u003eBacteria\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003eNitrobacter\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1999951\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e4\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBacteria\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003eDelftia\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1999998\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e4\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBacteria\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003e67\u0026thinsp;\u0026minus;\u0026thinsp;14\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cem\u003eSolirubrobacterales bacterium\u003c/em\u003e 67\u0026thinsp;\u0026minus;\u0026thinsp;14\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e500004\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBacteria\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003eHalomonas\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"char\" 
char=\".\" colname=\"c4\"\u003e \u003cp\u003e500688\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003enone\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ephage T4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cem\u003eTequatrovirus\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e250000\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.5\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eEukaryota\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003eHomo\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e250275\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.5\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eEukaryota\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003eDiploscapter\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e249997\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.5\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eEukaryota\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e 
\u003cp\u003e\u003cem\u003eParamecium\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e249980\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.5\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eSimulated untargeted sequencing of these genomes was performed using InSilicoSeq (ISS) v2.01 [\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e], emulating sequencing via Illumina NovaSeq.\u0026nbsp;This resulted in 150 bp paired-end reads at a total depth of 50\u0026nbsp;million paired reads. The random seed 1994 was used to ensure the full reproducibility of the results (see Data availability). The sequencing simulation was designed to generate precise relative abundances for each taxon, as detailed in Table \u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e \u003cp\u003eNotably, the simulated mock community comprises a few predominant bacterial taxa, with other taxa present at lower abundances, thereby reflecting realistic microbial community structures. In particular, \u003cem\u003eCandidatus\u003c/em\u003e Accumulibacter and \u003cem\u003eCandidatus\u003c/em\u003e Competibacter were the most abundant bacteria, leading to an abundance distribution resembling AGS communities more than AS communities [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. 
For the sake of readability, the genera featured in the mock will be referred to as \"true genera\" in this paper.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e\n\u003ch3\u003eProcessing and classifying the mock reads\u003c/h3\u003e\n\u003cp\u003eThe mock reads were filtered using BBDuk (a module of the BBTools suite, version 39.06) [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e] to remove sequences derived from Illumina adapters or phiX, very low-complexity sequences with an entropy value below 0.01 (\"entropy\u0026thinsp;=\u0026thinsp;0.01\"), 3' end regions with a Q-score lower than 20 (\"qtrim\u0026thinsp;=\u0026thinsp;r\", \"trimq\u0026thinsp;=\u0026thinsp;20\") and reads shorter than 100 bp (\"minlen\u0026thinsp;=\u0026thinsp;100\"), while taking into account the paired-end nature of the sequencing (\"tpe\", \"tpo\"). This pre-processing step was intentionally disregarded when comparing the estimated abundances with the known original ones, in order to incorporate actual sequencing biases into this benchmark. The filtered reads were classified either directly or after being assembled into contigs or metagenome-assembled genomes (MAGs). Contigs were assembled using MEGAHIT v1.2.9 [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e] under three different settings: \"default\", \"meta-large\" and \u0026ldquo;custom\u0026rdquo; (the latter employing a k-mer list of 35, 57, 79, 99, as used in the kMetaShot study [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e]). MAGs were subsequently reconstructed from the contigs for each setting using MetaBat v2.17 [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e]. 
The MAG assembly followed the same settings as in the kMetaShot study [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e] to ensure full compatibility with this classifier, as MAG identification was tested exclusively with kMetaShot.\u003c/p\u003e \u003cp\u003eThe classification was carried out with the widely used Kraken2 v2.1.2 [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e] with various confidence levels, Kaiju v1.10 [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e] with different E-value and minimal coverage thresholds, RiboFrame v1.0 [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e] with different confidence thresholds applied to both the full-length 16S rDNA and its V3-V4 hypervariable region featured among the reads, and kMetaShot v1.0 [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e] with multiple confidence levels. Bracken v2.7.0 [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e] was used to re-estimate the abundances of the taxa identified by Kraken2 according to their genome length and the sequenced read length. In particular, classification was conducted at the read level with RiboFrame, Kaiju, and Kraken2, at the contig level with Kaiju, Kraken2, and kMetaShot, and at the MAG level with kMetaShot. The analysis at the contig level using kMetaShot was conducted for each MEGAHIT setting described above, whereas classifications with Kaiju and Kraken2 (on the \"nt core\" database) were performed exclusively on contigs generated with the MEGAHIT \u0026lsquo;meta-large\u0026rsquo; option to avoid unnecessarily convoluted comparisons between the various setting combinations. 
Moreover, the confidence thresholds were not used when classifying at the contig level with kMetaShot, as almost every related confidence score was near zero.\u003c/p\u003e \u003cp\u003eKraken2 was used with both the \"nt core\" (built on December 28, 2024) and SILVA 138 official databases. Kaiju was tested with the \"nr euk\" and \"nr euk plus\" (referred to as \"nr euk +\") databases. The nr euk database, built in October 2023, is the most recent official distribution including bacteria, archaea, viruses, protozoa and fungi. In contrast, nr euk\u0026thinsp;+\u0026thinsp;is a customized version of this database, built with the most recent NCBI nr available (April 2024) and expanded to incorporate nr sequences from Platyhelminthes, Nematoda, Amoeba, Rotifera, Tardigrada, and \u003cem\u003eHomo sapiens\u003c/em\u003e. RiboFrame relied on the RDP classifier retrained on SILVA SSU 138. kMetaShot employed its own database, downloaded in February 2025. Clades without an official genus name in NCBI (e.g. \u003cem\u003eCandidatus\u003c/em\u003e Moranbacteria) were obtained from the species classifications and added to the genus-level outputs to reduce database biases in the genus-level comparison. A comprehensive list of all program, parameter, and database combinations used in this analysis is provided in \u003cb\u003eSupplementary table \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003e1\u003c/span\u003e\u003c/b\u003e.\u003c/p\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eComparison between classifier outcomes\u003c/h2\u003e \u003cp\u003eThe estimated microbial abundances in the mock datasets were compared across different settings using R v4.3, with the packages vegan v2.6.4 [\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e] and ecodist v2.1.3 [\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e]. Data visualization was performed using ggplot2 v3.4.4, ggvenn v0.1.10, and ggh4x v0.2.7. 
Synonymies across the employed databases were manually resolved, at least for the known genera included in the mock and the most abundant misclassifications, through careful searches in the List of Prokaryotic names with Standing in Nomenclature (LPSN) database [\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e]. Importantly, the unclassified reads were not included in the computation of percent abundances; hence, the analysis focused on the classifier-specific classifications. Principal Coordinate Analysis (PCoA) was conducted using the Hellinger distance, i.e. the Euclidean distance applied to Hellinger-transformed abundances, to account for the sparse and compositional nature of the data [\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e]. Additionally, the Bray-Curtis dissimilarity index, applied to proportional data, was used as an alternative ecological measure to ensure that the PCoA-related conclusions were not influenced by the choice of ecological distance. The most abundant misclassifications for each setting were identified by computing the average abundances of taxa that were incorrectly assigned, i.e. not actually present in the mock. All the analyses were primarily conducted at the genus level across all described settings, with additional species-level insights obtained by comparing Kaiju outputs at the read level (using both databases with settings E\u0026thinsp;=\u0026thinsp;0.00001 and m\u0026thinsp;=\u0026thinsp;42), Kraken2 at the read level (using the nt core database with confidence thresholds of 0.15 and 0.99) and kMetaShot at the MAG level (after contig assembly through MEGAHIT with the \u0026ldquo;meta-large\u0026rdquo; option). 
These programs and settings were specifically chosen for the species-level comparison because they are theoretically capable of providing such taxonomic detail and performed generally well at the genus level.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003eFocus on non-prokaryote-derived reads\u003c/h2\u003e \u003cp\u003eIn addition to the listed software and parameter combinations used for classifying the bulk community, additional analyses were conducted to specifically assess potential misclassifications of non-prokaryotic reads. Read-level classification was performed using Kaiju v1.10 with the pre-built viral sequence database from RefSeq to further investigate the false-negative classifications of this clade observed with the full database. Additionally, Kaiju was executed with a custom database constructed by selecting only common metazoan sequences found in activated sludge (i.e. Rotifers, Platyhelminths, Nematodes, Amoebae and Tardigrades) from the UniRef100 protein database. Furthermore, EukDetect v1.3 [\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e] was applied to unfiltered reads using its default database, EukDetect database v9, which includes lower metazoans in its recent releases. Finally, additional attention was paid to the misclassification of Homo sapiens reads as bacteria. The identification of Homo sapiens reads (as an optional decontamination step prior to the actual microbial classification) was performed with Kraken2 using both the GRCh38 reference genome [\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e] and a custom database at various confidence levels, and with Bowtie2 [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e] in paired-end mode with the \u0026ldquo;very-sensitive\u0026rdquo; option using both local alignment and end-to-end alignment. 
The custom database used in Kraken2 was constructed from the reference genomes of various higher eukaryotes whose residual DNA fragments are likely to be present in the wastewater feeding AGS and AS reactors, including Hexapoda, Annelida, Chlorophyta, plants (Kraken2 reference sequences), \u003cem\u003eHomo sapiens\u003c/em\u003e and \u003cem\u003eMus musculus\u003c/em\u003e.\u003c/p\u003e \u003c/div\u003e "},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003e\u003cu\u003eData availability\u003c/u\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe raw FASTQ files of the simulated mock community are publicly available on NCBI SRA with the accession code PRJNA1252002. The resulting counts for each classifier, the R data containing the feature table ready for the analysis and the scripts are available at https://github.com/LeandroD94/Papers/tree/main/2025_Benchmark_DNAseq_classifiers_AGS_and_AS .\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cu\u003eAuthor contributions\u003c/u\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eD.L.: design, analysis, writing. R.M.: supervision, review and editing.\u003cbr\u003e\u0026nbsp;\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eRobles, \u0026Aacute;. et al. New frontiers from removal to recycling of nitrogen and phosphorus from wastewater in the Circular Economy. \u003cem\u003eBioresour. Technol.\u003c/em\u003e \u003cb\u003e300\u003c/b\u003e, 122673. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.biortech.2019.122673\u003c/span\u003e\u003cspan address=\"10.1016/j.biortech.2019.122673\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCampo, R. et al. Efficient carbon, nitrogen and phosphorus removal from low C/N real domestic wastewater with aerobic granular sludge. \u003cem\u003eBioresour. 
Technol.\u003c/em\u003e \u003cb\u003e305\u003c/b\u003e, 122961. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.biortech.2020.122961\u003c/span\u003e\u003cspan address=\"10.1016/j.biortech.2020.122961\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang, Y. et al. A review of the phosphorus removal of polyphosphate-accumulating organisms in natural and engineered systems. \u003cem\u003eSci. Total Environ.\u003c/em\u003e \u003cb\u003e912\u003c/b\u003e, 169103. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1016/j.scitotenv.2023.169103\u003c/span\u003e\u003cspan address=\"10.1016/j.scitotenv.2023.169103\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWinkler, M. K. H. et al. An integrative review of granular sludge for the biological removal of nutrients and recalcitrant organic matter from wastewater. \u003cem\u003eChem. Eng. J.\u003c/em\u003e \u003cb\u003e336\u003c/b\u003e, 489\u0026ndash;502. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.cej.2017.12.026\u003c/span\u003e\u003cspan address=\"10.1016/j.cej.2017.12.026\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSu, J. F., Li, G. Q., Huang, T. L. \u0026amp; Xue, L. The mixotrophic denitrification characteristics of Zoogloea sp. L2 accelerated by the redox mediator of 2-hydroxy-1,4-naphthoquinone. \u003cem\u003eBioresour. Technol.\u003c/em\u003e \u003cb\u003e311\u003c/b\u003e, 123533. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.biortech.2020.123533\u003c/span\u003e\u003cspan address=\"10.1016/j.biortech.2020.123533\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhang, M., Li, A., Yao, Q., Xiao, B. \u0026amp; Zhu, H. Pseudomonas oligotrophica sp. nov., a Novel Denitrifying Bacterium Possessing Nitrogen Removal Capability Under Low Carbon\u0026ndash;Nitrogen Ratio Condition. Volume 13\u0026ndash;2022, (2022). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.3389/fmicb.2022.882890\u003c/span\u003e\u003cspan address=\"10.3389/fmicb.2022.882890\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYe, J. et al. Denitrifying communities enriched with mixed nitrogen oxides preferentially reduce N2O under conditions of electron competition in wastewater. \u003cem\u003eChem. Eng. J.\u003c/em\u003e \u003cb\u003e498\u003c/b\u003e, 155292. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.cej.2024.155292\u003c/span\u003e\u003cspan address=\"10.1016/j.cej.2024.155292\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWil\u0026eacute;n, B. M., Li\u0026eacute;bana, R., Persson, F., Modin, O. \u0026amp; Hermansson, M. The mechanisms of granulation of activated sludge in wastewater treatment, its optimization, and impact on effluent quality. \u003cem\u003eAppl. Microbiol. Biotechnol.\u003c/em\u003e \u003cb\u003e102\u003c/b\u003e, 5005\u0026ndash;5020. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1007/s00253-018-8990-9\u003c/span\u003e\u003cspan address=\"10.1007/s00253-018-8990-9\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eEkholm, J. et al. Microbiome structure and function in parallel full-scale aerobic granular sludge and activated sludge processes. \u003cem\u003eAppl. Microbiol. Biotechnol.\u003c/em\u003e \u003cb\u003e108\u003c/b\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1007/s00253-024-13165-8\u003c/span\u003e\u003cspan address=\"10.1007/s00253-024-13165-8\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFalcioni, S. et al. in \u003cem\u003eResource Recovery from Wastewater Treatment.\u003c/em\u003e (eds Giorgio Mannina, Alida Cosenza, \u0026amp; Antonio Mineo) 140\u0026ndash;146 (Springer Nature Switzerland).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAdewale, B. A. Will long-read sequencing technologies replace short-read sequencing technologies in the next 10 years? \u003cem\u003eAfr. J. Lab. Med.\u003c/em\u003e \u003cb\u003e9\u003c/b\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.4102/ajlm.v9i1.1340\u003c/span\u003e\u003cspan address=\"10.4102/ajlm.v9i1.1340\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBush, S. J., Connor, T. R., Peto, T. E. A., Crook, D. W. \u0026amp; Walker, A. S. Evaluation of methods for detecting human reads in microbial sequencing datasets. \u003cem\u003eMicrob. genomics\u003c/em\u003e. 
\u003cb\u003e6\u003c/b\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1099/mgen.0.000393\u003c/span\u003e\u003cspan address=\"10.1099/mgen.0.000393\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChorlton, S. D. Ten common issues with reference sequence databases and how to mitigate them. \u003cem\u003eFront. Bioinf.\u003c/em\u003e \u003cb\u003e4\u003c/b\u003e, 1278228. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.3389/fbinf.2024.1278228\u003c/span\u003e\u003cspan address=\"10.3389/fbinf.2024.1278228\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMenzel, P., Ng, K. L. \u0026amp; Krogh, A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. \u003cem\u003eNat. Commun.\u003c/em\u003e \u003cb\u003e7\u003c/b\u003e, 11257. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1038/ncomms11257\u003c/span\u003e\u003cspan address=\"10.1038/ncomms11257\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2016).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWood, D. E., Lu, J. \u0026amp; Langmead, B. Improved metagenomic analysis with Kraken 2. \u003cem\u003eGenome Biol.\u003c/em\u003e \u003cb\u003e20\u003c/b\u003e, 257. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1186/s13059-019-1891-0\u003c/span\u003e\u003cspan address=\"10.1186/s13059-019-1891-0\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRamazzotti, M., Bern\u0026aacute;, L., Donati, C. \u0026amp; Cavalieri, D. riboFrame: An Improved Method for Microbial Taxonomy Profiling from Non-Targeted Metagenomics. \u003cem\u003eFront. 
Genet.\u003c/em\u003e \u003cb\u003e6\u003c/b\u003e, 329. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.3389/fgene.2015.00329\u003c/span\u003e\u003cspan address=\"10.3389/fgene.2015.00329\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2015).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDefazio, G., Tangaro, M. A., Pesole, G. \u0026amp; Fosso, B. kMetaShot: a fast and reliable taxonomy classifier for metagenome-assembled genomes. \u003cem\u003eBrief. Bioinform.\u003c/em\u003e \u003cb\u003e26\u003c/b\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1093/bib/bbae680\u003c/span\u003e\u003cspan address=\"10.1093/bib/bbae680\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDi Gloria, L. et al. Experimental tests challenge the evidence of a healthy human blood microbiome. \u003cem\u003eFEBS J.\u003c/em\u003e \u003cb\u003e292\u003c/b\u003e, 796\u0026ndash;808. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1111/febs.17362\u003c/span\u003e\u003cspan address=\"10.1111/febs.17362\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLangmead, B. \u0026amp; Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. \u003cem\u003eNat. Methods\u003c/em\u003e. \u003cb\u003e9\u003c/b\u003e, 357\u0026ndash;359. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1038/nmeth.1923\u003c/span\u003e\u003cspan address=\"10.1038/nmeth.1923\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2012).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDueholm, M. K. D. et al. MiDAS 5: Global diversity of bacteria and archaea in anaerobic digesters. \u003cem\u003eNat. 
Commun.\u003c/em\u003e \u003cb\u003e15\u003c/b\u003e, 5361. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1038/s41467-024-49641-y\u003c/span\u003e\u003cspan address=\"10.1038/s41467-024-49641-y\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eEdwin, N. R., Fitzpatrick, A. H., Brennan, F., Abram, F. \u0026amp; O\u0026rsquo;Sullivan, O. An in-depth evaluation of metagenomic classifiers for soil microbiomes. \u003cem\u003eEnviron. Microbiome\u003c/em\u003e. \u003cb\u003e19\u003c/b\u003e, 19. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1186/s40793-024-00561-w\u003c/span\u003e\u003cspan address=\"10.1186/s40793-024-00561-w\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCalder\u0026oacute;n-Franco, D. et al. Metagenomic profiling and transfer dynamics of antibiotic resistance determinants in a full-scale granular sludge wastewater treatment plant. \u003cem\u003eWater Res.\u003c/em\u003e \u003cb\u003e219\u003c/b\u003e, 118571. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.watres.2022.118571\u003c/span\u003e\u003cspan address=\"10.1016/j.watres.2022.118571\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eThe UniProt, C. UniProt: the Universal Protein Knowledgebase in 2025. \u003cem\u003eNucleic Acids Res.\u003c/em\u003e \u003cb\u003e53\u003c/b\u003e, D609\u0026ndash;D617. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1093/nar/gkae1010\u003c/span\u003e\u003cspan address=\"10.1093/nar/gkae1010\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAmos, G. C. A. et al. Developing standards for the microbiome field. \u003cem\u003eMicrobiome\u003c/em\u003e \u003cb\u003e8\u003c/b\u003e, 98. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1186/s40168-020-00856-3\u003c/span\u003e\u003cspan address=\"10.1186/s40168-020-00856-3\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGu, Y., Li, B., Zhong, X., Liu, C. \u0026amp; Ma, B. Bacterial Community Composition and Function in a Tropical Municipal Wastewater Treatment Plant. \u003cb\u003e14\u003c/b\u003e, 1537 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXin, Z., Yang, L. \u0026amp; Yang, L. Divergences of granules and flocs microbial communities and contributions to nitrogen removal under varied carbon to nitrogen ratios. \u003cem\u003eBioresour. Technol.\u003c/em\u003e \u003cb\u003e425\u003c/b\u003e, 132226. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.biortech.2025.132226\u003c/span\u003e\u003cspan address=\"10.1016/j.biortech.2025.132226\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGourl\u0026eacute;, H., Karlsson-Lindsj\u0026ouml;, O., Hayer, J. \u0026amp; Bongcam-Rudloff, E. Simulating Illumina metagenomic data with InSilicoSeq. \u003cem\u003eBioinformatics (Oxford, England)\u003c/em\u003e 35, 521\u0026ndash;522, (2018). 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1093/bioinformatics/bty630%J\u003c/span\u003e\u003cspan address=\"10.1093/bioinformatics/bty630%J\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e Bioinformatics.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBushnell, B., Rood, J. \u0026amp; Singer, E. BBMerge \u0026ndash; Accurate paired shotgun read merging via overlap. \u003cem\u003ePLoS ONE\u003c/em\u003e. \u003cb\u003e12\u003c/b\u003e, e0185056\u0026ndash;e0185056. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1371/journal.pone.0185056\u003c/span\u003e\u003cspan address=\"10.1371/journal.pone.0185056\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi, D., Liu, C. M., Luo, R., Sadakane, K. \u0026amp; Lam, T. W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. \u003cem\u003eBioinf. (Oxford England)\u003c/em\u003e. \u003cb\u003e31\u003c/b\u003e, 1674\u0026ndash;1676. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1093/bioinformatics/btv033\u003c/span\u003e\u003cspan address=\"10.1093/bioinformatics/btv033\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2015).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKang, D. D. et al. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. \u003cem\u003ePeerJ\u003c/em\u003e \u003cb\u003e7\u003c/b\u003e, e7359. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.7717/peerj.7359\u003c/span\u003e\u003cspan address=\"10.7717/peerj.7359\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLu, J., Breitwieser, F. P., Thielen, P., Salzberg, S. L. \u0026amp; Bracken Estimating species abundance in metagenomics data. \u003cb\u003e051813\u003c/b\u003e, (2016). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1101/051813%J bioRxiv\u003c/span\u003e\u003cspan address=\"10.1101/051813%J bioRxiv\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVegan Community Ecology Package (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGoslee, S. C. \u0026amp; Urban, D. L. The ecodist Package for Dissimilarity-based Analysis of Ecological Data. \u003cem\u003eJ. Stat. Softw.\u003c/em\u003e \u003cb\u003e22\u003c/b\u003e, 1\u0026ndash;19. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.18637/jss.v022.i07\u003c/span\u003e\u003cspan address=\"10.18637/jss.v022.i07\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2007).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eParte, A. C., Sard\u0026agrave; Carbasse, J., Meier-Kolthoff, J. P., Reimer, L. C. \u0026amp; G\u0026ouml;ker, M. List of Prokaryotic names with Standing in Nomenclature (LPSN) moves to the DSMZ. \u003cb\u003e70\u003c/b\u003e, 5607\u0026ndash;5612, (2020). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1099/ijsem.0.004332\u003c/span\u003e\u003cspan address=\"10.1099/ijsem.0.004332\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLegendre, P. \u0026amp; Legendre, L. J. 
D. i. E. M. Chapter 7 \u0026ndash; Ecological resemblance. 24, 265\u0026ndash;335 (2012).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLind, A. L. \u0026amp; Pollard, K. S. Accurate and sensitive detection of microbial eukaryotes from whole metagenome shotgun sequencing. \u003cem\u003eMicrobiome\u003c/em\u003e \u003cb\u003e9\u003c/b\u003e, 58. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1186/s40168-021-01015-y\u003c/span\u003e\u003cspan address=\"10.1186/s40168-021-01015-y\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSchneider, V. A. et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. 072116, (2016). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1101/072116%J bioRxiv\u003c/span\u003e\u003cspan address=\"10.1101/072116%J bioRxiv\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific 
Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Wastewater, Microbial community, Classifications, Aerobic Granular Sludge, Benchmark","lastPublishedDoi":"10.21203/rs.3.rs-6485360/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6485360/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eBiological wastewater treatment processes, such as activated sludge (AS) and aerobic granular sludge (AGS), have proven to be crucial systems for achieving both efficient waste purification and the recovery of valuable resources like poly-hydroxy-alkanoates (PHA). Gaining a deeper understanding of the microbial communities underpinning these technologies would enable their optimization, ultimately reducing costs and increasing efficiency. To support this research, we quantitatively compared classification methods differing in read length (raw reads, contigs and MAGs), overall search approach (Kaiju, Kraken2, RiboFrame and kMetaShot), as well as source databases to assess the classification performances at both the genus and species levels using an in silico-generated mock community designed to provide a simplified yet comprehensive representation of the complex microbial ecosystems found in AS and AGS.\u003c/p\u003e \u003cp\u003eParticular attention was given to the misclassification of eukaryotes as bacteria and vice versa, as well as the occurrence of false negatives. Notably, Kaiju emerged as the most accurate classifier at both the genus and species levels, followed by RiboFrame and kMetaShot. 
However, our findings highlight the substantial risk of misclassification across all classifiers and databases, which could significantly hinder the advancement of these technologies by introducing noises and mistakes for key microbial clades.\u003c/p\u003e","manuscriptTitle":"Testing the limits of short-reads metagenomic classifications programs in waste water treating microbial communities","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-05-08 11:49:11","doi":"10.21203/rs.3.rs-6485360/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2025-06-03T07:07:35+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-05-31T08:46:12+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-05-18T22:30:02+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"262692131456565146116585291126150030917","date":"2025-05-08T15:05:22+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"76440667966734682932433012788190313080","date":"2025-05-07T11:51:49+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2025-05-05T15:40:49+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2025-04-24T19:41:32+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2025-04-21T18:12:37+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2025-04-19T15:03:53+00:00","index":"","fulltext":""},{"type":"submitted","content":"Scientific Reports","date":"2025-04-19T14:33:42+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific 
Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"f7a7f9f6-24f2-4514-ba2f-3833a70965ae","owner":[],"postedDate":"May 8th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[{"id":48221064,"name":"Biological sciences/Biological techniques/Bioinformatics"},{"id":48221065,"name":"Biological sciences/Biological techniques/Genomic analysis"},{"id":48221066,"name":"Biological sciences/Biological techniques/Sequencing"},{"id":48221067,"name":"Biological sciences/Biological techniques"},{"id":48221068,"name":"Earth and environmental sciences/Ecology"},{"id":48221069,"name":"Earth and environmental sciences/Environmental sciences"},{"id":48221070,"name":"Biological sciences/Ecology/Ecological genetics"},{"id":48221071,"name":"Biological sciences/Ecology/Microbial ecology"},{"id":48221072,"name":"Biological sciences/Ecology/Molecular ecology"},{"id":48221073,"name":"Biological sciences/Ecology/Restoration ecology"}],"tags":[],"updatedAt":"2025-07-07T16:13:33+00:00","versionOfRecord":{"articleIdentity":"rs-6485360","link":"https://doi.org/10.1038/s41598-025-07734-8","journal":{"identity":"scientific-reports","isVorOnly":false,"title":"Scientific Reports"},"publishedOn":"2025-07-05 15:58:41","publishedOnDateReadable":"July 5th, 2025"},"versionCreatedAt":"2025-05-08 
11:49:11","video":"","vorDoi":"10.1038/s41598-025-07734-8","vorDoiUrl":"https://doi.org/10.1038/s41598-025-07734-8","workflowStages":[]},"version":"v1","identity":"rs-6485360","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6485360","identity":"rs-6485360","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
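The two ecological distances named in the comparison methods (Hellinger distance, described there as the Euclidean distance on Hellinger-transformed abundances, and Bray-Curtis dissimilarity on proportional data) can be illustrated with a minimal sketch. This is not the authors' code: their analysis was run in R with vegan and ecodist, whereas the Python functions below are written from the textbook definitions. The function names, the `unclassified` label, and the toy read counts are hypothetical, chosen only for illustration. Unclassified reads are dropped before computing proportions, as the methods state.

```python
import math


def relative_abundances(counts, drop=("unclassified",)):
    """Convert read counts per taxon to proportions, ignoring unclassified reads.

    The label "unclassified" is a hypothetical placeholder for whatever name a
    classifier gives to reads it could not assign.
    """
    kept = {taxon: n for taxon, n in counts.items() if taxon not in drop}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("no classified reads to normalise")
    return {taxon: n / total for taxon, n in kept.items()}


def hellinger_distance(counts_a, counts_b):
    """Euclidean distance between square-root-transformed proportions."""
    pa = relative_abundances(counts_a)
    pb = relative_abundances(counts_b)
    taxa = set(pa) | set(pb)
    squared = sum(
        (math.sqrt(pa.get(t, 0.0)) - math.sqrt(pb.get(t, 0.0))) ** 2 for t in taxa
    )
    return math.sqrt(squared)


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity computed on proportions."""
    pa = relative_abundances(counts_a)
    pb = relative_abundances(counts_b)
    taxa = set(pa) | set(pb)
    numerator = sum(abs(pa.get(t, 0.0) - pb.get(t, 0.0)) for t in taxa)
    denominator = sum(pa.get(t, 0.0) + pb.get(t, 0.0) for t in taxa)
    return numerator / denominator


# Toy genus-level profiles (invented numbers, not data from the paper).
mock = {"Zoogloea": 50, "Pseudomonas": 50}
observed = {"Zoogloea": 40, "Pseudomonas": 40, "Escherichia": 20, "unclassified": 100}

print("Hellinger:", round(hellinger_distance(mock, observed), 4))
print("Bray-Curtis:", round(bray_curtis(mock, observed), 4))
```

Both functions return 0 for identical profiles. For profiles sharing no taxa, the Hellinger distance reaches its maximum of the square root of 2 and Bray-Curtis reaches 1. A PCoA would then be run on the full pairwise distance matrix, which this sketch does not cover.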
