B-GUT reference genome database improves biomarker discovery and fungal identification in gut metagenomes | Research Square
Method Article
Olfat Khannous-Lleiffe, Toni Gabaldón
This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-6766778/v1 This work is licensed under a CC BY 4.0 License. Status: Posted (Version 1).
Abstract Background Accurate taxonomy assignment to sequencing reads is a key step in metagenomic studies, impacting all downstream analyses. The accuracy of this step critically depends on the quality and comprehensiveness of the reference genome database used. While fungi are ubiquitous and relevant in the human microbiome, they are generally poorly represented in current databases.
To address this and other limitations, we developed B-GUT, a custom Kraken2 database that integrates i) a broad and curated collection of 2,110 fungal reference genomes; ii) the human telomere-to-telomere reference genome; and iii) two available curated databases for gut-specific bacterial and archaeal genomes. Results Our analysis of publicly available fungal genomes revealed significant contamination and substantial cross-mapping of human sequencing reads to fungal genome references, underscoring the necessity of rigorous curation and accurate host read filtering. We validated our genome curation pipeline and the resulting B-GUT database using mock microbial communities with known compositions. Finally, we showcased the utility of B-GUT by re-analysing data from a published colorectal cancer metagenomics study, where its use led to significantly improved results, providing more precise taxonomic assignments and a more accurate identification of differentially abundant taxa for both bacterial and fungal communities. Conclusions We introduce B-GUT, a reference genome database centered on the gut microbiome, featuring a uniquely curated and comprehensive collection of fungal genomes, which are often underrepresented in existing resources. We demonstrate the importance of database curation and the enhanced capacity of B-GUT for identifying biologically relevant microbes. Keywords: Metagenomics, Niche-specific database, Fungi, biomarker discovery, colorectal cancer INTRODUCTION Microbiome research has revolutionized our understanding of complex microbial communities and their interactions, impacting diverse fields such as human health and environmental research. Microbiome analysis relies on cutting-edge sequencing technologies, with whole-genome shotgun sequencing (i.e. metagenomics) emerging as the approach of choice when highly resolved taxonomic information or functional insights are needed [ 1 ].
A crucial step in metagenomics is taxonomic assignment, the process of inferring the most likely taxonomic affiliation for each sequencing read, which allows determining the presence and abundance of different taxa within a sample. One approach for taxonomic assignment relies on comparing the sequence of reads to those of reference genomes compiled in a database. Importantly, the accuracy and resolution of the taxonomic assignment are inherently dependent on the quality and comprehensiveness of the genome database used for the assignment. However, while much focus is placed on the algorithms to perform these comparisons, standard reference databases often fall short of providing a complete and accurate representation of microbial communities, due to poor representation of certain groups, or to the presence of mis-annotated or contaminated reference genomes [ 1 , 2 ]. Additionally, recent research has shown that the use of so-called niche-specific databases -i.e. databases containing only genomes from species or strains isolated in a given niche, like the human gut - can reduce ambiguity in taxonomic assignments as compared to non-specific databases by exploiting prior knowledge of taxa previously identified in a given niche [ 3 – 5 ]. Finally, existing databases are biased towards prokaryotes and there is a need to include genomic information from microbial eukaryotes. In this regard fungi are ubiquitous components of the human microbiome, where they play key roles despite their generally low relative abundance [ 6 , 7 ]. For instance, fungi inhabiting the gut have been shown to have important impacts in the pathogenesis in compromised individuals, as a cofactor in inflammatory disease, and in the development of colorectal cancer (CRC) [ 8 , 9 ]. 
One of the most widely used tools for taxonomy assignment in metagenomics is Kraken2 [ 10 ], which is a classification system based on matching k-mers (short subsequences) from sequencing reads to a reference genome database. Some pre-built kraken2 reference genome databases are available in an amazon web service linked to the developer’s website. These references are compiled from publicly available complete genomes from Refseq NCBI, and are therefore general in scope and containing a very limited number of fungal species, as compared to bacterial genomes. Nevertheless, the tool allows users to use alternative custom-made databases. A prime example of a custom-made database for kraken2 is Humgut, a genome catalogue of human gut-specific prokaryotes [ 11 ]. Using this opportunity we set out to address some of the gaps mentioned above, with a particular focus on addressing the poor representation of fungal genomes in standard databases. For this we first compiled and curated a comprehensive dataset representing the broad diversity of available fungal genomes, which we combined with the above-mentioned Humgut catalogue. Finally we included the recently available telomere-to-telomere assembly of the human genome. The resulting broad gut microbiome database (B-GUT), was tested on data from defined microbial communities and on publicly available shotgun metagenomics data from fecal samples. Our results identified contamination and cross-mapping issues in publicly available fungal genomes, highlighting the need for data curation. In addition, we demonstrated the importance of using a telomere-to-telomere genome for efficiently removing host reads and minimizing wrong taxonomic assignments. Finally, we showed that the use of B-GUT as compared to the standard database provides a more precise characterization of metagenomics samples, enhancing biomarker discovery. 
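To make the idea of k-mer-based read classification concrete, the following is a minimal sketch in Python. It is not the Kraken2 algorithm (which relies on minimizers and lowest-common-ancestor assignment over a taxonomy); it only illustrates, under simplifying assumptions, how reads can be assigned by looking up their k-mers in an index built from reference sequences. The reference sequences, taxon names and the value of k are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(references, k):
    """Map each k-mer to the set of taxa whose reference contains it."""
    index = {}
    for taxon, genome in references.items():
        for kmer in kmers(genome, k):
            index.setdefault(kmer, set()).add(taxon)
    return index


def classify(read, index, k):
    """Assign a read to the taxon supported by most of its unambiguous k-mers."""
    votes = Counter()
    for kmer in kmers(read, k):
        taxa = index.get(kmer, set())
        if len(taxa) == 1:  # skip k-mers that are absent or shared between taxa
            votes[next(iter(taxa))] += 1
    if not votes:
        return "unclassified"
    return votes.most_common(1)[0][0]


# Hypothetical toy references, not real genomes.
references = {
    "taxon_A": "ACGTACGTTAGCCGATAGGCTA",
    "taxon_B": "TTGACCGGTAACCGTTAGGACA",
}
index = build_index(references, k=5)

print(classify("CGTTAGCCGATAG", index, k=5))  # read drawn from taxon_A
print(classify("GGGGGGGGGG", index, k=5))     # read with no matching k-mers
```

The sketch also shows why database content matters: a read from an organism missing from the index is either left unclassified or assigned to whichever reference happens to share k-mers with it.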
METHODS Genome datasets We created a custom reference database for the Kraken2 assignment tool, comprising a total of 33,099 genomes. This database (B-GUT) builds on a previously curated prokaryotic genome database, comprising two large gut-specific genome catalogues: the Humgut database [ 11 ] (which comprises data from the Unified Human Gastrointestinal Genomes catalog [ 3 ] and Refseq NCBI [ 12 ]), and the ArcheomeDb [ 13 ]. To this, we added the latest version of the human reference genome (telomere-to-telomere, T2T) [ 14 ] and a newly generated fungal genome dataset integrating curated (see below) and non-redundant data from Ensembl Fungi [ 15 ], FungiDb [ 16 ], Mycocosm [ 17 ], and Refseq NCBI [ 18 ], all last accessed in July 2022. Contamination assessment and curation of fungal genomes Potential contamination in fungal genomes was initially assessed by running taxonomic assignment using the standard Kraken2 database [ 10 ]. Additional analyses in specific genomes were performed using BUSCO (v. 5.2.2, database: bacteria_odb10) to calculate bacterial completeness among other metrics [ 19 ], QUAST (v. 5.0.2) to calculate GC content [ 20 ], as well as tiara (v. 1.0.3) [ 21 ], a deep learning tool, to assess the percentage of bacterial hits. A decontamination pipeline (B-GUT decontamination, available in our Github repository https://github.com/Gabaldonlab/B-GUT-decontamination ) was built to curate the fungal genomes in an automated manner. The pipeline first removes contigs smaller than 3 kb and those classified by tiara as bacteria, archaea, prokaryote or unknown. After this contig-level filtering, potentially contaminated genomes were identified and removed if they fulfilled all of the following criteria: they contained a proportion of bacterial sequences of 5% or higher (as assessed by Kraken2), with the most abundant bacterial species representing more than 5% of the bacterial sequences (as contaminations are expected to have a dominant species).
The other requirement was a multimodal GC content distribution: we calculated GC content using the SeqKit tool (v. 2.3.0) and tested the statistical significance of the multimodality with the function dip.test from the R package diptest (v. 0.77-1). The contig removal step (tiara) in the decontamination strategy was evaluated with a custom fasta file including genomes from 11 species in which we do not expect contamination, used as a positive control: 4 bacteria ( Helicobacter pylori (NCBI, NZ_CP071982.1), Mycobacterium tuberculosis (NCBI, NC_000962.3), Bartonella bacilliformis (NCBI, NZ_CP045671.1) and Xanthomonas euvesicatoria (NCBI, NZ_CP072268.1) ), 5 fungi ( Amniculicola lignicola (Mycocosm, Amnli1), Absidia glauca (Ensembl, GCA_900079185.1), Cryptococcus flavescens (NCBI, CAUG01000001.1), Claviceps sorghi (NCBI, SRPV01000001.1) and Epichloe glyceriae (FungiDb, GCA_000225285.2) ), 1 insect ( Homalodisca vitripennis (NCBI, NC_060207.1) ) and 1 virus ( Pseudomonas phage LUZ7 (NCBI, NC_013691.1) ). In addition, we evaluated the pipeline on genomes from 14 fungal model organisms: Candida albicans (Mycocosm, Canalb1), Candida glabrata (FungiDb, GCA_000002545.2), Aspergillus fumigatus (Mycocosm, Aspfu_A1163_1), Saccharomyces cerevisiae (Mycocosm, SacceM3707_1), Neurospora crassa (Mycocosm, Neucr_trp3_1), Fusarium oxysporum (Mycocosm, FoxF1003_1), Cryptococcus neoformans (Mycocosm, Cryne_JEC21_1), Penicillium digitatum (Mycocosm, Pendi1), Ustilago maydis (Mycocosm, Ustma2_2), Yarrowia lipolytica (NCBI, GCA_001761485.1), Allomyces macrogynus (Mycocosm, Allma1), Spizellomyces punctatus (Mycocosm, Spipu1), Schizosaccharomyces pombe (Mycocosm, Schpo1) and Laccaria bicolor (Mycocosm, Lacbi81306_1).
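The filtering logic described in this subsection can be summarised in a short Python sketch. This is an illustration, not the code from the B-GUT decontamination repository. It assumes that the per-contig tiara class, the Kraken2-derived fractions and the dip-test p-value have already been computed by the respective tools. The class label spellings, the function names and the 0.05 significance level for the dip test are assumptions made for the example.

```python
from dataclasses import dataclass

# Contig classes that lead to removal (label spellings are assumed here).
REMOVED_CLASSES = {"bacteria", "archaea", "prokarya", "unknown"}
MIN_CONTIG_LENGTH = 3000  # contigs smaller than 3 kb are removed


@dataclass
class Contig:
    name: str
    length: int
    tiara_class: str


def keep_contig(contig):
    """Contig-level filter: long enough and not flagged as non-eukaryotic/unknown."""
    return (contig.length >= MIN_CONTIG_LENGTH
            and contig.tiara_class not in REMOVED_CLASSES)


def genome_is_contaminated(bacterial_fraction, top_species_share,
                           dip_p_value, alpha=0.05):
    """Genome-level filter: a genome is flagged only if ALL criteria hold.

    bacterial_fraction: fraction of sequences assigned to bacteria (0-1)
    top_species_share:  share of the bacterial sequences taken by the
                        most abundant bacterial species (0-1)
    dip_p_value:        p-value of a dip test on the GC content distribution
    alpha:              significance level (assumed, not stated in the text)
    """
    return (bacterial_fraction >= 0.05
            and top_species_share > 0.05
            and dip_p_value < alpha)


# Hypothetical contigs and metrics for a single genome.
contigs = [
    Contig("c1", 120_000, "eukarya"),
    Contig("c2", 2_500, "eukarya"),    # too short
    Contig("c3", 40_000, "bacteria"),  # flagged as bacterial
]
kept = [c.name for c in contigs if keep_contig(c)]
print(kept)
print(genome_is_contaminated(0.30, 0.40, 0.001))
```

Requiring all three genome-level criteria at once makes the filter conservative: a genome with some bacterial hits but a unimodal GC distribution is retained, and only its flagged contigs are dropped.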
We further manually removed genomes from known Saccharomyces hybrids, namely Saccharomyces kudriavzevii (Ensembl, GCA_000167075.2), Saccharomyces pastorianus (Ensembl, GCA_011022315.1) and Saccharomyces boulardii (Mycocosm, Sacboulardii_1), as they caused the misassignment of Saccharomyces cerevisiae reads in our ZymoBIOMICS® mock and a previously sequenced mock community [ 22 ]. In addition, some genomes had to be discarded due to the presence of synonym species names or errors in the species names that prevented correct taxonomic annotation and conversion to the format required by Kraken2. Metagenomics data analysis As positive controls, we used raw metagenomics sequencing data from two mock communities: one comprising 18 bacterial strains, 2 fungal strains and 1 archaeal strain (ZymoBIOMICS® Gut Microbiome Standard, reference D6331, ZymoResearch), which was sequenced in this study (see below), and another one containing 44 fungal strains (SRX10705695), which was obtained from [ 23 ]. We also used a dataset comprising sequencing data from 1,212 stool samples from a previously published meta-analysis [ 9 ], including data from the PRJNA447983 [ 24 ], PRJEB27928 [ 25 ], PRJDB4176 [ 26 ] and PRJEB10878 [ 27 ] studies. Another dataset from this meta-analysis, PRJNA389927 [ 28 ], was not considered here due to its low sequencing quality and depth. We processed all the samples using the MeTAline (v0.8.0-alpha) pipeline ( https://zenodo.org/records/8221398 ), using both the standard Kraken2 database and our newly created B-GUT database for taxonomy assignment. Subsequently, we built a phyloseq (v. 1.38.0) [ 29 ] object linking the counts table and the metadata for each sample. Correlations between the taxonomy assignment results for the mocks and their expected theoretical composition were calculated with cor.test (Spearman method) from the stats R package (v. 4.1.2).
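As an illustration of the correlation step, the sketch below computes Spearman's rho between expected and observed relative abundances in plain Python. The analysis described above used cor.test in R; this version is only a dependency-free approximation of the same statistic (average ranks for ties, then a Pearson correlation of the ranks) and does not compute a p-value. The abundance values are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical relative abundances (%) for five mock community members.
expected = [14.0, 14.0, 6.0, 1.5, 0.1]
observed = [15.2, 12.9, 7.1, 0.9, 0.3]
rho = spearman_rho(expected, observed)
print(round(rho, 3))
```

Because the statistic depends only on ranks, it rewards recovering the correct ordering of taxa by abundance rather than the exact proportions, which suits mock communities whose members span several orders of magnitude.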
For the meta-analysis data, we removed prokaryotic taxa with fewer than 100 reads or present in less than 25% of the samples. For eukaryotes, where a lower prevalence is expected, we removed taxa with fewer than 50 reads or present in fewer than 5 samples. We performed differential abundance analysis with the linda [ 30 ] function from the MicrobiomeStat R package (v. 1.2), including as fixed effects the bioproject (to account for batch effect) and the disease status (restricted to CRC vs healthy controls). Significance was considered as an adjusted p-value lower than 0.05. To assess the relevance of differentially abundant taxa, we assessed previous associations with CRC using two biological knowledge databases: the human gut microbiome atlas [ 31 ] and Disbiome [ 32 ] (both databases last accessed on 16 August 2024). To assess cross-mapping of human reads, we simulated 10 million reads of length 150 from the T2T human genome using the wgsim programme (v. 0.3.1-r13), setting the rate of mutations, the base error rate and the fraction of indels to 0. Simulated reads were mapped to the T2T or GRCh38 reference genome with hisat2 (v. 2.2.1) and unmapped reads were extracted using samtools (v. 1.3.1). Illumina sequencing of a bacterial mock community The ZymoBIOMICS® Gut Microbiome Standard was completely thawed on ice and mixed thoroughly by vortex to ensure cells were evenly resuspended. Next, 75 µl were used for DNA extraction following a previously described phenol:chloroform-based DNA isolation method [ 33 ]. The mock sample was sequenced using a whole genome shotgun approach with 2 x 150 bp paired-end libraries and BGI technology. RESULTS Curation of a pan-fungal genome reference database uncovers high levels of bacterial contamination. To alleviate the under-representation of fungi in reference genome databases for k-mer based metagenomic analyses, we aimed at creating a curated broad fungal genome reference dataset to complement available prokaryotic datasets.
To this end, we compiled a non-redundant set (one representative genome per species) of 2,168 fungal genomes from major repositories, including Ensembl Fungi (141 genomes), FungiDb (64), Mycocosm (1,854) and Refseq (109). To assess the added value of the selected fungal genomes, and the potential presence of contaminant sequences in selected fungal genomes, we analyzed their sequences with kraken2 using the default standard complete database (Standard db from here on). Only 5 (7.8%), 18 (16.5%), 3 (2.13%) and 76 (4.07%) of the genomes from FungiDb, Refseq, Ensembl and Mycocosm, respectively, resulted in a unique assignment, and from these only 4 (6.25%), 13 (11.9%), 0 and 40 (2.14%), respectively, had unique assignments to the correct species. This result underscores the incompleteness of the default database and the added value of the compiled dataset. Importantly, our results also indicated a potentially high level of non-eukaryotic contamination, as 1738 (79.72%) of the assemblies had matches in one or more species outside Eukaryota, being Bacteria the most represented non-eukaryotic Kingdom (Supplementary material, Figure S1 ). This finding motivated us to develop a specific decontamination pipeline to identify and eliminate contaminated sequences, which combined kraken2 analysis with a deep learning sequence classifier and gc content assessment (see Materials and methods, Contamination assessment and curation of fungal genomes ). To validate this decontamination pipeline, we performed a detailed analysis of four FungiDB genomes from the same study [ 34 ], Amauroascus niger, Chrysosporium queenslandicum, Byssoonygena ceratinophila and Amauroascus mutatus , for which our pipeline detected a considerable percentage of bacterial contigs, as assessed by tiara (Supplementary material, Figure S2 C). 
We confirmed this by assessing other metrics: i) kraken2-inferred bacterial content represented, respectively, 52.54, 75.62, 47.85 and 74.83% of the sequences; ii) Bacterial BUSCO completeness was high 80.7, 29.8, 15.3 and 17.7%, respectively; and iii) GC content distributions were multimodal in the four species. Inspection of the kraken2 profiles uncovered Ramlibacter tataouinensis as the most represented bacterial taxon, accounting for 12.98%, 17.36%, 7.17% and 12.98% of the assigned reads, respectively, suggesting a potential main source of contamination in this study. As these strains originate from at least two different culture collections, contamination is likely to have appeared during preparation for sequencing. Additionally, we tested tiara, the tool used by our decontamination strategy to remove non-eukaryotic contigs, with a defined dataset of sequences from known species: four bacteria ( Helicobacter pylori, Mycobacterium tuberculosis, Bartonella bacilliformis and Xanthomonas euvesicatoria ), five fungi (Amniculicola lignicola, Absidia glauca, Cryptococcus flavescens, Claviceps sorghi and Epichloe glyceriae ), one insect ( Homalodisca vitripennis) and one virus ( Pseudomonas phage LUZ7 ). Our results on this custom dataset show a good performance of the tool with a single sequence, the viral Pseudomonas phage LUZ7 , being incorrectly classified as bacteria (Supplementary material, Table S1 ). Additionally we ran the whole pipeline on data from 14 selected genomes from model fungal organisms that can be assumed to be free of contamination because the organisms are available in axenic cultures and because the genomes have been extensively used (see Materials and methods, Contamination assessment and curation of fungal genomes ). Most contigs were classified as eukaryotic and the highest percentage of bacterial reads found was minimal (0.02% in Allomyces macrogynus , Supplementary material, Figure S3). 
From these results we concluded that our decontamination strategy was effective, and we therefore applied it to the whole fungal dataset, removing 58 genomes that did not pass our filters as well as some additional contaminated contigs. Additionally, we manually removed three known yeast hybrid species ( Saccharomyces kudriavzevii, Saccharomyces pastorianus and Saccharomyces boulardii ) that caused the misidentification of Saccharomyces cerevisiae reads (Supplementary material, Table S1 ). The final curated fungal dataset comprises curated genomes from 2,110 species. B-GUT outperforms the complete Standard db in the analysis of synthetic mock communities We created a broad gut microbiome reference database (B-GUT) by adding to the above-described curated fungal database existing curated reference genome datasets of gut-specific bacteria and archaea [ 11 , 13 ], and the telomere-to-telomere (T2T) reference human genome. To compare B-GUT with the widely used Kraken2 complete standard database (Standard db), we ran the same pipeline with each of the databases to analyze data from two mock microbial communities with known composition: a gut-specific microbiome mock including known quantities of cells from 15 prokaryotic species and 2 fungal species ( Candida albicans and Saccharomyces cerevisiae ), and a fungal-specific mock community comprising 44 fungal strains (39 different species). The use of B-GUT in the gut-specific mock resulted in a higher correlation between obtained and expected results (Rho: 0.85355 and p-value: 1.32E-05) as compared to the standard database (Rho: 0.574676 and p-value: 0.0158, Table S1 ). Importantly, two species, Veillonella rogosae and Prevotella corporis , were exclusively detected with B-GUT (Supplementary material, Table S2 ).
Additionally, B-GUT enhanced the detection of fungal species, as assessed on the fungal mock community, correctly identifying 28/44 strains as compared to 12/44 by the Standard db (Supplementary Figure S4). We inspected the 16 strains missed by B-GUT. Three strains from the same species ( Cryptococcus gattii VGI, Cryptococcus gattii VGII and Cryptococcus gattii VGIVa) were not included in B-GUT due to the above-mentioned selection of only one strain per species, but the species was correctly detected. Three additional genomes, Filobasidium magnum (former Cryptococcus magnus ), Diutina catenulata (former Candida catenulata ), and Diutina mesorugosa (former Candida mesorugosa ), were locally sequenced and assembled by the authors of that mock community, and were consequently not present in the databases from which we downloaded the included fungal genomes. Seven further species have sequences in NCBI Genbank, but not in NCBI Refseq. These cases are: Blastobotrys proliferans, Geotrichum fermentans , Kodamaea ohmeri, Meyerozyma caribbica, Pichia norvegensis , Scedosporium aurantiacum and Scedosporium boydii (former Pseudallescheria boydii ). In the case of Trichophyton rubrum , the species was included but the reads were assigned to another species from the same genus ( Cutaneotrichosporon dermatis (former Trichosporon dermatis )). Finally, Yarrowia lipolytica was not added to our database because it did not pass our decontamination filters. Upon closer inspection we noticed that this was caused by a file naming error in the source database (JGI-mycocosm, Yarrowia lipolytica CLIB122) that resulted in the download of the soft-masked version of the genome, which in turn prompted the label of “unknown” in the tiara classification. These results underscore the improved performance of B-GUT in identifying fungi at the species level and point to remaining gaps.
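The comparison against a mock community of known composition amounts to a set comparison between expected and detected taxa. The Python sketch below shows one way to tabulate it; the function name and the example species lists are hypothetical and do not reproduce the actual mock results. For reference, the detection rates reported above correspond to 28/44 ≈ 64% for B-GUT and 12/44 ≈ 27% for the Standard db.

```python
def detection_summary(expected, detected):
    """Compare detected taxa against the known composition of a mock community."""
    expected, detected = set(expected), set(detected)
    found = expected & detected
    return {
        "found": sorted(found),                      # true positives
        "missed": sorted(expected - detected),       # false negatives
        "unexpected": sorted(detected - expected),   # false positives
        "recall": len(found) / len(expected),
    }


# Hypothetical expected composition and classifier output.
expected = ["Candida albicans", "Saccharomyces cerevisiae",
            "Kodamaea ohmeri", "Pichia norvegensis"]
detected = ["Candida albicans", "Saccharomyces cerevisiae",
            "Saccharomyces pastorianus"]

summary = detection_summary(expected, detected)
print(summary["missed"])
print(summary["unexpected"])
print(summary["recall"])
```

Listing missed and unexpected taxa separately helps to tell apart the two failure modes discussed above: genomes absent from the database and reads assigned to a closely related reference.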
B-GUT improves the detection of potential gut microbiome biomarkers for colorectal cancer To further evaluate the performance of B-GUT and showcase its use on real case data, we re-analyzed a previously published meta-analysis study encompassing 1,329 fecal metagenomes and focused on the detection of biomarkers and potential associations of fungi with colorectal cancer (CRC) [ 9 ]. That study used a custom Kraken2 database including 9,543 bacteria and 909 fungi, which represents, to the best of our knowledge, the metagenomics study using a custom database with the broadest coverage of fungi. However, no curation for potential sequence contaminants was performed. In addition, that custom database is not publicly available. From that study, we selected data with paired-end sequence data and sufficient depth of coverage, re-analyzed the data using the MeTAline pipeline and either B-GUT or the Standard database, and performed differential abundance analysis between healthy and CRC samples (see Materials and Methods). We detected 297 differentially abundant prokaryotic taxa at the species level using B-GUT as compared to 445 using the Standard db (Fig. 2 ). Importantly, the two analyses overlapped in only 54 of the differentially abundant species. This result underscores the high impact of database choice. To assess the consistency of each set of differentially abundant species with previous knowledge on CRC associations, we mined information on these taxa in two biology knowledge databases [ 31 , 32 ]. We found a higher fraction of previous CRC associations in differentially abundant species detected with B-GUT (12.12%, 36/297) as compared to Standard db results (4.67%, 21/450). Although smaller, differences were also large when considering the genus level, 53/297 (17.95%) for B-GUT as compared to 50/450 (11.11%) for the Standard db.
These results suggest that the use of B-GUT improves the detection of meaningful differentially abundant taxa, particularly at the species level. Among the four most significant differentially abundant species, three were shared between the two databases: namely, the widely claimed CRC-associated species Parvimonas micra , Gemella morbillorum and Fusobacterium nucleatum [ 35 – 37 ]. Consistently, in the Lin et al. study [ 9 ], these three species are included among the most important bacterial features in the machine learning classifier. Importantly, however, the top differentially abundant species found in the B-GUT analysis was Peptostreptococcus stomatis , which has been previously linked to CRC [ 38 ]. This relevant species is absent from the Standard database and therefore was not identified either by the Lin et al. study or by our analysis when using the Standard db. This suggests that incompleteness in the Standard db can result in overlooking relevant biomarkers.
Figure 2: Above the volcano plots, a Venn diagram represents the intersection of taxa at the species level between the analyses using the standard database (orange) and the B-GUT database (green). Inside each circle of the Venn diagram, the percentage of taxa at the species level previously associated with colorectal cancer (CRC) is indicated.
As for eukaryotes, B-GUT detected 16 significantly differentially abundant fungal species (Supplementary material, Table S4, T2T-B-GUT) as compared to only 4 eukaryotic species, including 2 fungal species, with the standard database, with a single species in common between B-GUT results and the reference study [ 9 ] ( Rhizophagus irregularis ).
Differentially abundant fungal species detected with B-GUT included Saccharomyces cerevisiae and Kluyveromyces marxianus , both overrepresented in healthy individuals as compared to CRC individuals, which is in line with previous studies showing a protective role of these species for CRC, through induction of apoptosis, resulting in inhibition of metastasis, proliferation and growth of tumors [ 39 , 40 ]. Cross-mapping of human reads and assembly contamination results in over-detection of fungal species Given the reasonably high fungal coverage in the database used in the reference study [ 9 ], we were expecting a higher overlap with our analysis in terms of differentially abundant fungal species, at least with respect to their top features (e.g. Aspergillus species). Lack of overlap between the two studies could be partly attributable to methodological differences with our re-analysis, such as the use of different data processing and differential abundance pipelines, as well as incomplete overlap of the analyzed data. We focused on two key methodological differences resulting from i) decontamination of fungal genomes in B-GUT but not in the reference study and ii) removal of host reads with the T2T human reference in our analysis as compared to GRCh38 in the reference study. To assess the impact of these two features, we repeated the analysis using variations of B-GUT with or without decontamination (B-GUT vs BGUTc, respectively) and the standard or telomere-to-telomere human reference genome (GRCh38 vs T2T, respectively), in all combinations, namely GRCh38-BGUT, GRCh38-BGUTc, T2T-BGUT and T2T-BGUTc. As shown in Fig. 3 , fungal abundances (Fig. 3 A right) but not bacterial abundances (Fig. 3 A left) exhibit distinct density distributions across the different assembly-db combinations, indicating that the use of B-GUT without decontamination leads to higher estimated fungal abundances. This trend is also observed when comparing fungi/bacteria abundance ratios (Fig. 3 B).
The effect on fungal identification of using the T2T reference for host removal was comparatively smaller when the non-decontaminated database was used (GRCh38-B-GUTc vs T2T-B-GUTc, Wilcoxon p-value = 0.04992). When assessing per-sample differences in inferred fungal abundances across different combinations of host-depletion and reference database, we confirmed a consistent trend towards higher fungal abundance in all combinations as compared to T2T-B-GUT, with the use of B-GUTc (the non-decontaminated version of the database) causing the most acute differences (Fig. 3 C). These results, coupled with the results of our decontamination analysis explained above, indicate that the use of non-decontaminated fungal genomes as references leads to over-estimation of overall fungal content, likely resulting from mis-identification of bacterial reads as fungi. In addition, albeit with a comparatively smaller effect, the use of a more comprehensive host read depletion using the T2T reference assembly further reduced the identification of fungal reads in the context of the decontaminated version of the database. We hypothesized that cross-mapping of repetitive human sequences (included in T2T but generally lacking in the GRCh38 reference) to repetitive regions of fungal genomes may underlie this effect. To test this, we simulated reads from the T2T genome and mapped them to either the T2T or the GRCh38 reference. Whereas 100% mapping was achieved with T2T, we obtained an alignment rate of 96.8% on the GRCh38 reference, indicating incomplete sequence representation. We extracted the T2T reads that were unmapped to GRCh38 and analyzed them with Kraken2 using either B-GUT or the standard database. With B-GUT, 84.68% of these reads were classified as human, 4.17% as Fungi and 9.53% remained unclassified, suggesting that at least a fraction of T2T-exclusive sequences can be classified as fungi.
With the Standard database, which contains very few fungal genomes, only 0.09% of the reads were assigned to fungi, while 16.54% remained unclassified and 60.07% were classified as human. Surprisingly, a significant percentage of these reads were assigned to the eukaryotic group Sar; specifically, 7.16% were assigned to Toxoplasma gondii ME49 (Fig. 4 ). All in all, these results reinforce the need to use the T2T reference to deplete human reads in order to avoid potential false positives. We next examined the impact of different assembly-db combinations on the downstream analysis of differentially abundant fungal species when comparing CRC and healthy samples. We obtained 16 differentially abundant fungal species with T2T-B-GUT, 18 with GRCh38-B-GUT, 5 with T2T-B-GUTc and 7 with GRCh38-B-GUTc (Supplementary material Table S4). Hence, although the use of the decontaminated database reduces the overall estimation of fungal reads, it results in the detection of more differentially abundant species, likely due to less dispersed and noisy mapping. Comparatively, the use of a more comprehensive host depletion with T2T had a minor effect, with Byssothecium circinans, Friedmanniomyces simplex, Heterodoassansia hygrophilae and Cerrena unicolor uniquely detected when using GRCh38. Notably, we only detected Aspergillus species, in line with the reference study, when using the contaminated version of B-GUT: Aspergillus keveii and Aspergillus austroafricanus (in GRCh38-B-GUTc), and Aspergillus austroafricanus (in T2T-B-GUTc). DISCUSSION Accurate taxonomic assignment is crucial to deriving meaningful insights from metagenomic datasets. This step critically depends on the comprehensiveness and quality of the reference genome database used for sequence comparison.
While prokaryotic microorganisms, which dominate the biomass in many ecosystems [ 41 ], have been the primary focus of metagenomic studies, eukaryotic microorganisms, such as fungi, can exert significant roles within microbial communities despite their often lower abundance. Consequently, interest in the study of eukaryotes within the microbiome has grown in recent years. However, the limited representation of eukaryotic genomes in current reference databases hinders their investigation and restricts the discovery of potentially important ecological associations. To address this and other limitations, we have developed the Broad Gut Microbiome Database (B-GUT). This database offers three key advancements over the standard complete database provided by Kraken2: (i) it features a curated and human-gut-specific collection of prokaryotic genomes, (ii) it incorporates over 2,000 representative and contamination-screened fungal genomes, and (iii) it includes the T2T human reference genome to enhance the removal of repetitive host-derived sequences. A major finding of our study was the uncovering of a significant amount of bacterial sequences in publicly available fungal genomes, underscoring the need for decontamination. A key aspect of our work is the development and validation of a powerful decontamination pipeline. By integrating k-mer mapping, GC content analysis, and machine learning, this pipeline effectively identifies and removes contaminating contigs from assemblies. Application of this pipeline revealed various levels of contamination in publicly available datasets, highlighting potential sources of contamination in some cases. This decontamination tool could have future alternative uses including the sanity-check for newly assembled fungal genomes or the pre-filtering of fungal genomes before comparative genomics analyses.
The newly developed B-GUT databases offered improved results over the standard database, as assessed on two independent mock communities, particularly for fungi but also for prokaryotic identification. Still, some organisms of the mock communities remained unidentified or misidentified, highlighting the need for future improvement. The lack of publicly available genome sequences, or their availability in different source databases, were the main reasons identified for our false negatives.

Additionally, we have showcased the use of B-GUT in a real-case scenario by reanalysing a published meta-analysis focused on the identification of CRC biomarkers. Our results are consistent with a high impact of the reference database used in metagenomics analysis [42], and underscore the importance of using niche-specific databases, given that the use of B-GUT, as compared to the standard database, produced results of greater biological significance based on prior knowledge. The differences affected not only the overall results but also the top-ranking species with regard to associations with the trait of interest (CRC in this case). Notably, the use of B-GUT uniquely identified Peptostreptococcus stomatis as a differentially abundant taxon, which was undetected in the original study. P. stomatis has been identified as a CRC biomarker in multiple studies [35, 43–45]. However, it is not included in the standard Kraken2 database and is therefore missed by studies using this database. Thus, the incorporation of P. stomatis as a feature in the machine learning models of [8] could potentially enhance their predictive accuracy. With respect to the identification of fungal microorganisms, the use of B-GUT led to the detection of a higher number of differentially abundant fungi compared to the standard database.
The detected fungal species had higher biological relevance, according to the previous literature, than those identified using the standard database or in the reference study [8], which also used an extensive, albeit not curated, fungal database. Of note, our study uniquely detected two species previously found to be associated with CRC: Saccharomyces cerevisiae and Kluyveromyces marxianus. S. cerevisiae, which is an important component of the gut microbiome, was previously reported as depleted in CRC, as in our study [46], and has been shown to have protective effects against CRC [47, 48]. K. marxianus has been used as a probiotic and has also been shown to have potentially beneficial roles in CRC [40, 49].

Our results directly comparing the decontaminated and non-decontaminated versions of B-GUT clearly showed that lack of decontamination leads to an overestimation of fungal content, likely resulting from bacterial reads mapping to contaminated contigs in fungal reference genomes. Of note, many of the differences between our analysis and the reference study [8] concerned the identification of Aspergillus species as differentially abundant only in the latter. Genomes from this genus were not flagged as potentially contaminated in our curation of the database, so we conclude that the assignments are likely real. However, the presence of contaminants in other fungal species may have influenced the overall composition and thus the differential abundance results. This effect suggests that certain contaminants, even if limited in number, may skew normalized abundance comparisons and impact statistical significance. Further research is required to clarify these indirect effects.

Another innovation of our approach was the use of the T2T human reference genome assembly to remove host reads, which has previously been shown to improve metagenomics analysis [50].
Our results show an overall significant impact of this addition in terms of fungal assignments when the decontaminated database was used, suggesting that the impact of contamination overrides that of the host removal approach. However, cross-mapping analyses showed that human reads removed uniquely by T2T do cross-map to other species, including fungi and protists such as Toxoplasma gondii, confirming the relevance of this host-removal approach for avoiding false positives.

Our study has some limitations. Our decontamination pipeline applies a strict filter that also removes contigs classified as unknown, and it might therefore discard genuine fungal contigs, as may have happened in the case of Yarrowia lipolytica. The same behaviour was observed for eight genomes from the same database (JGI, Mycocosm) that were completely classified as unknown by tiara because of the presence of soft-masking: Arthroderma benhamiae, Aspergillus clavatus, Aspergillus nidulans, Cladosporium sphaerospermum, Kluyveromyces lactis, Neosartorya fischeri, Podospora anserina and Trichophyton verrucosum. Of note, the download of a soft-masked version resulted from a file naming error in the source database, which attests to the importance of database curation. Nevertheless, excluding these genomes is not critical for our purpose of having a broad fungal database to study the gut microbiome, because they are not gut related and other species represent the corresponding genera. Furthermore, including the contigs classified as unknown might improve the representation of fungi, but it also risks increasing the number of false positives. Although B-GUT has been specifically built for the Kraken2 tool, the input and the decontamination tool can be used to build any k-mer based database index (e.g., for Centrifuge [51], CLARK [52], k-SLAM [53], etc.).
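The soft-masking issue mentioned above could be caught before classification with a simple check on the fraction of lowercase bases per sequence. The following is a minimal sketch under stated assumptions, not part of the published pipeline: the function names and the 0.5 default threshold are hypothetical, and the parser only handles plain FASTA text supplied as an iterable of lines.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def softmasked_fraction(sequence):
    """Return the fraction of alphabetic characters that are lowercase."""
    letters = [ch for ch in sequence if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.islower() for ch in letters) / len(letters)


def flag_softmasked(lines, threshold=0.5):
    """List headers of sequences whose lowercase fraction reaches the threshold.

    The default threshold is an arbitrary illustrative value.
    """
    return [
        header
        for header, seq in parse_fasta(lines)
        if softmasked_fraction(seq) >= threshold
    ]
```

Sequences flagged this way could be converted to uppercase (for example with `str.upper()`) or replaced by an unmasked download before being passed to a classifier; which option is appropriate depends on the source database.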
CONCLUSIONS

All in all, our findings emphasize the importance of using well-curated, niche-specific databases for taxonomic assignment in metagenomic studies. By carefully evaluating and refining the genomes included in these databases, we can enhance the accuracy and biological relevance of the results, particularly in the identification of disease-associated biomarkers.

Declarations

AUTHORS' CONTRIBUTIONS

TG conceptualized and supervised the study. OKL performed all the analyses. All the authors designed the analyses, discussed the results and wrote the manuscript.

FUNDING

TG group acknowledges support from the Spanish Ministry of Science and Innovation (grant numbers PID2021-126067NB-I00, CPP2021-008552, PCI2022-135066-2, PLEC2023-010225, and PDC2022-133266-I00), co-funded by ERDF “A way of making Europe”, as well as support from the Catalan Research Agency (AGAUR) (grant number SGR01551); Gordon and Betty Moore Foundation (grant number GBMF9742); “La Caixa” foundation (grant number LCF/PR/HR21/00737), Fundació La Marató de TV3 (202328-31), AECC (PRYGN234923GABA), and Instituto de Salud Carlos III (IMPACT grant IMP/00019 and CIBERINFEC CB21/13/00061- ISCIII-SGEFI/ERDF). OKL is supported by the Formación de profesorado universitario (FPU) program from the Spanish Ministerio de Universidades (FPU2020-02907).

Availability of data and materials

The code of the preprocessing and decontamination pipelines is available in our GitHub repositories: https://github.com/Gabaldonlab/meTAline and https://github.com/Gabaldonlab/B-GUT-decontamination

The newly sequenced mock community is available at PRJNA1266093. This will be available upon publication, but a reviewer’s link has been created: https://dataview.ncbi.nlm.nih.gov/object/PRJNA1266093?reviewer=kgb8ai7pg6emauesah1fm1ov1g

The fungal mock community was downloaded from SRX10705695 [23].
Metagenomics data for the meta-analysis were downloaded from the corresponding bioprojects: PRJNA447983 [24], PRJEB27928 [25], PRJDB4176 [26] and PRJEB10878 [27].

The B-GUT database can be downloaded from the PhylomeDB FTP server by first connecting to it with the command “ftp phylomedb.org”. The server will ask for a user name, which is “anonymous”, and for a password, which should be left empty (just press the Enter key). Once logged in, move to the B-GUT folder with the command “cd B-GUT” to see the corresponding database files.

Ethics approval and consent to participate

Not applicable

Consent for publication

Not applicable

Competing interests

Not applicable

References

1. Stavrou AA. Misidentification of Genome Assemblies in Public Databases: the Case of Naumovozyma Dairenensis and Proposal of a Protocol to Correct Misidentifications. 2017.
2. Lupo V, Van Vlierberghe M, Vanderschuren H, Kerff F, Baurain D, Cornet L. Contamination in Reference Sequence Databases: Time for Divide-and-Rule Tactics. Front Microbiol. 2021;12:755101.
3. Almeida A, Nayfach S, Boland M, Strozzi F, Beracochea M, Shi ZJ, et al. A unified catalog of 204,938 reference genomes from the human gut microbiome. Nat Biotechnol. 2021;39:105–14.
4. Li W, Liang H, Lin X, Hu T, Wu Z, He W, et al. A catalog of bacterial reference genomes from cultivated human oral bacteria. npj Biofilms and Microbiomes. 2023;9:1–13.
5. Chen T, Yu W-H, Izard J, Baranova OV, Lakshmanan A, Dewhirst FE. The Human Oral Microbiome Database: a web accessible resource for investigating oral microbe taxonomic and genomic information. Database: The Journal of Biological Databases and Curation. 2010;2010:baq013.
6. Belvoncikova P, Splichalova P, Videnska P, Gardlik R. The Human Mycobiome: Colonization, Composition and the Role in Health and Disease. J Fungi (Basel). 2022;8.
7. Bahram M, Netherway T. Fungi as mediators linking organisms and ecosystems. FEMS Microbiol Rev. 2022;46.
8. Gao R, Xia K, Wu M, Zhong H, Sun J, Zhu Y, et al. Alterations of Gut Mycobiota Profiles in Adenoma and Colorectal Cancer. Front Cell Infect Microbiol. 2022;12:839435.
9. Lin Y, Lau HC-H, Liu Y, Kang X, Wang Y, Ting NL-N, et al. Altered Mycobiota Signatures and Enriched Pathogenic Aspergillus rambellii Are Associated With Colorectal Cancer Based on Multicohort Fecal Metagenomic Analyses. Gastroenterology. 2022;163:908–21.
10. Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol. 2019;20:1–13.
11. Hiseni P, Rudi K, Wilson RC, Hegge FT, Snipen L. HumGut: a comprehensive human gut prokaryotic genomes collection filtered by metagenome data. Microbiome. 2021;9:165.
12. O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44:D733–45.
13. Chibani CM, Mahnert A, Borrel G, Almeida A, Werner A, Brugère J-F, et al. A catalogue of 1,167 genomes from the human gut archaeome. Nat Microbiol. 2022;7:48–61.
14. Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, et al. The complete sequence of a human genome. Science. 2022;376:44–53.
15. Ensembl Fungi. https://fungi.ensembl.org/index.html. Accessed 3 Jul 2024.
16. FungiDB. https://fungidb.org/. Accessed 3 Jul 2024.
17. Mycocosm. https://mycocosm.jgi.doe.gov/mycocosm/home. Accessed 3 Jul 2024.
18. RefSeq: NCBI Reference Sequence Database. https://www.ncbi.nlm.nih.gov/refseq/. Accessed 3 Jul 2024.
19. Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Mol Biol Evol. 2021;38:4647–54.
20. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29:1072–5.
21. Karlicki M, Antonowicz S, Karnkowska A. Tiara: deep learning-based classification system for eukaryotic sequences. Bioinformatics. 2022;38:344–50.
22. Yang F, Sun J, Luo H, Ren H, Zhou H, Lin Y, et al. Assessment of fecal DNA extraction protocols for metagenomic studies. GigaScience. 2020;9:giaa071.
23. Hu Y, Irinyi L, Hoang MTV, Eenjes T, Graetz A, Stone EA, et al. Inferring Species Compositions of Complex Fungal Communities from Long- and Short-Read Sequence Data. MBio. 2022;13:e0244421.
24. Thomas AM, Manghi P, Asnicar F, Pasolli E, Armanini F, Zolfo M, et al. Metagenomic analysis of colorectal cancer datasets identifies cross-cohort microbial diagnostic signatures and a link with choline degradation. Nat Med. 2019;25:667–78.
25. Wirbel J, Pyl PT, Kartal E, Zych K, Kashani A, Milanese A, et al. Meta-analysis of fecal metagenomes reveals global microbial signatures that are specific for colorectal cancer. Nat Med. 2019;25:679–89.
26. Yachida S, Mizutani S, Shiroma H, Shiba S, Nakajima T, Sakamoto T, et al. Metagenomic and metabolomic analyses reveal distinct stage-specific phenotypes of the gut microbiota in colorectal cancer. Nat Med. 2019;25:968–76.
27. Yu J, Feng Q, Wong SH, Zhang D, Liang QY, Qin Y, et al. Metagenomic analysis of faecal microbiome as a tool towards targeted non-invasive biomarkers for colorectal cancer. Gut. 2017;66:70–8.
28. Hannigan GD, Duhaime MB, Ruffin MT 4th, Koumpouras CC, Schloss PD. Diagnostic Potential and Interactive Dynamics of the Colorectal Cancer Virome. MBio. 2018;9.
29. McMurdie PJ, Holmes S. phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLoS One. 2013;8:e61217.
30. Zhou H, He K, Chen J, Zhang X. LinDA: linear models for differential abundance analysis of microbiome compositional data. Genome Biol. 2022;23:1–23.
31. Lee S, Portlock T, Le Chatelier E, Garcia-Guevara F, Clasen F, Oñate FP, et al. Global compositional and functional states of the human gut microbiome in health and disease. Genome Res. 2024;34:967–78.
32. Janssens Y, Nielandt J, Bronselaer A, Debunne N, Verbeke F, Wynendaele E, et al. Disbiome database: linking the microbiome to disease. BMC Microbiol. 2018;18:50.
33. Vesty A, Biswas K, Taylor MW, Gear K, Douglas RG. Evaluating the Impact of DNA Extraction Method on the Representation of Human Oral Bacterial and Fungal Communities. PLoS One. 2017;12:e0169877.
34. Whiston E, Taylor JW. Comparative Phylogenomics of Pathogenic and Nonpathogenic Species. G3 Genes|Genomes|Genetics. 2016;6:235–44.
35. Osman MA, Neoh H-M, Ab Mutalib N-S, Chin S-F, Mazlan L, Raja Ali RA, et al. Parvimonas micra, Peptostreptococcus stomatis, Fusobacterium nucleatum and Akkermansia muciniphila as a four-bacteria biomarker panel of colorectal cancer. Sci Rep. 2021;11:1–12.
36. Conde-Pérez K, Aja-Macaya P, Buetas E, Trigo-Tasende N, Nasser-Ali M, Rumbo-Feal S, et al. The multispecies microbial cluster of Fusobacterium, Parvimonas, Bacteroides and Faecalibacterium as a precision biomarker for colorectal cancer diagnosis. Mol Oncol. 2024;18:1093–122.
37. Senthakumaran T, Tannæs TM, Moen AEF, Brackmann SA, Jahanlu D, Rounge TB, et al. Detection of colorectal-cancer-associated bacterial taxa in fecal samples using next-generation sequencing and 19 newly established qPCR assays. Mol Oncol. 2024. https://doi.org/10.1002/1878-0261.13700.
38. Huang P, Ji F, Cheung AH-K, Fu K, Zhou Q, Ding X, et al. Peptostreptococcus stomatis promotes colonic tumorigenesis and receptor tyrosine kinase inhibitor resistance by activating ERBB2-MAPK. Cell Host Microbe. 2024;32:1365–79.e10.
39. Sambrani R, Abdolalizadeh J, Kohan L, Jafari B. Saccharomyces cerevisiae inhibits growth and metastasis and stimulates apoptosis in HT-29 colorectal cancer cell line. Comparative Clinical Pathology. 2018;28:985–95.
40. Fortin O, Aguilar-Uscanga B, Vu KD, Salmieri S, Lacroix M. Cancer Chemopreventive, Antiproliferative, and Superoxide Anion Scavenging Properties of Kluyveromyces marxianus and Saccharomyces cerevisiae var. boulardii Cell Wall Components. Nutr Cancer. 2018;70:83–96.
41. Bar-On YM, Phillips R, Milo R. The biomass distribution on Earth. Proc Natl Acad Sci U S A. 2018;115:6506–11.
42. Smith RH, Glendinning L, Walker AW, Watson M. Investigating the impact of database choice on the accuracy of metagenomic read classification for the rumen microbiome. Anim Microbiome. 2022;4:57.
43. Shen X, Li J, Li J, Zhang Y, Li X, Cui Y, et al. Fecal -- Biomarker for Noninvasive Diagnosis and Prognosis of Colorectal Laterally Spreading Tumor. Front Oncol. 2021;11:661048.
44. Dai W, Li C, Li T, Hu J, Zhang H. Super-taxon in human microbiome are identified to be associated with colorectal cancer. BMC Bioinformatics. 2022;23:1–18.
45. Avuthu N, Guda C. Meta-Analysis of Altered Gut Microbiota Reveals Microbial and Metabolic Biomarkers for Colorectal Cancer. Microbiol Spectr. 2022;10:e0001322.
46. Coker OO, Nakatsu G, Dai RZ, Wu WKK, Wong SH, Ng SC, et al. Enteric fungal microbiota dysbiosis and ecological alterations in colorectal cancer. Gut. 2019;68:654–62.
47. Li JQ, Li JL, Xie YH, Wang Y, Shen XN, Qian Y, et al. Saccharomyces cerevisiae may serve as a probiotic in colorectal cancer by promoting cancer cell apoptosis. J Dig Dis. 2020;21:571–82.
48. Wang M, Gao C, Lessing DJ, Chu W. Saccharomyces cerevisiae SC-2201 Attenuates AOM/DSS-Induced Colorectal Cancer by Modulating the Gut Microbiome and Blocking Proinflammatory Mediators. Probiotics Antimicrob Proteins. 2024. https://doi.org/10.1007/s12602-024-10228-0.
49. Nag D, Goel A, Padwad Y, Singh D. In Vitro Characterisation Revealed Himalayan Dairy Kluyveromyces marxianus PCH397 as Potential Probiotic with Therapeutic Properties. Probiotics Antimicrob Proteins. 2023;15:761–73.
50. Wang L, Xing G. Telomere-to-Telomere Assembly Improves Host Reads Removal in Metagenomic High-Throughput Sequencing of Human Samples. bioRxiv. 2023;:2023.05.05.539517.
51. Kim D, Song L, Breitwieser FP, Salzberg SL. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 2016;26:1721–9.
52. Ounit R, Wanamaker S, Close TJ, Lonardi S. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics. 2015;16:236.
53. Ainsworth D, Sternberg MJE, Raczy C, Butcher SA. k-SLAM: accurate and ultra-fast taxonomic classification and gene identification for large metagenomic data sets. Nucleic Acids Res. 2017;45:1649–56.

Additional Declarations

No competing interests reported.

Supplementary Files

SupplementarymaterialBGUT.pdf
SupplementarytablesBGUT.xlsx
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-6766778","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Method Article","associatedPublications":[],"authors":[{"id":472146593,"identity":"68cc6f71-ff81-4691-8bdb-e4305fdd05cf","order_by":0,"name":"Olfat Khannous-Lleiffe","email":"","orcid":"","institution":"Barcelona Supercomputing Centre (BSC-CNS)","correspondingAuthor":false,"prefix":"","firstName":"Olfat","middleName":"","lastName":"Khannous-Lleiffe","suffix":""},{"id":472146595,"identity":"db635587-72fa-4fef-85e0-18398bd2244a","order_by":1,"name":"Toni Gabaldón","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA4klEQVRIiWNgGAWjYJCCA0CcwMDOYMDAUHGAsHIeuBZmkJYzRGphgGthbCNCiz1778EDDL/s8vibmTc+Lpx3x56/gfnwB7y28JxLOMDYl1wscZit2HjmtmeJMw6wpUng1SKRY3CAsYc5seEwj5k077bDCQYMPGb4/QLRUp84H6xlzmF7Awb+z/gdBtLC8ONw4gawlobDjBuAQYLfYWfOGBxIbDhebAjyC88xoF8Os5nh1cLe3mP84cOf6jy5480bH/PUAEOsvfkxXoeBQWIbMo+ZoHoQ+EOUqlEwCkbBKBipAADOtUpkaeNIeQAAAABJRU5ErkJggg==","orcid":"","institution":"Barcelona Supercomputing Centre (BSC-CNS)","correspondingAuthor":true,"prefix":"","firstName":"Toni","middleName":"","lastName":"Gabaldón","suffix":""}],"badges":[],"createdAt":"2025-05-28 09:53:20","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-6766778/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-6766778/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":85496767,"identity":"14f748a4-abfc-48c9-9363-bfb403f0c968","added_by":"auto","created_at":"2025-06-26 
13:57:34","extension":"jpeg","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":150843,"visible":true,"origin":"","legend":"\u003cp\u003eSchematic representation of the B-GUT database. Pink databases correspond to prokaryotic big catalogues. Orange to the telomere to telomere version of the human genome and the blue database to the new in-house created broad fungal database including genomes from Ensembl Fungi, Refseq from NCBI, FungiDb and Mycocosm. Numbers in parentheses indicate the number of included genomes from each database.\u003c/p\u003e","description":"","filename":"floatimage1.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-6766778/v1/ca90761bbe4116ac20eef0f3.jpeg"},{"id":85496766,"identity":"4619ae71-c833-4dee-b53d-4048e7324c15","added_by":"auto","created_at":"2025-06-26 13:57:34","extension":"jpeg","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":151544,"visible":true,"origin":"","legend":"\u003cp\u003eDifferential abundant results. Volcano plot representing the results obtained in the differential analysis comparing healthy and CRC individuals. In the y axis is depicted the -log10 from the adjusted p-value (the higher the transformed adjusted p-value, the more significant the differential feature). In the x-axis is represented the Log2FoldChange (A Negative value indicates an overrepresentation of the feature in CRC). The cut-offs used for the representation: p.adjusted = 0.0000000001 and LFC cut = 1.5. 
Highlighted in a red square the intersection in the top 4 most significantly differentially abundant taxa: 33033: \u003cem\u003eParvimonas micra\u003c/em\u003e, 29391: \u003cem\u003eGemella mornillorum\u003c/em\u003e, 851: \u003cem\u003eFusobacterium nucleatum.\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003eAbove the volcano plots, a Venn diagram representing the intersection of taxa at species level among the analysis using both the standard database (Orange) and the B-GUT database (Green). Inside each circle of the Venn diagram is indicated the % of taxa at species level previously associated with colorectal cancer(CRC).\u003c/p\u003e","description":"","filename":"floatimage2.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-6766778/v1/a3cb694273108e6ea709acf2.jpeg"},{"id":85496775,"identity":"3a9153e5-ba8f-4525-b126-1ec4fa310d86","added_by":"auto","created_at":"2025-06-26 13:57:34","extension":"jpeg","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":562660,"visible":true,"origin":"","legend":"\u003cp\u003eA) Density plots depicting distributions of Bacterial and Fungal reads (log10) according to each assembly-db strategy. B) Violin plots representing the log10 of the ratio between Fungi and Bacteria according to each assembly-db approach. Wilcoxon test p-values comparing the groups are represented. 
C) Distribution of values corresponding to per-sample differences in the amount of detected fungal reads (log10) when different host depletion approaches and reference databases are used.\u003c/p\u003e","description":"","filename":"floatimage3.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-6766778/v1/9750acf2d92e6b11c1cad78c.jpeg"},{"id":85496778,"identity":"fddb8cd2-492e-4676-bae2-9c76afe973d3","added_by":"auto","created_at":"2025-06-26 13:57:34","extension":"jpeg","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":489814,"visible":true,"origin":"","legend":"\u003cp\u003ePlot representing the taxonomic assignment of the simulated T2T reads that do not map to GhC38. At the top of the graph we have sankey plots, in which we represent kraken2 assignment of unmapped reads not classified as Human or unclassified: on the left it is shown the assignment using the Standard database, and on the right the B-GUT database. Number of reads assigned to each level is indicated above each name (e.g 45745 reads assigned to \u003cem\u003eToxoplasma gondii\u003c/em\u003e are indicated with 45.7k). 
Percentages representing human and unclassified reads are shown on the stacked barplot below.\u003c/p\u003e","description":"","filename":"floatimage4.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-6766778/v1/693c1a8633bc23ff6a059272.jpeg"},{"id":85653598,"identity":"5ad67829-ab64-405c-964a-63cf25daecbe","added_by":"auto","created_at":"2025-06-30 09:54:08","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":2092409,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6766778/v1/4c5ed104-217d-4e67-9b48-3ced6004e597.pdf"},{"id":85496769,"identity":"9ffd6a2a-05e9-466c-941d-4e7a438cbd9f","added_by":"auto","created_at":"2025-06-26 13:57:34","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":550337,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementarymaterialBGUT.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6766778/v1/bce736a8d8b7e15f55bb5880.pdf"},{"id":85498693,"identity":"acb06afb-430d-4c54-9122-5c95d7c7e500","added_by":"auto","created_at":"2025-06-26 14:13:34","extension":"xlsx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":14750,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementarytablesBGUT.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-6766778/v1/991de1f4e2f1a10d8579b0d2.xlsx"}],"financialInterests":"No competing interests reported.","formattedTitle":"B-GUT reference genome database improves biomarker discovery and fungal identification in gut metagenomes","fulltext":[{"header":"INTRODUCTION","content":"\u003cp\u003eMicrobiome research has revolutionized our understanding of complex microbial communities and their interactions, impacting diverse fields such as human health and environmental research. 
Microbiome analysis relies on cutting-edge sequencing technologies, with whole-genome shotgun sequencing (i.e. metagenomics) emerging as the approach of choice when a highly-resolved taxonomic information or functional insights are needed [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. A crucial step in metagenomics is taxonomic assignment, the process of inferring the most likely taxonomic affiliation for each sequencing read, which allows determining the presence and abundance of different taxa within a sample. One approach for taxonomic assignment relies on comparing the sequence of reads to those of reference genomes compiled in a database. Importantly, the accuracy and resolution of the taxonomic assignment are inherently dependent on the quality and comprehensiveness of the genome database used for the assignment. However, while much focus is placed on the algorithms to perform these comparisons, standard reference databases often fall short of providing a complete and accurate representation of microbial communities, due to poor representation of certain groups, or to the presence of mis-annotated or contaminated reference genomes [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e, \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. Additionally, recent research has shown that the use of so-called niche-specific databases -i.e. databases containing only genomes from species or strains isolated in a given niche, like the human gut - can reduce ambiguity in taxonomic assignments as compared to non-specific databases by exploiting prior knowledge of taxa previously identified in a given niche [\u003cspan additionalcitationids=\"CR4\" citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]. 
Finally, existing databases are biased towards prokaryotes and there is a need to include genomic information from microbial eukaryotes. In this regard fungi are ubiquitous components of the human microbiome, where they play key roles despite their generally low relative abundance [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e, \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e]. For instance, fungi inhabiting the gut have been shown to have important impacts in the pathogenesis in compromised individuals, as a cofactor in inflammatory disease, and in the development of colorectal cancer (CRC) [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e, \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eOne of the most widely used tools for taxonomy assignment in metagenomics is Kraken2 [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e], which is a classification system based on matching k-mers (short subsequences) from sequencing reads to a reference genome database. Some pre-built kraken2 reference genome databases are available in an amazon web service linked to the developer\u0026rsquo;s website. These references are compiled from publicly available complete genomes from Refseq NCBI, and are therefore general in scope and containing a very limited number of fungal species, as compared to bacterial genomes. Nevertheless, the tool allows users to use alternative custom-made databases. A prime example of a custom-made database for kraken2 is Humgut, a genome catalogue of human gut-specific prokaryotes [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. Using this opportunity we set out to address some of the gaps mentioned above, with a particular focus on addressing the poor representation of fungal genomes in standard databases. 
For this we first compiled and curated a comprehensive dataset representing the broad diversity of available fungal genomes, which we combined with the above-mentioned Humgut catalogue. Finally we included the recently available telomere-to-telomere assembly of the human genome. The resulting broad gut microbiome database (B-GUT), was tested on data from defined microbial communities and on publicly available shotgun metagenomics data from fecal samples. Our results identified contamination and cross-mapping issues in publicly available fungal genomes, highlighting the need for data curation. In addition, we demonstrated the importance of using a telomere-to-telomere genome for efficiently removing host reads and minimizing wrong taxonomic assignments. Finally, we showed that the use of B-GUT as compared to the standard database provides a more precise characterization of metagenomics samples, enhancing biomarker discovery.\u003c/p\u003e"},{"header":"METHODS","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eGenome datasets\u003c/h2\u003e \u003cp\u003eWe created a custom reference database for the Kraken2 assignment tool, comprising a total of 33,099 genomes. This database (B-GUT) builds on a previously curated prokaryotic genome database, comprising two large gut-specific genome catalogues: Humgut database [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e] (which comprises data from the Unified Human Gastrointestinal Genomes catalog [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e] and Refseq NCBI [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]), and the ArcheomeDb [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]. 
To this, we added the latest human reference genome (telomere-to-telomere, T2T) [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e] and a newly generated fungal genome dataset integrating curated (see below) and non-redundant data from Ensembl Fungi [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e], FungiDB [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e], Mycocosm [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e], and NCBI RefSeq [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e], all last accessed in July 2022.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eContamination assessment and curation of fungal genomes\u003c/h3\u003e\n\u003cp\u003ePotential contamination in fungal genomes was initially assessed by running taxonomic assignment using the standard Kraken2 database [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e]. Additional analyses of specific genomes were performed using BUSCO (v. 5.2.2, database: bacteria_odb10) to calculate bacterial completeness among other metrics [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e], QUAST (v. 5.0.2) to calculate GC content [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e], as well as tiara (v. 
1.0.3) [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e], a deep learning tool, to assess the percentage of bacterial hits.\u003c/p\u003e \u003cp\u003eA decontamination pipeline (B-GUT decontamination, available in our GitHub repository \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/Gabaldonlab/B-GUT-decontamination\u003c/span\u003e\u003cspan address=\"https://github.com/Gabaldonlab/B-GUT-decontamination\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003e)\u003c/span\u003e was built to curate the fungal genomes in an automated manner. The pipeline first removed contigs smaller than 3 kb and those classified as bacteria, archaea, prokaryote or unknown by tiara. After this contig-level filtering, genomes were flagged as potentially contaminated, and removed, if they fulfilled all of the following criteria: (i) they contained a proportion of bacterial sequences of 5% or higher (as assessed by Kraken2); (ii) the most abundant bacterial species represented more than 5% of the bacterial sequences (as contaminations are expected to have a dominant species); and (iii) their GC content distribution was multimodal. For the latter criterion, we calculated GC content using the SeqKit tool (v. 2.3.0) and tested the statistical significance of the multimodality with the function dip.test from the R package diptest (v. 
0.77-1).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe contig removal step (tiara) in the decontamination strategy was evaluated with a custom fasta file including genomes from 11 species in which we do not expect contaminations: 4 bacteria \u003cem\u003e(Helicobacter pylori (NZ_CP071982.1), Mycobacterium tuberculosis (NCBI, NC_000962.3), Bartonella bacilliformis (NCBI, NZ_CP045671.1)\u003c/em\u003e and \u003cem\u003eXanthomonas euvesicatoria (NCBI, NZ_CP072268.1))\u003c/em\u003e, 5 fungi (\u003cem\u003eAmniculicola lignicola (Mycocosm, Amnli1), Absidia glauca (Ensembl, GCA_900079185.1), Cryptococcus flavescens (NCBI, CAUG01000001.1), Claviceps sorghi (NCBI, SRPV01000001.1)\u003c/em\u003e and \u003cem\u003eEpichloe glyceriae (FungiDB, GCA_000225285.2)\u003c/em\u003e), 1 insect (\u003cem\u003eHomalodisca vitripennis (NCBI, NC_060207.1)\u003c/em\u003e) and 1 virus (\u003cem\u003ePseudomonas phage LUZ7 (NCBI, NC_013691.1)\u003c/em\u003e) as a positive control. We also used genomes from 14 fungal model organisms (\u003cem\u003eCandida albicans (Mycocosm, Canalb1), Candida glabrata (FungiDB, GCA_000002545.2), Aspergillus fumigatus (Mycocosm, Aspfu_A1163_1), Saccharomyces cerevisiae (Mycocosm, SacceM3707_1), Neurospora crassa (Mycocosm, Neucr_trp3_1), Fusarium oxysporum (Mycocosm, FoxF1003_1), Cryptococcus neoformans (Mycocosm, Cryne_JEC21_1), Penicillium digitatum (Mycocosm, Pendi1), Ustilago maydis (Mycocosm, Ustma2_2), Yarrowia lipolytica (NCBI, GCA_001761485.1), Allomyces macrogynus (Mycocosm, Allma1), Spizellomyces punctatus (Mycocosm, Spipu1), Schizosaccharomyces pombe (Mycocosm, Schpo1) and Laccaria bicolor (Mycocosm, Lacbi81306_1)\u003c/em\u003e).\u003c/p\u003e \u003cp\u003eWe further manually removed genomes from known \u003cem\u003eSaccharomyces\u003c/em\u003e hybrids, namely \u003cem\u003eSaccharomyces kudriavzevii (Ensembl, GCA_000167075.2)\u003c/em\u003e, \u003cem\u003eSaccharomyces pastorianus (Ensembl, GCA_011022315.1)\u003c/em\u003e and \u003cem\u003eSaccharomyces 
boulardii\u003c/em\u003e \u003cem\u003e(Mycocosm, Sacboulardii_1)\u003c/em\u003e, as they caused the misassignment of \u003cem\u003eSaccharomyces cerevisiae\u003c/em\u003e reads in our ZymoBIOMICS\u0026reg; mock and a previously sequenced mock community [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eIn addition, some genomes had to be discarded due to the presence of synonymous species names or errors in the species names that prevented correct taxonomic annotation and conversion to the format required by Kraken2.\u003c/p\u003e\n\u003ch3\u003eMetagenomics data analysis\u003c/h3\u003e\n\u003cp\u003eAs positive controls, we used raw metagenomics sequencing data from two mock communities: one comprising 18 bacterial strains, 2 fungal strains and 1 archaeal strain (ZymoBIOMICS\u0026reg; Gut Microbiome Standard, reference D6331, ZymoResearch), which was sequenced in this study (see below), and another one containing 44 fungal strains (SRX10705695), which was obtained from [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eWe also used a dataset comprising sequencing data from 1,212 stool samples from a previously published meta-analysis [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e], including data from the PRJNA447983 [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e], PRJEB27928 [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e], PRJDB4176 [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e] and PRJEB10878 [\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e] studies. 
Another dataset from this meta-analysis, PRJNA389927 [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e], was not considered here due to its low sequencing quality and depth.\u003c/p\u003e \u003cp\u003eWe processed all the samples using the MeTAline (v0.8.0-alpha) pipeline (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://zenodo.org/records/8221398\u003c/span\u003e\u003cspan address=\"https://zenodo.org/records/8221398\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003cspan type=\"Underline\" class=\"Underline\" name=\"Emphasis\"\u003e)\u003c/span\u003e, using both the standard Kraken2 database and our newly created B-GUT database for taxonomy assignment. Subsequently, we built a phyloseq (v. 1.38.0) [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e] object linking the counts table and the metadata for each sample.\u003c/p\u003e \u003cp\u003eCorrelations between the taxonomy assignment results for the mocks and their expected theoretical composition were calculated with cor.test (Spearman method) from the stats R package (v. 4.1.2).\u003c/p\u003e \u003cp\u003eFor the meta-analysis data, we removed prokaryotic taxa with fewer than 100 reads or present in less than 25% of the samples. For eukaryotes, where a lower prevalence is expected, we removed taxa with fewer than 50 reads or present in fewer than 5 samples. We performed differential abundance analysis with the linda [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e] function from the MicrobiomeStat R package (v. 1.2), including as fixed effects the bioproject (to account for batch effect) and the disease status (restricted to CRC vs healthy controls). 
Significance was defined as an adjusted p-value lower than 0.05.\u003c/p\u003e \u003cp\u003eTo assess the relevance of differentially abundant taxa, we searched for previous associations with CRC in two biological knowledge databases: the human gut microbiome atlas [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e] and Disbiome [\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e] (both databases last accessed on 16 August 2024).\u003c/p\u003e \u003cp\u003eTo assess cross-mapping of human reads, we simulated 10\u0026nbsp;million reads of length 150 bp from the T2T human genome using the wgsim programme (v. 0.3.1-r13), setting the mutation rate, base error rate and indel fraction to 0. Simulated reads were mapped to the T2T or GRCh38 reference genome with hisat2 (v. 2.2.1) and unmapped reads were extracted using samtools (v. 1.3.1).\u003c/p\u003e\n\u003ch3\u003eIllumina sequencing of a bacterial mock community\u003c/h3\u003e\n\u003cp\u003eThe ZymoBIOMICS\u0026reg; Gut Microbiome Standard was completely thawed on ice and mixed thoroughly by vortexing to ensure cells were evenly resuspended. Next, 75 \u0026micro;l were used for DNA extraction following a previously described phenol:chloroform-based DNA isolation method [\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e]. The mock sample was sequenced using a whole genome shotgun approach with 2 x 150 bp paired-end libraries and BGI technology.\u003c/p\u003e"},{"header":"RESULTS","content":"\u003cp\u003e \u003cb\u003eCuration of a pan-fungal genome reference database uncovers high levels of bacterial contamination.\u003c/b\u003e \u003c/p\u003e \u003cp\u003eTo alleviate the under-representation of fungi in reference genome databases for k-mer based metagenomic analyses, we aimed at creating a curated broad fungal genome reference dataset to complement available prokaryotic datasets. 
To this end, we compiled a non-redundant set (one representative genome per species) of 2,168 fungal genomes from major repositories, including Ensembl Fungi (141 genomes), FungiDB (64), Mycocosm (1,854) and RefSeq (109).\u003c/p\u003e \u003cp\u003eTo assess the added value of the selected fungal genomes, and the potential presence of contaminant sequences in them, we analyzed their sequences with Kraken2 using the default standard complete database (Standard db from here on). Only 5 (7.8%), 18 (16.5%), 3 (2.13%) and 76 (4.07%) of the genomes from FungiDB, RefSeq, Ensembl and Mycocosm, respectively, resulted in a unique assignment, and of these only 4 (6.25%), 13 (11.9%), 0 and 40 (2.14%), respectively, had unique assignments to the correct species. This result underscores the incompleteness of the default database and the added value of the compiled dataset. Importantly, our results also indicated a potentially high level of non-eukaryotic contamination, as 1,738 (79.72%) of the assemblies had matches to one or more species outside Eukaryota, with Bacteria being the most represented non-eukaryotic kingdom (Supplementary material, Figure \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e). 
This finding motivated us to develop a specific decontamination pipeline to identify and eliminate contaminated sequences, which combined Kraken2 analysis with a deep learning sequence classifier and GC content assessment (see Materials and methods, \u003cem\u003eContamination assessment and curation of fungal genomes\u003c/em\u003e).\u003c/p\u003e \u003cp\u003eTo validate this decontamination pipeline, we performed a detailed analysis of four FungiDB genomes from the same study [\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e], \u003cem\u003eAmauroascus niger, Chrysosporium queenslandicum, Byssoonygena ceratinophila\u003c/em\u003e and \u003cem\u003eAmauroascus mutatus\u003c/em\u003e, for which our pipeline detected a considerable percentage of bacterial contigs, as assessed by tiara (Supplementary material, Figure \u003cspan refid=\"MOESM2\" class=\"InternalRef\"\u003eS2\u003c/span\u003eC). We confirmed this by assessing other metrics: i) Kraken2-inferred bacterial content represented, respectively, 52.54, 75.62, 47.85 and 74.83% of the sequences; ii) bacterial BUSCO completeness was high (80.7, 29.8, 15.3 and 17.7%, respectively); and iii) GC content distributions were multimodal in the four species. Inspection of the Kraken2 profiles uncovered \u003cem\u003eRamlibacter tataouinensis\u003c/em\u003e as the most represented bacterial taxon, accounting for 12.98%, 17.36%, 7.17% and 12.98% of the assigned reads, respectively, suggesting a potential main source of contamination in this study. 
As these strains originate from at least two different culture collections, contamination is likely to have appeared during preparation for sequencing.\u003c/p\u003e \u003cp\u003eAdditionally, we tested tiara, the tool used by our decontamination strategy to remove non-eukaryotic contigs, with a defined dataset of sequences from known species: four bacteria (\u003cem\u003eHelicobacter pylori, Mycobacterium tuberculosis, Bartonella bacilliformis\u003c/em\u003e and \u003cem\u003eXanthomonas euvesicatoria\u003c/em\u003e), five fungi \u003cem\u003e(Amniculicola lignicola, Absidia glauca, Cryptococcus flavescens, Claviceps sorghi\u003c/em\u003e and \u003cem\u003eEpichloe glyceriae\u003c/em\u003e), one insect ( \u003cem\u003eHomalodisca vitripennis)\u003c/em\u003e and one virus (\u003cem\u003ePseudomonas phage LUZ7\u003c/em\u003e). Our results on this custom dataset show a good performance of the tool with a single sequence, the viral \u003cem\u003ePseudomonas phage LUZ7\u003c/em\u003e, being incorrectly classified as bacteria (Supplementary material, Table \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e). Additionally we ran the whole pipeline on data from 14 selected genomes from model fungal organisms that can be assumed to be free of contamination because the organisms are available in axenic cultures and because the genomes have been extensively used (see Materials and methods, \u003cem\u003eContamination assessment and curation of fungal genomes\u003c/em\u003e). Most contigs were classified as eukaryotic and the highest percentage of bacterial reads found was minimal (0.02% in \u003cem\u003eAllomyces macrogynus\u003c/em\u003e, Supplementary material, Figure S3).\u003c/p\u003e \u003cp\u003eFrom these results we concluded that our decontamination strategy was effective, and we therefore applied it to the whole fungal dataset, removing 58 genomes that did not pass our filters as well as some additional contaminated contigs. 
Additionally, we manually removed three known yeast hybrid species (\u003cem\u003eSaccharomyces kudriavzevii, Saccharomyces pastorianus\u003c/em\u003e and \u003cem\u003eSaccharomyces boulardii\u003c/em\u003e) that caused the misidentification of \u003cem\u003eSaccharomyces cerevisiae\u003c/em\u003e reads (Supplementary material, Table \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e). The final curated fungal dataset comprises curated genomes from 2,110 species.\u003c/p\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eB-GUT outperforms the complete Standard db in the analysis of synthetic mock communities\u003c/h2\u003e \u003cp\u003eWe created a broad gut microbiome reference database (B-GUT) by adding, to the above-described curated fungal database, existing curated reference genome datasets of gut-specific bacteria and archaea [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e, \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e], and the telomere-to-telomere (T2T) reference human genome. To compare B-GUT with the widely used Kraken2 complete standard database (Standard db), we ran the same pipeline with each of the two databases to analyze data from two mock microbial communities with known composition: a gut-specific microbiome mock including known quantities of cells from 15 prokaryotic species and 2 fungal species (\u003cem\u003eCandida albicans\u003c/em\u003e and \u003cem\u003eSaccharomyces cerevisiae\u003c/em\u003e), and a fungal-specific mock community comprising 44 fungal strains (39 different species). The use of B-GUT in the gut-specific mock resulted in higher correlations between obtained and expected results (Rho: 0.85355 and p-value: 1.32E-05) as compared to the standard database (Rho: 0.574676 and p-value: 0.0158, Table \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e). 
Importantly, two species \u003cem\u003eVeillonella rogosae and Prevotella corporis\u003c/em\u003e were exclusively detected with B-GUT (Supplementary material, Table \u003cspan refid=\"MOESM2\" class=\"InternalRef\"\u003eS2\u003c/span\u003e). Additionally, B-GUT enhanced the detection of fungal species, as assessed on the fungal mock community, correctly identifying 28/44 strains as compared to 12/44 by the Standard db (Supplementary Figure S4). We inspected the 16 strains missed by B-GUT. Three strains from the same species \u003cem\u003eCryptococcus gattii VGI, Cryptococcus gattii VGII, Cryptococcus gattii VGIVa\u003c/em\u003e were not included in B-GUT due to the above mentioned selection of only one strain per species, but the species was correctly detected. Three additional genomes \u003cem\u003eFilobasidium magnum\u003c/em\u003e (former \u003cem\u003eCryptococcus magnus\u003c/em\u003e), \u003cem\u003eDiutina catenulata\u003c/em\u003e (former \u003cem\u003eCandida catenulata\u003c/em\u003e), and \u003cem\u003eDiutina mesorugosa\u003c/em\u003e (former \u003cem\u003eCandida mesorugosa\u003c/em\u003e) were locally sequenced and assembled by the authors of that mock community, and consequently not present in the databases from which we downloaded the included fungal genomes. Seven further species have sequences in NCBI Genbank, but not in NCBI Refseq.\u0026nbsp;These cases are: \u003cem\u003eBlastobotrys proliferans, Geotrichum fermentans\u003c/em\u003e, \u003cem\u003eKodamaea ohmeri, Meyerozyma caribbica, Pichia norvegensis\u003c/em\u003e, \u003cem\u003eScedosporium aurantiacum\u003c/em\u003e and \u003cem\u003eScedosporium boydii\u003c/em\u003e (former \u003cem\u003ePseudallescheria boydii\u003c/em\u003e). 
In the case of \u003cem\u003eTrichophyton rubrum\u003c/em\u003e, the species was included but the reads were assigned to another species from the same genus (\u003cem\u003eCutaneotrichosporon dermatis\u003c/em\u003e, former \u003cem\u003eTrichosporon dermatis\u003c/em\u003e). Finally, \u003cem\u003eYarrowia lipolytica\u003c/em\u003e was not included in our database because it did not pass our decontamination filters. Upon closer inspection we noticed that this was caused by a file naming error in the source database (JGI-mycocosm, Yarrowia lipolytica CLIB122) that resulted in the download of the soft-masked version of the genome, which in turn prompted the label of \u0026ldquo;unknown\u0026rdquo; in the tiara classification. These results underscore the improved performance of B-GUT in identifying fungi at the species level and point to remaining gaps.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eB-GUT improves the detection of potential gut microbiome biomarkers for colorectal cancer\u003c/h3\u003e\n\u003cp\u003eTo further evaluate the performance of B-GUT and showcase its use on real case data, we re-analyzed a previously published meta-analysis study encompassing 1,329 fecal metagenomes and focused on the detection of biomarkers and potential associations of fungi with colorectal cancer (CRC) [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]. That study used a custom Kraken2 database including 9,543 bacteria and 909 fungi, which represents, to the best of our knowledge, the metagenomics study using a custom database with the broadest coverage of fungi. However, no curation for potential sequence contaminants was performed. In addition, that custom database is not publicly available. 
From that study, we selected data with paired-end sequence data and sufficient depth of coverage, re-analyzed the data using the MeTAline pipeline and either B-GUT or the Standard database, and performed differential abundance analysis between healthy and CRC samples (see Materials and Methods).\u003c/p\u003e \u003cp\u003eWe detected 297 differentially abundant prokaryotic taxa at the species level using B-GUT as compared to 445 using the Standard db (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). Importantly, the two analyses overlapped in only 54 of the differentially abundant species. This result underscores the high impact of database choice.\u003c/p\u003e \u003cp\u003eTo assess the consistency of each set of differentially abundant species with previous knowledge on CRC associations, we mined information on these taxa from two biology knowledge databases [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e, \u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e]. We found a higher fraction of previous CRC associations in differentially abundant species detected with B-GUT (12.12%, 36/297) as compared to Standard db results (4.67%, 21/450). Although smaller, the difference was also notable at the genus level: 53/297 (17.95%) for B-GUT as compared to 50/450 (11.11%) for the Standard db. 
These results suggest that the use of B-GUT improves the detection of meaningful differentially abundant taxa, particularly at the species level.\u003c/p\u003e \u003cp\u003eAmong the four most significant differentially abundant species, three were shared between the two databases: namely, the widely reported CRC-associated species \u003cem\u003eParvimonas micra\u003c/em\u003e, \u003cem\u003eGemella morbillorum\u003c/em\u003e and \u003cem\u003eFusobacterium nucleatum\u003c/em\u003e [\u003cspan additionalcitationids=\"CR36\" citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e]. Consistently, in the \u003cem\u003eLin et al.\u003c/em\u003e study [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e], these three species are among the most important bacterial features in the machine learning classifier. However, the top differentially abundant species found in the B-GUT analysis was \u003cem\u003ePeptostreptococcus stomatis\u003c/em\u003e, which has been previously linked to CRC [\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e]. Importantly, this relevant species is absent from the Standard database and therefore was not identified either by the Lin et al. study or by our analysis when using the Standard db. This suggests that incompleteness in the Standard db can result in overlooking relevant biomarkers.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eAbove the volcano plots, a Venn diagram represents the intersection of taxa at the species level between the analyses using the standard database (orange) and the B-GUT database (green). 
The percentage of species-level taxa previously associated with colorectal cancer (CRC) is indicated inside each circle of the Venn diagram.\u003c/p\u003e \u003cp\u003eAs for eukaryotes, B-GUT detected 16 significantly differentially abundant fungal species (Supplementary material, Table S4, T2T-B-GUT) as compared to only 4 eukaryotic species, including 2 fungal species, with the standard database, with a single species in common between B-GUT results and the reference study [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e] (\u003cem\u003eRhizophagus irregularis\u003c/em\u003e). The differentially abundant fungal species detected with B-GUT included \u003cem\u003eSaccharomyces cerevisiae\u003c/em\u003e and \u003cem\u003eKluyveromyces marxianus\u003c/em\u003e, both overrepresented in healthy individuals as compared to CRC individuals, which is in line with previous studies showing a protective role of these species against CRC through induction of apoptosis, resulting in inhibition of metastasis, proliferation and growth of tumors [\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e, \u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e].\u003c/p\u003e\n\u003ch3\u003eCross-mapping of human reads and assembly contamination results in over-detection of fungal species\u003c/h3\u003e\n\u003cp\u003eGiven the reasonably high fungal coverage in the database used in the reference study [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e], we expected a higher overlap with our analysis in terms of differentially abundant fungal species, at least with respect to their top features (e.g. \u003cem\u003eAspergillus\u003c/em\u003e species). 
Lack of overlap between the two studies could be partly attributable to methodological differences with our re-analysis, such as the use of different data processing and differential abundance pipelines, as well as incomplete overlap of the analyzed data. We focused on two key methodological differences resulting from i) decontamination of fungal genomes in B-GUT but not in the reference study and ii) removal of host reads with the T2T human reference in our analysis as compared to GRCh38 in the reference study.\u003c/p\u003e \u003cp\u003eTo assess the impact of these two features, we repeated the analysis using variations of B-GUT with or without decontamination (B-GUT vs B-GUTc, respectively) and the standard or telomere-to-telomere human reference genome (GRCh38 vs T2T, respectively), in all combinations, namely GRCh38-B-GUT, GRCh38-B-GUTc, T2T-B-GUT and T2T-B-GUTc.\u003c/p\u003e \u003cp\u003eAs shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, fungal abundances (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eA right) but not bacterial abundances (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eA left) exhibit distinct density distributions across the different assembly-db combinations, indicating that the use of B-GUT without decontamination leads to higher estimated fungal abundances. This trend is also observed when comparing fungi/bacteria abundance ratios (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eB). 
The effect on fungal identification of using the T2T reference for host removal was comparatively smaller when the non-decontaminated database was used (GRCh38-B-GUTc vs T2T-B-GUTc, Wilcoxon p-value\u0026thinsp;=\u0026thinsp;0.04992).\u003c/p\u003e \u003cp\u003eWhen assessing per-sample differences in inferred fungal abundances across different combinations of host-depletion and reference database, we confirmed a consistent trend towards higher fungal abundance in all combinations as compared to T2T-B-GUT, with the use of B-GUTc (the non-decontaminated version of the database) causing the most acute differences (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eC).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThese results, coupled with the results of our decontamination analysis explained above, indicate that the use of non-decontaminated fungal genomes as references leads to over-estimation of overall fungal content, likely resulting from mis-identification of bacterial reads as fungi. In addition, albeit with a comparatively smaller effect, the use of a more comprehensive host read depletion using the T2T reference assembly further reduced the identification of fungal reads in the context of the decontaminated version of the database. We hypothesized that cross-mapping of repetitive human sequences (included in T2T but generally lacking in the GRCh38 reference) to repetitive regions of fungal genomes may underlie this effect. To test this, we simulated reads from the T2T genome and mapped them to either the T2T or the GRCh38 reference. Whereas 100% mapping was achieved with T2T, we obtained an alignment rate of 96.8% on the GRCh38 reference, indicating incomplete sequence representation. We extracted the T2T reads that were unmapped to GRCh38 and analyzed them with Kraken2 using either B-GUT or the standard database. 
With B-GUT, 84.68% of these reads were classified as human, 4.17% as fungi and 9.53% remained unclassified, suggesting that at least a fraction of T2T-exclusive sequences can be classified as fungi. With the Standard database, which contains very few fungal genomes, only 0.09% of the reads were assigned to fungi, while 16.54% remained unclassified and 60.07% were classified as human. Surprisingly, a significant percentage of these reads was assigned to the eukaryotic group Sar; specifically, 7.16% were assigned to \u003cem\u003eToxoplasma gondii ME49\u003c/em\u003e (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e). All in all, these results reinforce the need to use the T2T reference to deplete human reads in order to avoid potential false positives.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eWe next examined the impact of different assembly-db combinations on downstream analysis of differentially abundant fungal species when comparing CRC and healthy samples. We obtained 16 differentially abundant fungal species with T2T-B-GUT, 18 with GRCh38-B-GUT, 5 with T2T-B-GUTc and 7 with GRCh38-B-GUTc (Supplementary material Table S4). Hence, although the use of the decontaminated database reduces the overall estimation of fungal reads, it results in the detection of more differentially abundant species, likely due to less dispersed and noisy mapping. Comparatively, the use of a more comprehensive host depletion with T2T had a minor effect, with \u003cem\u003eByssothecium circinans, Friedmanniomyces simplex, Heterodoassansia hygrophilae\u003c/em\u003e and \u003cem\u003eCerrena unicolor\u003c/em\u003e uniquely detected when using GRCh38. 
Notably, we only detected \u003cem\u003eAspergillus\u003c/em\u003e species, in line with the reference study, when using the contaminated version of B-GUT: \u003cem\u003eAspergillus keveii\u003c/em\u003e and \u003cem\u003eAspergillus austroafricanus\u003c/em\u003e (in GRCh38-B-GUTc), and \u003cem\u003eAspergillus austroafricanus\u003c/em\u003e (in T2T-B-GUTc).\u003c/p\u003e"},{"header":"DISCUSSION","content":"\u003cp\u003eAccurate taxonomic assignment is crucial to deriving meaningful insights from metagenomic datasets. This step critically depends on the comprehensiveness and quality of the reference genome database used for sequence comparison. While prokaryotic microorganisms, which dominate the biomass in many ecosystems [\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e], have been the primary focus of metagenomic studies, eukaryotic microorganisms, such as fungi, can exert significant roles within microbial communities despite their often lower abundance. Consequently, interest in the study of eukaryotes within the microbiome has grown in recent years. However, the limited representation of eukaryotic genomes in current reference databases hinders their investigation and restricts the discovery of potentially important ecological associations. To address this and other limitations, we have developed the Broad Gut Microbiome Database (B-GUT). This database offers three key advancements over the standard complete database provided by Kraken2: (i) it features a curated and human-gut-specific collection of prokaryotic genomes, (ii) it incorporates over 2,000 representative and contamination-screened fungal genomes, and (iii) it includes the T2T human reference genome to enhance the removal of repetitive host-derived sequences. A major finding of our study was the uncovering of a significant amount of bacterial sequences in publicly available fungal genomes, underscoring the need for decontamination. 
A key aspect of our work is the development and validation of a powerful decontamination pipeline. By integrating k-mer mapping, GC content analysis, and machine learning, this pipeline effectively identifies and removes contaminating contigs from assemblies. Application of this pipeline revealed various levels of contamination in publicly available datasets, highlighting potential sources of contamination in some cases. This decontamination tool could have alternative uses in the future, including sanity-checking newly assembled fungal genomes or pre-filtering fungal genomes before comparative genomics analyses.\u003c/p\u003e \u003cp\u003eThe newly developed B-GUT databases offered improved results over the standard database, as assessed on two independent mock communities, particularly for fungi but also for prokaryotic identification. Still, some organisms of the mock communities remained unidentified or misidentified, highlighting the need for future improvement. The main reasons identified for our false negatives were the lack of publicly available genome sequences or their availability in different source databases.\u003c/p\u003e \u003cp\u003eAdditionally, we have showcased the use of B-GUT in a real-case scenario by reanalysing a published meta-analysis focused on the identification of CRC biomarkers. Our results are consistent with a high impact of the reference database used on metagenomics analysis [\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e], and underscore the importance of using niche-specific databases, given that the use of B-GUT, as compared to the standard database, produced results of greater biological significance based on prior knowledge. The differences affected not only the overall results but also the top-ranking species with regard to associations with the trait of interest (CRC in this case). 
Notably, the use of B-GUT uniquely identified \u003cem\u003ePeptostreptococcus stomatis\u003c/em\u003e as a differentially abundant taxon, which was undetected in the original study. \u003cem\u003eP. stomatis\u003c/em\u003e has been identified as a CRC biomarker in multiple studies [\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e, \u003cspan additionalcitationids=\"CR44\" citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e]. However, it is not included in the standard Kraken2 database and is therefore missed by studies using this database. Thus, the incorporation of \u003cem\u003eP. stomatis\u003c/em\u003e as a feature in the machine learning models of [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e] could potentially enhance their predictive accuracy.\u003c/p\u003e \u003cp\u003eWith respect to the identification of fungal microorganisms, the use of B-GUT led to the detection of a higher number of differentially abundant fungi compared to the standard database. According to the previous literature, the detected fungal species were more biologically relevant than those identified using the standard database or in the reference study [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e], which also used an extensive, albeit not curated, fungal database. Of note, our study uniquely detected two species previously found to be associated with CRC: \u003cem\u003eSaccharomyces cerevisiae\u003c/em\u003e and \u003cem\u003eKluyveromyces marxianus. S. 
cerevisiae\u003c/em\u003e, which is an important component of the gut microbiome, was previously reported as depleted in CRC, as in our study [\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e], and has been shown to have protective effects against CRC [\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e, \u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e]. \u003cem\u003eK. marxianus\u003c/em\u003e has been used as a probiotic, and has also been shown to have potentially beneficial roles in CRC [\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e, \u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eOur results directly comparing the decontaminated and non-decontaminated versions of B-GUT clearly showed that a lack of decontamination leads to over-estimation of fungal content, likely resulting from bacterial reads mapping to contaminated contigs in fungal reference genomes. Of note, many of the differences between our analysis and the reference study [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e] concerned the identification of \u003cem\u003eAspergillus\u003c/em\u003e species as differentially abundant only in the latter. Genomes from this genus were not flagged as potentially contaminated in our curation of the database, so we conclude that the assignments are likely real. However, the presence of contaminants in other fungal species may have influenced the overall composition and thus the differential abundance results. This effect suggests that certain contaminants, even if limited in number, may skew normalized abundance comparisons and impact statistical significance. 
Further research is required to clarify these indirect effects.\u003c/p\u003e \u003cp\u003eAnother innovation of our approach was the use of the T2T human reference genome assembly to remove host reads, which has been previously shown to improve metagenomics analysis [\u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e50\u003c/span\u003e]. Our results show an overall significant impact of this addition in terms of fungal assignments when the decontaminated database was used, suggesting that the impact of contamination overrides that of the host removal approach. However, cross-mapping analyses showed that human reads removed uniquely by T2T do cross-map to other species, including fungi and protists such as \u003cem\u003eToxoplasma gondii\u003c/em\u003e, confirming the relevance of using this host-removal approach to avoid false positives.\u003c/p\u003e \u003cp\u003eOur study has some limitations. Our decontamination pipeline applies a strict filter that also removes contigs classified as \u003cem\u003eunknown\u003c/em\u003e, and it might therefore discard genuine fungal contigs, as may have happened in the case of \u003cem\u003eYarrowia lipolytica\u003c/em\u003e. A similar situation was observed for eight genomes from the same database (JGI MycoCosm) that were completely classified as unknown by Tiara because of the presence of soft-masking: \u003cem\u003eArthroderma benhamiae\u003c/em\u003e, \u003cem\u003eAspergillus clavatus, Aspergillus nidulans, Cladosporium sphaerospermum, Kluyveromyces lactis, Neosartorya fischeri, Podospora anserina\u003c/em\u003e and \u003cem\u003eTrichophyton verrucosum\u003c/em\u003e. Of note, the download of a soft-masked version resulted from a file naming error in the source database, which attests to the importance of database curation. 
Nevertheless, excluding these genomes is not critical for our purpose of building a broad fungal database to study the gut microbiome, because they are not gut-related and other species represent the corresponding genera. Furthermore, including the contigs classified as unknown might improve the representation of fungi but also risks increasing the number of false positives.\u003c/p\u003e \u003cp\u003eAlthough B-GUT has been specifically built for Kraken2, the input genomes and the decontamination tool can be used to build any k-mer-based database index (e.g., for Centrifuge [\u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e51\u003c/span\u003e], CLARK [\u003cspan citationid=\"CR52\" class=\"CitationRef\"\u003e52\u003c/span\u003e], k-SLAM [\u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e53\u003c/span\u003e], etc.).\u003c/p\u003e"},{"header":"CONCLUSIONS","content":"\u003cp\u003eAll in all, our findings emphasize the importance of using well-curated, niche-specific databases for taxonomic assignment in metagenomic studies. By carefully evaluating and refining the genomes included in these databases, we can enhance the accuracy and biological relevance of the results, particularly in the identification of disease-associated biomarkers.\u003c/p\u003e "},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003e\u003cu\u003eAUTHORS\u0026rsquo; CONTRIBUTIONS\u003c/u\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTG conceptualized and supervised the study. OKL performed all the analyses. 
All the authors designed the analysis, discussed the results and wrote the manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cu\u003eFUNDING\u003c/u\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTG group acknowledges support from the Spanish Ministry of Science and Innovation (grant numbers PID2021-126067NB-I00, CPP2021-008552, PCI2022-135066-2, PLEC2023-010225, and PDC2022-133266-I00), co-funded by ERDF \u0026ldquo;A way of making Europe\u0026rdquo;, as well as support from the Catalan Research Agency (AGAUR) (grant number SGR01551); Gordon and Betty Moore Foundation (grant number GBMF9742); \u0026ldquo;La Caixa\u0026rdquo; foundation (grant number LCF/PR/HR21/00737), Fundaci\u0026oacute; La Marat\u0026oacute; de TV3 (202328-31), AECC (PRYGN234923GABA), and Instituto de Salud Carlos III (IMPACT grant IMP/00019 and CIBERINFEC CB21/13/00061- ISCIII-SGEFI/ERDF). OKL is supported by the Formaci\u0026oacute;n de profesorado universitario (FPU) program from the Spanish Ministerio de Universidades (FPU2020-02907).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cu\u003eAvailability of data and materials\u003c/u\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe preprocessing and decontamination pipeline code is available in our GitHub repositories: https://github.com/Gabaldonlab/meTAline and https://github.com/Gabaldonlab/B-GUT-decontamination\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThe newly sequenced mock community is available under accession PRJNA1266093. 
This will be available upon publication, but a reviewer\u0026rsquo;s link has been created: https://dataview.ncbi.nlm.nih.gov/object/PRJNA1266093?reviewer=kgb8ai7pg6emauesah1fm1ov1g\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThe fungal mock community was downloaded from SRX10705695 [23].\u003c/p\u003e\n\u003cp\u003eMetagenomics data for the meta-analysis were downloaded from the corresponding bioprojects: PRJNA447983 [24], PRJEB27928 [25], PRJDB4176 [26] and PRJEB10878 [27]\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThe B-GUT database can be downloaded from the PhylomeDB FTP server. First connect to the server with the command \u0026ldquo;ftp phylomedb.org\u0026rdquo;. When prompted, enter \u0026ldquo;anonymous\u0026rdquo; as the user name and leave the password empty by pressing the Enter key. Once successfully logged in, move to the B-GUT folder with the command \u0026ldquo;cd B-GUT\u0026rdquo; to see the corresponding database files.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for publication\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable\u0026nbsp;\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eStavrou AA. Misidentification of Genome Assemblies in Public Databases: the Case of Naumovozyma Dairenensis and Proposal of a Protocol to Correct Misidentifications. 2017.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLupo V, Van Vlierberghe M, Vanderschuren H, Kerff F, Baurain D, Cornet L. Contamination in Reference Sequence Databases: Time for Divide-and-Rule Tactics. Front Microbiol. 
2021;12:755101.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAlmeida A, Nayfach S, Boland M, Strozzi F, Beracochea M, Shi ZJ, et al. A unified catalog of 204,938 reference genomes from the human gut microbiome. Nat Biotechnol. 2021;39:105\u0026ndash;14.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi W, Liang H, Lin X, Hu T, Wu Z, He W, et al. A catalog of bacterial reference genomes from cultivated human oral bacteria. npj Biofilms and Microbiomes. 2023;9:1\u0026ndash;13.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen T, Yu W-H, Izard J, Baranova OV, Lakshmanan A, Dewhirst FE. The Human Oral Microbiome Database: a web accessible resource for investigating oral microbe taxonomic and genomic information. Database: The Journal of Biological Databases and Curation. 2010;2010:baq013.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBelvoncikova P, Splichalova P, Videnska P, Gardlik R. The Human Mycobiome: Colonization, Composition and the Role in Health and Disease. J Fungi (Basel). 2022;8.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBahram M, Netherway T. Fungi as mediators linking organisms and ecosystems. FEMS Microbiol Rev. 2022;46.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGao R, Xia K, Wu M, Zhong H, Sun J, Zhu Y, et al. Alterations of Gut Mycobiota Profiles in Adenoma and Colorectal Cancer. Front Cell Infect Microbiol. 2022;12:839435.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLin Y, Lau HC-H, Liu Y, Kang X, Wang Y, Ting NL-N, et al. Altered Mycobiota Signatures and Enriched Pathogenic Aspergillus rambellii Are Associated With Colorectal Cancer Based on Multicohort Fecal Metagenomic Analyses. Gastroenterology. 2022;163:908\u0026ndash;21.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol. 
2019;20:1\u0026ndash;13.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHiseni P, Rudi K, Wilson RC, Hegge FT, Snipen L. HumGut: a comprehensive human gut prokaryotic genomes collection filtered by metagenome data. Microbiome. 2021;9:165.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eO\u0026rsquo;Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44:D733\u0026ndash;45.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChibani CM, Mahnert A, Borrel G, Almeida A, Werner A, Brug\u0026egrave;re J-F, et al. A catalogue of 1,167 genomes from the human gut archaeome. Nat Microbiol. 2022;7:48\u0026ndash;61.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, et al. The complete sequence of a human genome. Science. 2022;376:44\u0026ndash;53.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eEnsembl Fungi. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://fungi.ensembl.org/index.html\u003c/span\u003e\u003cspan address=\"https://fungi.ensembl.org/index.html\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. Accessed 3 Jul 2024.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFungiDB. https://fungidb.org/. Accessed 3 Jul 2024.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMycocosm. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://mycocosm.jgi.doe.gov/mycocosm/home\u003c/span\u003e\u003cspan address=\"https://mycocosm.jgi.doe.gov/mycocosm/home\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. Accessed 3 Jul 2024.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRefSeq: NCBI Reference Sequence Database. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.ncbi.nlm.nih.gov/refseq/\u003c/span\u003e\u003cspan address=\"https://www.ncbi.nlm.nih.gov/refseq/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. Accessed 3 Jul 2024.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eManni M, Berkeley MR, Seppey M, Sim\u0026atilde;o FA, Zdobnov EM. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Mol Biol Evol. 2021;38:4647\u0026ndash;54.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29:1072\u0026ndash;5.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKarlicki M, Antonowicz S, Karnkowska A. Tiara: deep learning-based classification system for eukaryotic sequences. Bioinformatics. 2022;38:344\u0026ndash;50.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYang F, Sun J, Luo H, Ren H, Zhou H, Lin Y, et al. Assessment of fecal DNA extraction protocols for metagenomic studies. GigaScience. 2020;9:giaa071.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHu Y, Irinyi L, Hoang MTV, Eenjes T, Graetz A, Stone EA, et al. Inferring Species Compositions of Complex Fungal Communities from Long- and Short-Read Sequence Data. MBio. 2022;13:e0244421.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eThomas AM, Manghi P, Asnicar F, Pasolli E, Armanini F, Zolfo M, et al. Metagenomic analysis of colorectal cancer datasets identifies cross-cohort microbial diagnostic signatures and a link with choline degradation. Nat Med. 2019;25:667\u0026ndash;78.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWirbel J, Pyl PT, Kartal E, Zych K, Kashani A, Milanese A, et al. 
Meta-analysis of fecal metagenomes reveals global microbial signatures that are specific for colorectal cancer. Nat Med. 2019;25:679\u0026ndash;89.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYachida S, Mizutani S, Shiroma H, Shiba S, Nakajima T, Sakamoto T, et al. Metagenomic and metabolomic analyses reveal distinct stage-specific phenotypes of the gut microbiota in colorectal cancer. Nat Med. 2019;25:968\u0026ndash;76.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYu J, Feng Q, Wong SH, Zhang D, Liang QY, Qin Y, et al. Metagenomic analysis of faecal microbiome as a tool towards targeted non-invasive biomarkers for colorectal cancer. Gut. 2017;66:70\u0026ndash;8.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHannigan GD, Duhaime MB, Ruffin MT 4th, Koumpouras CC, Schloss PD. Diagnostic Potential and Interactive Dynamics of the Colorectal Cancer Virome. MBio. 2018;9.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMcMurdie PJ, Holmes S. phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLoS One. 2013;8:e61217.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhou H, He K, Chen J, Zhang X. LinDA: linear models for differential abundance analysis of microbiome compositional data. Genome Biol. 2022;23:1\u0026ndash;23.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLee S, Portlock T, Le Chatelier E, Garcia-Guevara F, Clasen F, O\u0026ntilde;ate FP, et al. Global compositional and functional states of the human gut microbiome in health and disease. Genome Res. 2024;34:967\u0026ndash;78.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJanssens Y, Nielandt J, Bronselaer A, Debunne N, Verbeke F, Wynendaele E, et al. Disbiome database: linking the microbiome to disease. BMC Microbiol. 2018;18:50.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVesty A, Biswas K, Taylor MW, Gear K, Douglas RG. 
Evaluating the Impact of DNA Extraction Method on the Representation of Human Oral Bacterial and Fungal Communities. PLoS One. 2017;12:e0169877.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWhiston E, Taylor JW. Comparative Phylogenomics of Pathogenic and Nonpathogenic Species. G3 Genes|Genomes|Genetics. 2016;6:235\u0026ndash;44.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOsman MA, Neoh H-M, Ab Mutalib N-S, Chin S-F, Mazlan L, Raja Ali RA, et al. Parvimonas micra, Peptostreptococcus stomatis, Fusobacterium nucleatum and Akkermansia muciniphila as a four-bacteria biomarker panel of colorectal cancer. Sci Rep. 2021;11:1\u0026ndash;12.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eConde-P\u0026eacute;rez K, Aja-Macaya P, Buetas E, Trigo-Tasende N, Nasser-Ali M, Rumbo-Feal S, et al. The multispecies microbial cluster of Fusobacterium, Parvimonas, Bacteroides and Faecalibacterium as a precision biomarker for colorectal cancer diagnosis. Mol Oncol. 2024;18:1093\u0026ndash;122.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSenthakumaran T, Tann\u0026aelig;s TM, Moen AEF, Brackmann SA, Jahanlu D, Rounge TB, et al. Detection of colorectal-cancer-associated bacterial taxa in fecal samples using next-generation sequencing and 19 newly established qPCR assays. Mol Oncol. 2024. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1002/1878-0261.13700\u003c/span\u003e\u003cspan address=\"10.1002/1878-0261.13700\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHuang P, Ji F, Cheung AH-K, Fu K, Zhou Q, Ding X, et al. Peptostreptococcus stomatis promotes colonic tumorigenesis and receptor tyrosine kinase inhibitor resistance by activating ERBB2-MAPK. Cell Host Microbe. 
2024;32:1365\u0026ndash;79.e10.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSambrani R, Abdolalizadeh J, Kohan L, Jafari B. Saccharomyces cerevisiae inhibits growth and metastasis and stimulates apoptosis in HT-29 colorectal cancer cell line. Comparative Clinical Pathology. 2018;28:985\u0026ndash;95.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFortin O, Aguilar-Uscanga B, Vu KD, Salmieri S, Lacroix M. Cancer Chemopreventive, Antiproliferative, and Superoxide Anion Scavenging Properties of Kluyveromyces marxianus and Saccharomyces cerevisiae var. boulardii Cell Wall Components. Nutr Cancer. 2018;70:83\u0026ndash;96.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBar-On YM, Phillips R, Milo R. The biomass distribution on Earth. Proc Natl Acad Sci U S A. 2018;115:6506\u0026ndash;11.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSmith RH, Glendinning L, Walker AW, Watson M. Investigating the impact of database choice on the accuracy of metagenomic read classification for the rumen microbiome. Anim Microbiome. 2022;4:57.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShen X, Li J, Li J, Zhang Y, Li X, Cui Y, et al. Fecal -- Biomarker for Noninvasive Diagnosis and Prognosis of Colorectal Laterally Spreading Tumor. Front Oncol. 2021;11:661048.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDai W, Li C, Li T, Hu J, Zhang H. Super-taxon in human microbiome are identified to be associated with colorectal cancer. BMC Bioinformatics. 2022;23:1\u0026ndash;18.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAvuthu N, Guda C. Meta-Analysis of Altered Gut Microbiota Reveals Microbial and Metabolic Biomarkers for Colorectal Cancer. Microbiol Spectr. 2022;10:e0001322.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCoker OO, Nakatsu G, Dai RZ, Wu WKK, Wong SH, Ng SC, et al. Enteric fungal microbiota dysbiosis and ecological alterations in colorectal cancer. 
Gut. 2019;68:654\u0026ndash;62.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi JQ, Li JL, Xie YH, Wang Y, Shen XN, Qian Y, et al. Saccharomyces cerevisiae may serve as a probiotic in colorectal cancer by promoting cancer cell apoptosis. J Dig Dis. 2020;21:571\u0026ndash;82.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang M, Gao C, Lessing DJ, Chu W. Saccharomyces cerevisiae SC-2201 Attenuates AOM/DSS-Induced Colorectal Cancer by Modulating the Gut Microbiome and Blocking Proinflammatory Mediators. Probiotics Antimicrob Proteins. 2024. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s12602-024-10228-0\u003c/span\u003e\u003cspan address=\"10.1007/s12602-024-10228-0\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNag D, Goel A, Padwad Y, Singh D. In Vitro Characterisation Revealed Himalayan Dairy Kluyveromyces marxianus PCH397 as Potential Probiotic with Therapeutic Properties. Probiotics Antimicrob Proteins. 2023;15:761\u0026ndash;73.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang L, Xing G. Telomere-to-Telomere Assembly Improves Host Reads Removal in Metagenomic High-Throughput Sequencing of Human Samples. bioRxiv. 2023;:2023.05.05.539517.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKim D, Song L, Breitwieser FP, Salzberg SL. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 2016;26:1721\u0026ndash;9.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOunit R, Wanamaker S, Close TJ, Lonardi S. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics. 2015;16:236.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAinsworth D, Sternberg MJE, Raczy C, Butcher SA. 
k-SLAM: accurate and ultra-fast taxonomic classification and gene identification for large metagenomic data sets. Nucleic Acids Res. 2017;45:1649\u0026ndash;56.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Metagenomics, Niche-specific database, Fungi, biomarker discovery, colorectal cancer","lastPublishedDoi":"10.21203/rs.3.rs-6766778/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6766778/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e \u003cp\u003eAccurate taxonomy assignment to sequencing reads is a key step in metagenomic studies, impacting all downstream analyses. The accuracy of this step critically depends on the quality and comprehensiveness of the used reference genome database. While fungi are ubiquitous and relevant in the human microbiome, they are generally poorly represented in current databases. To address this and other limitations, we developed B-GUT, a custom Kraken2 database that integrates i) a broad and curated collection of 2,110 fungal reference genomes; ii) the human telomere-to-telomere reference genome; and iii) two available curated databases for gut-specific bacterial and archaeal genomes.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eOur analysis of publicly available fungal genomes revealed significant contamination and substantial cross-mapping of human sequencing reads to fungal genome references, underscoring the necessity of rigorous curation and accurate host read filtering. We validated our genome curation pipeline and the resulting B-GUT database using mock microbial communities with known compositions. 
Finally, we showcased the utility of B-GUT by re-analysing data from a published colorectal cancer metagenomics study, where its use led to significantly improved results, providing more precise taxonomic assignments and a more accurate identification of differentially abundant taxa for both bacterial and fungal communities.\u003c/p\u003e\u003ch2\u003eConclusions\u003c/h2\u003e \u003cp\u003eWe introduce B-GUT, a reference genome database centered on the gut microbiome, featuring a uniquely curated and comprehensive collection of fungal genomes, often underrepresented in existing resources. We demonstrate the importance of database curation and the enhanced capacity of B-GUT for identifying biologically relevant microbes.\u003c/p\u003e","manuscriptTitle":"B-GUT reference genome database improves biomarker discovery and fungal identification in gut metagenomes","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-06-26 13:57:29","doi":"10.21203/rs.3.rs-6766778/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"6e7f0a15-6c40-4712-a2b0-dfea7dcd44f3","owner":[],"postedDate":"June 26th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2025-06-30T09:53:43+00:00","versionOfRecord":[],"versionCreatedAt":"2025-06-26 13:57:29","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-6766778","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6766778","identity":"rs-6766778","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}