Shotgun metagenomics: a deep insight into the composition and function of the complex microbial world

doi:10.21203/rs.3.rs-7581938/v1


2026 · doi:10.21203/rs.3.rs-7581938/v1

preprint OA: closed


Grazia Visci, Elisabetta Notario, Giuseppe Defazio, Mariano Francesco Caratozzolo, and 3 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-7581938/v1. This work is licensed under a CC BY 4.0 License. Status: Under Review. Version 1.

Abstract

Background

Two culture-independent methods, amplicon-based sequencing and shotgun metagenomics, have significantly advanced the study of microbial communities. To date, short-read sequencing technologies have enabled high accuracy and deep coverage, while long-read sequencing approaches are increasingly being applied to improve genome assembly, despite challenges related to sequencing accuracy and nucleic acid input requirements.
In this benchmark study, we compared the shotgun metagenomics approach across three sequencing technologies, Illumina (short reads), PacBio and Nanopore (long reads), using a commercial microbial community consisting of 20 known species. Specifically, we evaluated the effectiveness of the data generated by each platform in reconstructing and identifying specific known taxa, as well as in understanding their genetic and functional potential, considering annotated genes, length of predicted proteins and number/types of inferred functions.

Results

Illumina sequencing provided high-throughput and high-quality data, but its limited read length precluded complete genome assembly. This affected the functional analysis, leading to an underestimation of the coding and non-coding genes. Nanopore sequencing yielded the longest reads, resulting in more contiguous assemblies, although it was impacted by higher error rates and the choice of assembly method. PacBio offered the best balance between read length and base accuracy, but with a lower number of reads. This affected genome coverage for a few specific taxa, influencing the quality of their assemblies, the completeness of MAGs (Metagenome-Assembled Genomes), and the accuracy of functional annotation. Nevertheless, PacBio successfully retrieved MAGs for all mock community species, and the genome annotation was consistent with the reference.

Conclusions

This study offers a valuable framework to guide the selection of sequencing strategies in metagenomic research. Understanding the strengths and limitations of each step of metagenomic workflows, from library preparation to bioinformatic analysis, is crucial for driving its ongoing optimization.

Keywords: shotgun metagenomics, microbiome, next-generation sequencing, third-generation sequencing, MAGs, functional analysis, mock analysis

1. INTRODUCTION

Exploring the taxonomic and functional biodiversity of microbial communities is essential for understanding ecosystem complexity, considering both the organisms and their roles. Microbial communities largely populate environmental or host-related niches and include bacteria, archaea, fungi, protists and viruses. Because traditional approaches, which rely on culturing microorganisms (principally prokaryotes), may uncover only about 1% of microbial biodiversity (1), DNA sequencing-based technologies have represented a revolutionary breakthrough. In the last two decades, high-throughput sequencing (HTS) technologies have significantly enhanced our understanding of microbial communities and their essential roles in ecosystems as well as in human, animal, and plant health (2–4), paving the way to the so-called metagenomics approaches, such as amplicon-based (or DNA-metabarcoding) and shotgun metagenomics. Amplicon-based metagenomics relies on the selective amplification and sequencing of specific target genes (e.g. 16S or 18S rRNA genes, ITS) to obtain the taxonomic profile of microbial communities, while shotgun metagenomics allows for the random sequencing of the entire genetic content of these communities, providing not only taxonomic but also functional information (2, 5). Both methods are valuable for studying and characterizing microbiomes, each offering distinct advantages and being chosen based on the specific research question as well as cost considerations (3, 5). However, while DNA barcodes range from 100 to 1,600 bp in length, a prokaryotic genome, for example, has an average size of around 5 Mbp, which intuitively makes shotgun metagenomics the more informative approach.
Moreover, findings of shotgun metagenomics studies suggest that various microbiome interactions, such as horizontal gene transfer, genetic content networks or microbiota-dependent metabolites, can have significant implications for the host-microbiome relationship (6–8). The ability to explore these interactions in addition to taxonomic assignment sheds light on the human microbiome in both health and disease contexts, unveiling the molecular drivers of disease, the spread of antibiotic resistance, disease-associated genetic elements, and individual health and resilience (6–8). No less relevant are the implications in other environmental contexts, where such analyses help to explain why some ecosystem functions are more susceptible than others to successful modification and sustainability (9, 10). Shotgun metagenomics enables higher-resolution profiling of a community, including taxonomic assignment down to the strain level, identification of unknown species through de novo assembly, and the study of gene content, function, and genomic plasticity (11–13). Nevertheless, genome assembly poses significant technical challenges, due to the complexity of assembling individual bacterial genomes from mixed sequences (14). The reconstruction of nearly complete genomes as MAGs (Metagenome-Assembled Genomes) through assembly and binning (15) could enable the identification of new taxa, genes and metabolic pathways. However, genomes from different bacteria may share highly similar regions, and only when fully assembled into ungapped, circularized genomes are MAGs comparable to genomes obtained from isolated pure cultures, allowing classification down to the strain level (16). Moreover, an additional challenge arises when dealing with uncultured and uncultivable organisms that have not been sequenced and are therefore absent from reference genome databases.
This leads to an increased proportion of reads that cannot be aligned and are consequently classified as unassigned (14). The shotgun metagenomics approach also faces experimental limitations. The low DNA concentration of metagenomic samples may require amplification protocols, increasing experimental bias (17, 2). Additionally, host-DNA interference can reduce the sensitivity in detecting low-abundance bacterial species. To mitigate this issue, higher sequencing depth is required, which in turn increases overall sequencing costs to achieve adequate microbial genome coverage (18–20). Indeed, a more in-depth characterization of microbial communities requires HTS platforms, including both short- and long-read sequencing technologies, that can produce large amounts of data. Short-read sequencing has dominated microbiome studies until now, thanks to its high-quality reads, low-input protocols and high coverage (21–25), even though it requires suitable fragmentation and amplification steps (26). On the other hand, long-read sequencing technologies can yield long and ultra-long reads directly from single DNA molecules, at the cost of lower per-base accuracy and a higher amount of input DNA (19, 21, 27–29). Several benchmark studies have been conducted over time, comparing second- and third-generation sequencing platforms (21, 28–30). These studies serve as essential resources for researchers to better understand the advantages and limitations of each technology. Recently, in metagenomics, the focus has been on taxonomic profiling and meta-assembly (29). In this study, we applied the shotgun metagenomics approach to a microbial community with known composition (mock community), leveraging three distinct sequencing platforms to evaluate their performance and limitations.
Specifically, we adopted the Illumina NovaSeq 6000 for short-read sequencing, alongside the PacBio Sequel IIe System and Nanopore GridION for long-read sequencing. Using a commercially available prokaryotic mock community, we established a controlled benchmarking framework to assess sequencing accuracy, coverage and assembly efficiency. Beyond evaluating the limitations of individual sequencing protocols and the ability of different assemblers to reconstruct high-quality MAGs, a key added value of this study is the investigation of microbial taxonomic assignment and gene annotation derived from the recovered genomes. Here, we provide a test case using standards characterized by intra- and inter-species diversity, which not only outlines the strengths and weaknesses of shotgun metagenomics, but also demonstrates how current technologies can be leveraged to infer multiple aspects of the microbiome's “dark matter”.

2. MATERIALS AND METHODS

2.1. Mock community sample

The commercial mock microbial community, ATCC® 20 Strain Even Mix Genomic Material (MSA-1002™, ATCC®, USA, https://www.atcc.org/products/msa-1002), was used as a benchmark for the shotgun metagenomic study. It is composed of a mix of genomic DNA belonging to 20 fully sequenced, characterized, and authenticated ATCC Genuine Cultures (5% for each strain) (Supplementary Table S1). Fluorometric quantification and genome quality were assessed with the dsDNA HS assay for Qubit (ThermoFisher Scientific, Waltham, MA, USA) and the Genomic DNA 165 kb Kit for Femto Pulse System (Agilent, Santa Clara, CA, USA) (Supplementary Fig. 1A), respectively. The total yield of a single commercially purchased sample was about 200 ng. Thus, three mixes of genomic DNA from the same production batch (Lot 70001383) were used for the different applications as specified below.

2.2. WGS library preparation and sequencing

Different shotgun metagenomic protocols and sequencing platforms were used in this study, as described below.
The mock community DNA was used as input for library preparation and sequenced on NovaSeq 6000 (Illumina, San Diego, California, USA), GridION (Oxford Nanopore, Oxford, UK) and Sequel IIe System (PacBio, Menlo Park, California, USA). We used the same input DNA amount as a normalization factor for cross-platform comparison. Moreover, one run unit was assigned per platform: one flow cell for Illumina and ONT, and one SMRT Cell for PacBio. The mock sample was either multiplexed with other samples or run individually to maximize flow cell capacity. For each platform, the sequencing output and the number of samples multiplexed per flow cell/SMRT Cell are specified below.

Illumina Sequencing

The Illumina DNA Prep kit was used, starting from 200 ng of mock community DNA and following the manufacturer's protocol (https://support.illumina.com/content/dam/illumina-support/documents/documentation/chemistry_documentation/illumina_prep/illumina-dna-prep-reference-guide-1000000025416-09.pdf). The protocol uses bead-linked transposases to tagment DNA, generating an insert size of ~350 bp, and then includes an amplification step for the tagmented DNA. All the libraries were quality checked through the High Sensitivity DNA Assay for the 2100 Agilent Bioanalyzer (Agilent, Santa Clara, CA, USA) and quantified using the Qubit dsDNA HS assay (Thermo Fisher Scientific, Waltham, MA, USA). The library was sequenced on the NovaSeq 6000 Illumina platform with the 2 × 150 bp paired-end sequencing layout (NovaSeq 6000 S4 Reagent Kit v1.5, 300 cycles). The mock sample was multiplexed with 60 other samples in order to maximize the sequencing capacity of the single S4 flow cell (maximum flow cell output 3 Tb). Approximately 16.3 Gb were produced from the mock sample.
Nanopore Sequencing

About 200 ng of DNA were used as input for the Genomic DNA Ligation Sequencing kit (ONT SQK-LSK114) (https://nanoporetech.com/document/genomic-dna-by-ligation-sqk-lsk114?device=GridION) and sequenced on the GridION platform. This protocol allows DNA sequencing without fragmentation and amplification steps. The sample was loaded individually on a single MinION Flow Cell (R10.4.1, maximum flow cell output 50 Gb). Nanopore sequencing produced 2.8 Gb for the mock sample.

PacBio Sequencing

Library preparation was performed following the PacBio procedure and checklist “Preparing whole genome and metagenome libraries using SMRTbell® prep kit 3.0” (PN 102-166-600 - APR2022), starting from about 200 ng of fragmented DNA. According to the manufacturer's instructions, the DNA of the mock community was sheared at speed setting 35 using the Megaruptor® 3 (Hologic, Inc). The size profile of the fragmented DNA was assessed with the Genomic DNA 165 kb Kit for Femto Pulse System (Agilent, Santa Clara, CA, USA) (Supplementary Fig. 1B). Then, Binding kit 2.2, Internal control 1.0 and Sequel II Sequencing Kit 2.0 were used for sequencing on the PacBio Sequel IIe System. The mock sample was sequenced with 5 multiplexed samples on a single SMRT® Cell 8M (maximum SMRT Cell output 30 Gb). PacBio sequencing produced approximately 0.8 Gb from the mock sample.

2.3. Raw data trimming, assembly, and mapping on reference genomes

Illumina data analysis and assembly

Illumina raw sequencing data were initially quality checked by using FastQC (v0.11.9) and low-quality reads were trimmed by using Trimmomatic (v0.39, PE ILLUMINACLIP LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50) (31). Trimmed data were assembled by using two alternative approaches: megaHIT (v1.2.9, --k-list 21, 29, 39, 59, 79, 99, 119, 141 --k-step 10 --min_count 2) (32) and metaSPAdes (v3.15.5, --meta -k 21,29,39,59,79,99,119 -m 500 --phred-offset 33) (33).
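To make the quality-trimming settings above more concrete, the following Python sketch mimics the logic of the LEADING:3, TRAILING:3, SLIDINGWINDOW:4:15 and MINLEN:50 steps on a list of per-base Phred scores. It is a simplified illustration and not a reimplementation of Trimmomatic: adapter clipping (ILLUMINACLIP) is omitted, the function name `trim_read` is hypothetical, and the sliding-window step is assumed to cut at the start of the first window whose mean quality falls below the threshold.

```python
from typing import List, Optional


def trim_read(quals: List[int], leading: int = 3, trailing: int = 3,
              window: int = 4, min_avg: float = 15.0,
              min_len: int = 50) -> Optional[range]:
    """Return the retained index range of a read, or None if it is dropped.

    Steps are applied in the order listed in the Methods:
    LEADING, TRAILING, SLIDINGWINDOW, MINLEN (simplified assumption).
    """
    start, end = 0, len(quals)

    # LEADING: remove 5' bases whose quality is below the threshold
    while start < end and quals[start] < leading:
        start += 1

    # TRAILING: remove 3' bases whose quality is below the threshold
    while end > start and quals[end - 1] < trailing:
        end -= 1

    # SLIDINGWINDOW: cut at the first window whose mean quality is too low
    for i in range(start, end - window + 1):
        if sum(quals[i:i + window]) / window < min_avg:
            end = i
            break

    # MINLEN: discard reads that became too short
    if end - start < min_len:
        return None
    return range(start, end)


# Toy read: two poor 5' bases, 60 good bases, a low-quality tail
example = [2, 2] + [30] * 60 + [10] * 10 + [2]
kept = trim_read(example)
print(kept)
```

In this toy read the retained span begins after the two low-quality leading bases and ends where the four-base window first averages below 15, which is the behaviour the parameters are intended to produce.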
Nanopore data analysis and assembly

Raw Nanopore sequencing data were initially quality checked by using pycoQC (34). Porechop_ABI (v0.5.0, --ab_initio --format fastq.gz) (35) was used to identify and trim adapter sequences. Trimmed data were assembled by using metaFlye (v2.9.2-b1786, --nano-raw --meta -i 5) (36) and metaMDBG (v1.0, asm --in-ont) (37).

PacBio data analysis and assembly

PacBio HiFi data were initially quality checked by using FastQC (v0.11.9). Then, cutadapt (v4.5, --overlap 35 -e 0.1 --discard -j 5 --revcomp) (38) was applied to check the HiFi reads for adapter presence. HiFi reads containing adapters were discarded and excluded from subsequent analysis. Trimmed data were assembled by using metaFlye (v2.9.2-b1786, --pacbio-hifi --meta -i 5) (36) and metaMDBG (v1.0, asm --in-hifi) (37).

2.4 Mapping on reference genomes and reference coverage

Sequencing data were mapped on the 20 prokaryotic strain genomes by using minimap2 (v2.26-r1175). The following presets were applied: Illumina (-ax sr), Nanopore (-ax map-ont -L), and PacBio (-ax map-hifi -L). Through samtools (v1.3.1), SAM files were compressed to BAM files and sorted. Finally, sorted BAM files were used to measure genome coverage through the samtools coverage function (--ff 1284, which excludes unmapped reads, secondary alignments and reads flagged as duplicates; -d 0, to avoid any limit on coverage counts).

2.5 Assembly evaluation, binning and bin refinement

The obtained assemblies were evaluated by using metaQUAST (v5.2.0, default parameters) (39) with the -r option to map contigs and reads on reference genomes. Seqkit (v2.8.2, stats -j10 -t -a) (40) was used to retrieve overall statistics for the obtained contigs. Regardless of the sequencing and assembly approach, the obtained contigs were binned and the resulting bins refined by using metaWRAP (41).
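The exclusion mask 1284 passed to samtools coverage in section 2.4 is a sum of SAM FLAG bits. As a small illustration, the Python sketch below decodes such a mask using the bit values defined in the SAM specification; the function names are hypothetical and the snippet does not call samtools itself.

```python
from typing import List

# FLAG bits as defined in the SAM specification
SAM_FLAGS = {
    0x1: "PAIRED", 0x2: "PROPER_PAIR", 0x4: "UNMAP", 0x8: "MUNMAP",
    0x10: "REVERSE", 0x20: "MREVERSE", 0x40: "READ1", 0x80: "READ2",
    0x100: "SECONDARY", 0x200: "QCFAIL", 0x400: "DUP",
    0x800: "SUPPLEMENTARY",
}


def decode_flag_mask(mask: int) -> List[str]:
    """List the names of all FLAG bits that are set in the mask."""
    return [name for bit, name in SAM_FLAGS.items() if mask & bit]


def is_excluded(read_flag: int, exclude_mask: int = 1284) -> bool:
    """A read is skipped if it carries any bit of the exclusion mask."""
    return bool(read_flag & exclude_mask)


print(decode_flag_mask(1284))   # names of the bits that make up 1284
print(is_excluded(0x100))       # secondary alignment
print(is_excluded(0x800))       # supplementary alignment
```

Since 1284 = 4 + 256 + 1024, the mask covers the unmapped, secondary and duplicate bits; supplementary alignments (2048) are not part of it and would still be counted.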
Initial binning was performed by using metaBAT2 (v2.12.1, min contig length 1500) (42), MaxBin2 (v2.2.4, min contig length 1000) (43) and CONCOCT (v1.0.0, min contig length 1000) (44). During bin refinement, inferred MAGs were quality checked by using CheckM (v1.0.18) (45): genomes with completeness ≥ 90% and contamination ≤ 5% were marked as high quality, those with completeness ≥ 50% and contamination ≤ 10% as medium quality, and all others as low quality (15).

2.6 MAGs comparison to reference genomes

Obtained MAGs were compared to the reference genomes by using MASH (v2.3, sketch -k 21 -s 15000) (46). Both reference genomes and MAGs were sketched with 15,000 MinHash values and “all versus all” comparisons were performed. Moreover, a phylogenetic comparison of the obtained MAGs was performed by using GTDB-Tk (v2.1.1) (47). Finally, MAGs were taxonomically classified by using kMetaShot (v2.0, default options) (16).

2.7 MAGs Dereplication

The obtained MAGs with at least medium overall quality and the ATCC reference genomes were dereplicated by using the dRep (v3.5.0, dereplicate --ignoreGenomeQuality --genomeInfo) tool (48). Considering the presence of two pairs of congeneric species in the employed mock, the Ward algorithm (49) for hierarchical clustering was applied, to minimize the within-cluster variance.

2.8 MAGs Genes Annotation

Annotation of both inferred MAGs with at least medium overall quality and reference genomes downloaded from ATCC (https://www.atcc.org/, accessed on 1 March 2021) was performed by using Bakta (v1.4.0, --min-contig-length 200) (50). The protein length profile inferred in MAGs was compared to that obtained in ATCC reference genomes by performing pairwise Wilcoxon tests. Proteins labelled as hypothetical were excluded from these comparisons.
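The MAG quality tiers defined in section 2.5 amount to a simple decision rule on the completeness and contamination estimates. A minimal Python sketch of that rule is given below; the function name `mag_quality` is hypothetical, and the inputs are assumed to be percentages such as those reported by CheckM.

```python
def mag_quality(completeness: float, contamination: float) -> str:
    """Assign a quality tier from completeness and contamination (both in %)."""
    if completeness >= 90 and contamination <= 5:
        return "high"
    if completeness >= 50 and contamination <= 10:
        return "medium"
    return "low"


# Hypothetical example values, for illustration only
for comp, cont in [(98.5, 0.7), (95.0, 7.2), (62.0, 3.1), (41.0, 1.0)]:
    print(comp, cont, mag_quality(comp, cont))
```

Note that a highly complete bin can still fall into the medium or low tier when its contamination exceeds the corresponding threshold, which is why both values are checked together.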
Annotated protein products were compared between reference genomes and obtained MAGs both qualitatively, by counting the predicted protein functions shared between reference and MAGs and those private to each, and quantitatively, by measuring the Jaccard distance. The Jaccard distance was measured by using an in-house Python script.

2.9 MAGs quantification

Considering the variability in genome size of the species in the mock community (Supplementary Table S1) and the fact that an equal amount of genomic DNA was added to the mix for each species, the number of expected genome copies was estimated to infer the expected relative abundances. We estimated the mass of each genome in ng (nGM_i), assuming an average molar mass of 607.4 g/mol per base pair of dsDNA:

$$\mathrm{nGM}_{i}=\frac{\mathrm{GenomeLength}_{i}\times 607.4\ \left(\frac{\mathrm{g}}{\mathrm{mol}}\right)}{6.022\times 10^{23}\ \left(\mathrm{mol}^{-1}\right)}\times 10^{9}$$

Finally, we estimated the Genome Copy Number for each species i (GCN_i), considering that the same amount of genomic DNA was added to the mixture for each species (i.e. 10 ng):

$$\mathrm{GCN}_{i}=\frac{10}{\mathrm{nGM}_{i}}$$

For example, a 5 Mbp genome has a mass of about 5.04 × 10^-6 ng, so 10 ng corresponds to roughly 1.98 × 10^6 genome copies. The estimated genomic copies for each species and the corresponding relative abundances are shown in Supplementary Table S1. Subsequently, trimmed reads were mapped on the obtained MAGs by using minimap2 (using the same options listed in section 2.4). MAGs coverage was estimated by using the samtools coverage function (--ff 1284, which excludes unmapped reads, secondary alignments and reads flagged as duplicates; -d 0, to avoid any limit on coverage counts).

3. RESULTS

3.1 Sequencing throughput

The sequencing data obtained for each technology (Illumina, Nanopore and PacBio), pre- and post-adapter trimming, are shown in Table 1.

Table 1 Statistics regarding raw and trimmed sequencing data. For each sequencing technology the following data are shown: i) N.
of seqs: number of produced reads; ii) Yield: total throughput in bases; iii) Min len: minimum sequence length in bp; iv) Avg len: average read length in bp; v) Median len: median read length in bp; vi) Max len: maximum sequence length in bp.

| Seq technology | Data | N. of reads | Yield | Min len | Avg len | Median len | Max len |
|---|---|---|---|---|---|---|---|
| Illumina | raw data | 115,746,688 | 16,353,634,835 | 35 | 141.3 | 151 | 151 |
| Illumina | trimmed data | 108,028,131 | 15,066,924,022 | 50 | 139.5 | 151 | 151 |
| Nanopore | raw data | 2,301,340 | 2,806,148,731 | 5 | 1,219.4 | 489 | 918,116 |
| Nanopore | trimmed data | 2,299,453 | 2,724,939,449 | 1 | 1,185 | 455 | 918,116 |
| PacBio | raw data | 111,629 | 805,182,430 | 258 | 7,213 | 6,737 | 24,210 |
| PacBio | trimmed data | 111,626 | 805,164,864 | 258 | 7,213.1 | 6,737 | 24,210 |

Both Nanopore and PacBio produced longer reads than Illumina, with an average length of about 1 and 7 kb, respectively. Nanopore produced the longest read, spanning about 0.9 Mb. Considering the amount of retained sequences/bases, 93.33%/92.13% passed the trimming step for Illumina sequencing, compared with 99.91%/97.10% for Nanopore and 99.99%/99.99% for PacBio.

3.2 Genome coverage

Before performing the assembly, trimmed reads were mapped on reference genomes to evaluate the average coverage and sequencing depth. Initially we estimated the mean coverage of reference genomes (Fig. 1). The three sequencing technologies produced a variable sequencing depth, with Illumina yielding the highest values (median 441.01, IQR = 152.17, mean 1280.96), one to two orders of magnitude higher than Nanopore (median 29.91, IQR = 106.71, mean 88.56) and PacBio (median 12.84, IQR = 5.7, mean 18.43). Moreover, all the mock genomes were completely covered by Illumina (median 100, IQR = 0, mean 94.43) and Nanopore (median 100, IQR = 0, mean 94.43) (Fig. 1). PacBio (median 100, IQR = 0, mean 93.43) fully covered 19 out of 20 genomes (Fig. 1).
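As a minimal sketch of how per-genome summaries such as the median, IQR and mean depth reported above could be derived, the following Python snippet parses tab-separated text in the column layout written by `samtools coverage`. The genome names and values are hypothetical, the function name `summarise_coverage` is an assumption, and the sketch assumes one reference sequence per genome; genomes split across several replicons would first need a length-weighted aggregation.

```python
import csv
import io
import statistics

# Hypothetical values in the tab-separated column layout of `samtools coverage`
EXAMPLE = (
    "#rname\tstartpos\tendpos\tnumreads\tcovbases\tcoverage\tmeandepth\tmeanbaseq\tmeanmapq\n"
    "genome_A\t1\t1000\t30\t924\t92.4\t2.7\t40\t60\n"
    "genome_B\t1\t2000\t200\t2000\t100.0\t11.0\t40\t60\n"
    "genome_C\t1\t1500\t180\t1500\t100.0\t12.8\t40\t60\n"
    "genome_D\t1\t3000\t410\t3000\t100.0\t14.5\t40\t60\n"
    "genome_E\t1\t2500\t950\t2500\t100.0\t40.0\t40\t60\n"
)


def summarise_coverage(text: str) -> dict:
    """Summarise depth and breadth across genomes (one row per genome)."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    depths = [float(r["meandepth"]) for r in rows]
    q1, _, q3 = statistics.quantiles(depths, n=4)  # quartiles of mean depth
    return {
        "median_depth": statistics.median(depths),
        "iqr_depth": q3 - q1,
        "mean_depth": statistics.fmean(depths),
        # number of genomes whose breadth of coverage reaches 100%
        "fully_covered": sum(float(r["coverage"]) >= 100.0 for r in rows),
    }


print(summarise_coverage(EXAMPLE))
```

Reporting the median and IQR alongside the mean is useful here because a few very deeply sequenced genomes can pull the mean well above the typical depth.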
Specifically, Schaalia odontolytica achieved an overall coverage and depth of around 92.4% and 2.7X in PacBio sequencing, in contrast to depths of 384X and 28X obtained from Illumina and Nanopore sequencing, respectively (Supplementary Table S2). Moreover, we also evaluated the coverage breadth by measuring the proportion of reference genomes covered at 20X, 30X, 40X and 50X (Fig. 2). By using Illumina sequencing, all the genomes were completely covered at 50X. When considering long-read sequencing, only Escherichia coli with Nanopore was completely covered at 40X. Overall, for 4 species (namely Streptococcus agalactiae, Streptococcus mutans, Phocaeicola vulgatus and Deinococcus radiodurans) we observed that less than 50% of the genome was covered at 20X. Finally, with PacBio sequencing none of the genomes was completely covered at 20X.

3.3 Assembly Evaluation

Two different assembly algorithms were used for each sequencing technology: megaHIT and metaSPAdes for short reads, metaFlye and metaMDBG for long reads. The assembly summary statistics are shown in Table 2.

Table 2 Assembly statistics for megaHIT, metaSPAdes, metaFlye and metaMDBG: i) Contig: number of produced contigs; ii) Tot Len: assembly total length in bp; iii) % Expected Tot Len: obtained fraction (%) of the sum of lengths for reference mock genomes; iv) Min len: shortest contig length in bp; v) Avg len: average contig length in bp; vi) Max len: maximum contig length in bp; vii) N50; viii) GC (%).
| Technology | Assembly | Contig | Tot Len | % Expected Tot Len | Min len | Avg len | Max len | N50 | %GC |
|---|---|---|---|---|---|---|---|---|---|
| Illumina | megaHIT | 1,135 | 65,990,405 | 98.43% | 773 | 58,141 | 955,220 | 174,256 | 47.10 |
| Illumina | metaSPAdes | 2,144 | 66,587,045 | 99.32% | 120 | 31,057 | 1,405,032 | 232,008 | 47.10 |
| Nanopore | metaFlye | 113 | 67,286,575 | 100.37% | 1,002 | 595,456.40 | 6,374,455 | 2,227,176 | 47.18 |
| Nanopore | metaMDBG | 4,838 | 80,148,321 | 119.55% | 267 | 16,566.40 | 4,642,452 | 41,003 | 47.19 |
| PacBio | metaFlye | 358 | 60,826,078 | 90.73% | 3,125 | 169,905.20 | 6,374,538 | 1,841,921 | 47.09 |
| PacBio | metaMDBG | 581 | 65,931,520 | 98.34% | 1,097 | 113,479.40 | 6,374,527 | 2,032,857 | 47.06 |

Considering assembly contiguity, Illumina data yielded N50 values in the order of hundreds of kilobases, regardless of the assembler applied. The widest contig length distribution was obtained with metaSPAdes, ranging from 120 bp to 1.4 Mbp (Table 2). Regarding long reads, the same assembly algorithms behaved differently depending on the analysed data. When using Nanopore data, metaFlye achieved an N50 of about 2.2 Mbp, much longer than the 41 kbp obtained with metaMDBG. Moreover, metaMDBG produced the largest number of contigs, including very short ones (down to 267 bp) (Table 2). For PacBio data, however, the N50 values obtained with metaFlye and metaMDBG were similar (about 1.8 and 2.0 Mbp, respectively), as were the numbers of contigs. In this case, both assemblers produced only contigs longer than 1 kbp (Table 2).

The obtained contigs were evaluated by using MetaQUAST, which compares metagenome assemblies based on alignments to the closest reference genome. The number of contigs covering each genome and the observed coverage are shown in Fig. 3. Overall, regardless of the assembly algorithm applied, all the reference genomes were broadly covered with Illumina data, although the assemblies tended to be more fragmented. The lowest coverage (92.6%) was observed for Porphyromonas gingivalis using megaHIT, which produced 104 contigs. In contrast, Pseudomonas aeruginosa achieved the highest coverage (99.6%) with just 27 contigs using metaSPAdes (Supplementary Table S3).
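The N50 values in Table 2 follow the usual definition: the length of the shortest contig in the smallest set of longest contigs that together account for at least half of the total assembly length. The short Python sketch below illustrates this definition on hypothetical contig lengths; it is not the code used by the assembly-evaluation tools cited in this study.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # stop at the first contig that brings the running sum to >= 50%
        if running * 2 >= total:
            return length
    return ordered[-1]  # not reached for positive lengths


# Hypothetical contig lengths (bp), for illustration only
contigs = [80, 70, 50, 40, 30, 20]
print(n50(contigs))
```

Because N50 depends only on the length distribution, an assembly with many short contigs, such as the Nanopore metaMDBG assembly in Table 2, can have a low N50 even when its total length exceeds the expected one.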
Notably, Cutibacterium acnes was assembled at 99.4% coverage and with the lowest number of contigs (10) using both assemblers. For Nanopore data, metaFlye produced the most contiguous assemblies, with 4 out of 20 genomes assembled at 100% coverage. Specifically, the genome of Escherichia coli was assembled into a single contig with complete coverage. Among the remaining genomes, 7 were assembled with > 99.9% coverage, including 3 single-contig assemblies, and 9 were assembled with > 98.2% coverage, each comprising at least 3 contigs. By contrast, metaMDBG produced the most fragmented assemblies from Nanopore data (Supplementary Table S3). Considering PacBio sequencing data, 4 out of 20 genomes were assembled at 100% by metaFlye, including the genome of E. coli in only 1 contig. In this case, only 3 genomes were reconstructed at ≥ 99.9%, while 10 genomes were assembled at > 90%. Bifidobacterium adolescentis, Bacillus pacificus and Schaalia odontolytica had the least complete assemblies, at 60.1%, 48.9% and 14.1%, respectively. MetaMDBG produced different results with PacBio data. In particular, 6 out of 20 genomes were covered at 100%, including Cutibacterium acnes, Escherichia coli, Helicobacter pylori, Pseudomonas aeruginosa and Streptococcus mutans, each assembled in a single contig. Of the others, 4 genomes were assembled at > 99.9%, and 8 genomes at > 95.6% with at least 3 contigs. Finally, compared to the metaFlye assemblies, the percentage of assembled genome coverage of B. pacificus and S. odontolytica increased to 92.5% and 76.7%, respectively (Supplementary Table S3). The largest contig obtained for both long-read sequencing approaches, Nanopore and PacBio, corresponded exactly to the Pseudomonas aeruginosa genome (6,374,461 bp).

3.4 MAGs evaluation

Contig binning and refinement were performed by using MetaWRAP, which relies on MetaBAT2, CONCOCT and MaxBin2 and produces MAGs by integrating the bins obtained with each approach.
Then the quality of the MAGs was evaluated by using CheckM: genomes with completeness ≥ 90% and contamination ≤ 5% were marked as high quality, those with completeness ≥ 50% and contamination ≤ 10% as medium quality, and all others as low quality. The results obtained are summarized in Table 3. MAGs were taxonomically annotated by using kMetaShot and a phylogenetic tree including the reference ATCC genomes was built through GTDB-Tk.

Table 3 Summary of the number and quality of obtained MAGs per each assembly approach. High quality genomes: completeness ≥ 90% and contamination ≤ 5%; medium quality genomes: completeness ≥ 50% and contamination ≤ 10%; low quality genomes: completeness < 50% or contamination > 10%.

| Technology | Assembly | Number of MAGs | Quality of MAGs |
|---|---|---|---|
| Illumina | megaHIT | 18 | 16 high; 2 medium |
| Illumina | metaSPAdes | 20 | 15 high; 5 medium |
| Nanopore | metaFlye | 18 | 18 high |
| Nanopore | metaMDBG | 24 | 4 high; 13 medium; 7 low |
| PacBio | metaFlye | 19 | 17 high; 1 medium; 1 low |
| PacBio | metaMDBG | 20 | 19 high; 1 medium |

kMetaShot was able to classify all the obtained MAGs, regardless of their quality, and the obtained classification corresponded to the expected species. Nonetheless, it is worth noting that Rhodobacter sphaeroides and Propionibacterium acnes have been renamed Cereibacter sphaeroides (51) and Cutibacterium acnes (52), respectively. Finally, Bacillus pacificus ATCC 10987 is annotated in the NCBI taxonomy as Bacillus cereus ATCC 10987, and was labelled accordingly by kMetaShot. All the expected genomes were correctly recovered only when assembling Illumina data with metaSPAdes and PacBio data with metaMDBG (Table 3). In contrast, when using megaHIT we were unable to retrieve the genomes of Streptococcus agalactiae and Staphylococcus epidermidis, both belonging to genera represented by two species in the mock community (Fig. 4). With Nanopore data processed by metaFlye we obtained 18 high quality MAGs, but the two Staphylococcus spp. were missing.
All the expected species were retrieved by using metaMDBG contigs, although according to the kMetaShot classification two MAGs were observed for Cutibacterium acnes (high and low), Helicobacter pylori (medium, low), Porphyromonas gingivalis (both low) and Staphylococcus epidermidis (medium, low). This resulted in an overestimation of the number of MAGs, from 20 to 24 (Table 3, Fig. 4). Regarding PacBio data, by using metaFlye the only missing species was Schaalia odontolytica. When assembling PacBio data with metaMDBG, all the expected genomes were retrieved, with 19 high quality MAGs and 1 medium quality MAG, the latter corresponding to Schaalia odontolytica (Table 3, Fig. 4). Furthermore, because both kMetaShot classification and GTDB-Tk phylogeny are supervised approaches that rely on reference genome collections and taxonomy, an unsupervised clustering and dereplication through ANI (Average Nucleotide Identity) was performed with dRep. It relies on a two-step approach: a rapid, approximate ANI estimate through MASH sketches is used to infer primary clusters (ANI ≥ 90%), followed by a secondary clustering based on a more precise ANI estimate by fastANI (ANI ≥ 95%). Finally, once secondary clustering was complete, a reference genome per cluster was defined by taking into account genomic features (completeness, contamination, size, and strain heterogeneity) as well as assembly and clustering quality metrics (N50 and centrality). Dereplication results excluding low quality MAGs are shown in Supplementary Table S4 and secondary clustering representative genomes are labelled in Fig. 3. dRep identified 20 clusters, one for each mock species, and a complete correspondence between clusters and kMetaShot taxonomic classification was observed.
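To illustrate the sketch-based step described above, the following Python example computes a Jaccard index between k-mer sets and converts it into a Mash distance with the published formula D = -(1/k) ln(2j / (1 + j)), where 1 - D serves as a rough ANI estimate. This is a simplified sketch under stated assumptions: exact k-mer sets stand in for the MinHash sketches that MASH actually uses (15,000 hashes in this study), reverse complements are ignored, the pairwise threshold check is only a stand-in for the hierarchical clustering performed by dRep, and all function names are hypothetical.

```python
import math


def kmer_set(seq: str, k: int) -> set:
    """All k-mers of a sequence (no canonicalisation, for brevity)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard index: shared k-mers over all distinct k-mers."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mash_distance(j: float, k: int = 21) -> float:
    """Mash distance from a Jaccard index j for k-mer size k."""
    if j <= 0.0:
        return 1.0
    return max(0.0, -math.log(2.0 * j / (1.0 + j)) / k)


def passes_primary_threshold(j: float, k: int = 21,
                             min_ani: float = 0.90) -> bool:
    """Check whether the approximate ANI (1 - D) reaches the primary cutoff."""
    return 1.0 - mash_distance(j, k) >= min_ani


# Toy sequences with a tiny k, for illustration only
a = kmer_set("ACGTACGTGACCTGA", 4)
b = kmer_set("ACGTACGTGACGTGA", 4)
j = jaccard(a, b)
print(j, mash_distance(j, k=4))
```

With k = 21, even a modest Jaccard index corresponds to a high ANI, which is why a coarse sketch comparison is sufficient for forming primary clusters before the more precise secondary step.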
Regarding the choice of the reference genome per secondary cluster, 7 out of 20 were chosen among the ATCC reference genomes, while the other 13 were chosen among the MAGs obtained through the assembly of PacBio and Nanopore data (2 Nanopore + metaFlye, 6 PacBio + metaMDBG, and 5 PacBio + metaFlye). The genome sizes of the obtained MAGs were compared with those of the reference genomes (Fig. 5), using the kMetaShot classification as a guide. Regardless of sequencing technology and assembly method, the genome sizes of Rhodobacter sphaeroides and Deinococcus radiodurans were overestimated compared to the reference genomes. For short reads, the Bifidobacterium adolescentis genome size was overestimated with both metaSPAdes and megaHIT, and Bacillus pacificus displayed the same trend when using megaHIT. The assembled genome sizes of Bacillus pacificus, Bifidobacterium adolescentis, Helicobacter pylori, and Porphyromonas gingivalis obtained by using metaFlye exceeded the expected size. Similarly, Acinetobacter baumannii, Phocaeicola vulgatus, Staphylococcus aureus subsp. aureus and Staphylococcus epidermidis MAGs showed the same trend in PacBio assemblies regardless of the assembler applied (Supplementary Table S5). Short-read sequencing allowed complete reconstruction with a 100% match to the reference for only 2 MAGs, whereas long-read sequencing approaches enabled the binning, with a perfect match, of 8 complete MAGs using Nanopore data (metaFlye) and 8 complete MAGs using PacBio data (metaMDBG) (Supplementary Table S5). Finally, considering the mock bacterial species belonging to the same genus, we observed that none of the MAGs obtained by assembling Illumina data using megaHIT were identified as Staphylococcus epidermidis or Staphylococcus aureus. By contrast, Streptococcus agalactiae and Streptococcus mutans were taxonomically identified, although only the latter had a genome size matching 100% of the reference genome.
A different result was obtained when Illumina data were assembled with metaSPAdes. In this case, both Staphylococcus and Streptococcus species were identified, but with a lower match to their respective reference genomes (S. epidermidis 55.2% and S. aureus 40.4%; S. agalactiae 82.7% and S. mutans 58.7%) (Supplementary Table S5). Among long-read approaches, the results obtained with Nanopore data were similar to those observed with short reads. Specifically, no MAGs derived from Nanopore data assembled with metaFlye were identified as S. epidermidis or S. aureus. Conversely, the MAGs classified as S. agalactiae and S. mutans matched the reference genomes at 99.3% and 98.8%, respectively. However, assembling Nanopore data with metaMDBG allowed the reconstruction and classification of MAGs for both Staphylococcus and Streptococcus species, albeit with a lower correspondence to their reference genomes (Supplementary Table S5). For PacBio data assembled with either metaFlye or metaMDBG, Staphylococcus epidermidis and Staphylococcus aureus MAGs were taxonomically classified, although their genome sizes were overestimated compared to the reference. Meanwhile, Streptococcus agalactiae MAGs matched the reference genome at 99.7% (metaFlye) and 99.8% (metaMDBG), whereas Streptococcus mutans matched at 100% (Supplementary Table S5).

Furthermore, we used MASH to measure the distance between the obtained MAGs and their corresponding reference genomes (Supplementary Fig. 2). Regardless of the sequencing technology or assembly approach applied, high-quality MAGs showed a distance from the reference genomes below 1%. For medium-quality MAGs, MASH distances within 2% were observed, with the sole exception of Staphylococcus aureus. In this case, the two MAGs assembled from Illumina data (i.e.
MASH distance of 1.9% with megaHIT and 8.7% with metaSPAdes) were the least similar to the reference genome, even when compared to the low-quality MAG assembled from Nanopore reads using metaMDBG (1.4%).

Finally, we compared the expected species abundances (Supplementary Table 1) with those inferred from MAG coverage and relative abundances (Supplementary Fig. 3) for each combination of sequencing technology and assembly approach. Regardless of the sequencing strategy and assembly method, substantial discrepancies were found between observed and expected abundances.

3.5 MAGs Genes Annotation

Medium- and high-quality MAGs were functionally annotated using Bakta. To avoid discrepancies due to different annotation pipelines, the ATCC reference genomes were also re-annotated with the same tool. Initially, the numbers of annotated genes of each type (i.e. CDS, tRNA, rRNA, ncRNA and tmRNA) were compared (Fig. 6). Overall, comparable numbers of CDS were obtained in MAGs and ATCC reference genomes, with few differences. An underestimation of CDS was observed in 8 out of 20 species (namely H. pylori, L. gasseri, N. meningitidis, P. gingivalis, S. aureus, S. epidermidis, S. agalactiae and S. mutans) for MAGs obtained from short reads, regardless of the assembly method (Fig. 6). For long reads, the number of predicted CDS was not influenced by the assembly method in PacBio data, while for Nanopore data metaMDBG tended to produce less accurate annotations than metaFlye (Fig. 6). In MAGs retrieved from long reads, both Nanopore and PacBio, the number of predicted ncRNA genes was generally similar to that observed in the reference genomes. Moreover, the number of predicted ncRNA genes in long-read-derived MAGs was influenced by assembly quality and reference genome coverage (Fig. 6). For instance, for MAGs obtained by assembling Nanopore data through metaMDBG and assigned to C. beijerinckii and N.
meningitidis (both medium quality and with a reference genome coverage ≤ 90%), an underestimation of both rRNA and tRNA genes was observed. Similarly, S. odontolytica MAGs obtained with PacBio data were the least accurate in terms of ncRNA gene annotation because of their lower completeness (73.12%, Supplementary Table S5). An underestimation of annotated ncRNA genes was observed in MAGs obtained from short reads, regardless of the assembly method. The extent of the underestimation was associated with both MAG completeness and gene redundancy (Supplementary Table S5). Indeed, for both Illumina-derived MAGs classified as L. gasseri (both medium quality), the number of tRNA genes was underestimated compared to the reference genome and no rRNA genes were identified at all (Supplementary Table S6). Regarding gene redundancy, the numbers of alanine and isoleucine tRNA genes were underestimated in 14 out of 20 species (Supplementary Table S6). Although both A. baumannii MAGs retrieved from short reads were classified as high quality, each contained just 1 of the 7 expected alanine tRNA genes (Supplementary Table S6). A comparison of these seven genes showed that 6 were identical while one was unique, sharing 71% similarity with the others. The alanine tRNA gene retrieved in both MAGs corresponded to the unique one in the reference genome (Supplementary Table S6).

Furthermore, we evaluated the quality of protein-coding gene annotations by comparing the protein length profiles of MAGs and ATCC reference genomes (Fig. 7). Overall, no relevant differences were observed between the length profiles of MAGs retrieved from short reads and those of the reference genomes, with the sole exception of L. gasseri (medium-quality MAGs). For MAGs obtained by binning metaMDBG and metaFlye contigs from Nanopore reads, we observed statistically significant differences for 14 and 10 out of 20 species, respectively, 8 of them in common (namely C. sphaeroides, D.
radiodurans, H. pylori, N. meningitidis, P. vulgatus, S. odontolytica, S. agalactiae and S. mutans). Finally, for MAGs retrieved from PacBio data, we observed statistically significant differences in protein length profiles in 4 out of 20 species (B. cereus group, B. adolescentis, C. beijerinckii and S. odontolytica). Specifically, for all 4 species the metaMDBG MAGs were rated as medium quality, while the only MAG obtained with metaFlye (C. beijerinckii) reached a high-quality classification.

Finally, we evaluated the annotated protein-coding genes by comparing the functions predicted in the ATCC reference genomes with those predicted in the obtained MAGs. A qualitative representation of the results is available in Fig. 8. PacBio-derived MAGs were the closest to the reference genomes (metaFlye Jaccard: mean 4.02%, median 0.65%; metaMDBG Jaccard: mean 4.39%, median 1.10%), with only three species showing relevant differences: S. odontolytica (metaMDBG, medium-quality genome), B. pacificus (B. cereus group, metaMDBG) and B. adolescentis (metaFlye). These data were consistent with the observed reference genome coverage and assembly quality (Supplementary Table 4). Furthermore, three metaFlye-assembled MAGs, classified as S. mutans, P. aeruginosa and E. coli, were the only ones to achieve a Jaccard distance from the reference equal to 0. Acceptable results were also obtained for Illumina-derived MAGs (metaSPAdes Jaccard: mean 12.03%, median 4.22%; megaHIT Jaccard: mean 6.42%, median 3.70%). The species with the largest distance from the reference genome were Staphylococcus spp., Streptococcus spp. and L. gasseri. Nanopore data assembled with metaFlye gave results comparable to those obtained with short reads (Jaccard: mean 8.73%, median 6.75%). H. pylori MAGs showed the largest dissimilarity from the reference genome (41.2%).
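For readers less familiar with the metric, the Jaccard distances reported in this paragraph compare two sets of predicted functions. The sketch below shows the calculation on made-up function labels. How functions were actually encoded for the comparison is not specified here, so the labels and the function name are hypothetical.

```python
def jaccard_distance(reference_functions, mag_functions):
    """Return 1 - |A ∩ B| / |A ∪ B| for two collections of function labels."""
    a, b = set(reference_functions), set(mag_functions)
    if not a and not b:
        return 0.0  # two empty annotations are treated as identical
    return 1.0 - len(a & b) / len(a | b)


# Hypothetical labels: the MAG lacks one function found in the reference.
reference = {"dnaA", "gyrB", "recA", "rpoB"}
mag = {"dnaA", "gyrB", "recA"}
distance = jaccard_distance(reference, mag)
print(f"Jaccard distance: {distance:.1%}")
```

A distance of 0 therefore means that the MAG and its reference carry exactly the same set of predicted functions, while larger values reflect functions that are missing from, or additional in, the MAG.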
Finally, Nanopore MAGs obtained through metaMDBG assembly were the furthest from the reference genomes (Jaccard: mean 24.00%, median 26.05%).

4. DISCUSSION

Our understanding of the complex network between microorganisms and their surrounding environment, including the diversity, structure and dynamics of microbial communities, is still incomplete because of the challenges that metagenomic studies face during library preparation, sequencing and analysis (9, 53). Shotgun metagenomics is an informative approach to rapidly obtain a compositional profile of the investigated microbial community (i.e. reference-based) or to retrieve nearly complete microbial genomes, or MAGs (i.e. assembly-based). The latter is attracting growing interest in the research community, partly because of decreasing sequencing costs (54). In this benchmark study, the impact of different sequencing technologies and metagenome assemblers on the retrieval of high-quality MAGs was thoroughly evaluated. Cutting-edge sequencing platforms, NovaSeq 6000 (Illumina), GridION (Nanopore) and Sequel IIe (PacBio), were employed. Moreover, functional annotation and the concordance between reference and retrieved genomes were investigated to find the best-performing sequencer-assembler combination. Together, these aspects provide an updated point of view with respect to previous works (21, 29).

A commercial mock community composed of 20 bacterial species belonging to 18 genera was used, with a total DNA yield of 200 ng. The limited total DNA yield and the presence of phylogenetically related (i.e. congeneric) species reflect conditions that may occur naturally in biological samples. Moreover, the limited number of microorganisms in this mock community allows a more thorough investigation of technical aspects.
For short-read sequencing (Illumina), library preparation involved chemical DNA fragmentation and a PCR-based amplification step. As a result, Illumina generated the highest data output, reaching hundreds of millions of reads, a coverage of 100X and the highest sequencing depth for all mock genomes. For long-read sequencing, we instead selected the available amplification-free protocols, in line with the goal of single-molecule sequencing: producing reads longer than 10 kb and removing amplification bias while preserving base modifications (55). The ligation sequencing protocol without a fragmentation step was chosen for Nanopore, whereas for PacBio we sequenced medium-sized fragments obtained by mechanical fragmentation (peak size of about 12 kbp). In both cases, the DNA input did not fully meet protocol requirements. Nonetheless, we obtained an adequate number of reads for bioinformatic analysis, with Nanopore producing a higher yield and longer reads than PacBio. PacBio HiFi sequencing generated fewer reads overall, but trimming did not affect their number, reflecting the inherently high base accuracy of HiFi data. This observation is consistent with recent literature showing that, although Nanopore can produce longer and more abundant reads, PacBio HiFi delivers highly accurate long reads that ultimately support superior genome reconstruction and higher-quality metagenome assemblies (37, 56). Compared to the other technologies, the reduced number of reads obtained with PacBio sequencing probably affected coverage and sequencing depth only for S. odontolytica. The strong impact of sequencing depth and uneven coverage on MAG recovery observed for this low-abundance taxon is consistent with previous studies highlighting coverage as a major limiting factor for metagenome assembly and genome-resolved metagenomics (55, 59).
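A back-of-the-envelope calculation helps to show why a low-abundance taxon is the first to lose coverage when read yield drops. The sketch below approximates the expected depth of a genome as the sequenced bases attributable to it divided by its genome size. It assumes that abundances are expressed as fractions of total DNA, and the yields, fractions and genome sizes are hypothetical rather than values from this study.

```python
def expected_depth(total_bases, dna_fraction, genome_size):
    """Approximate mean sequencing depth of one genome in a metagenome."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases * dna_fraction / genome_size


# Hypothetical community: (fraction of total DNA, genome size in bp).
community = {
    "abundant_taxon": (0.18, 5_000_000),
    "rare_taxon": (0.0002, 2_400_000),
}

# Hypothetical total yields, in bases, for a high-output and a lower-output run.
runs = (("high_yield", 30_000_000_000), ("low_yield", 3_000_000_000))

for label, total_bases in runs:
    for taxon, (fraction, size) in community.items():
        depth = expected_depth(total_bases, fraction, size)
        print(f"{label}\t{taxon}\t{depth:.2f}x")
```

With these toy numbers the abundant taxon remains deeply covered in both runs, whereas the rare taxon falls below 1x in the lower-yield run, a depth at which assembly of a complete genome is not realistic.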
The effort put into library preparation and into maximizing sequencing yield was crucial to avoid insufficient sequencing depth and coverage, which is a critical aspect of metagenome reconstruction (37) and can influence the performance of meta-assembly tools (36). In fact, metagenome assembly is the crucial step for retrieving near-complete, high-quality MAGs. Two main classes of algorithm were used in this benchmark: de Bruijn graphs (57) and repeat graphs (58). The first is mainly used for short-read assembly, although applications to third-generation sequencing data are also available (37). The second is particularly suitable for long-read assembly. Here, two tools were employed for each sequencing technology to account for the impact of the assembly step. For Illumina short reads, the de Bruijn graph-based tools megaHIT and metaSPAdes were used, which differ in their heuristics and optimizations. Overall, metaSPAdes produced assemblies with both a larger N50 and longer contigs than megaHIT. The metaSPAdes assembly allowed all the expected species to be retrieved, with higher quality than megaHIT. In general, metaSPAdes performed slightly better than megaHIT, which also missed MAGs for two expected species (i.e. S. agalactiae, S. epidermidis). For long-read data, the repeat graph-based tool metaFlye and the de Bruijn graph-based tool metaMDBG were employed. We chose two alternative approaches because string graph-based approaches are known to be poor at capturing low-abundance microorganisms and strain heterogeneity, which negatively affects MAG quality (59). On the other hand, de Bruijn graphs rely on exact k-mer matching, which is affected by the lower accuracy of long reads compared to short ones. metaFlye on Nanopore and metaMDBG on PacBio assemblies reached the highest N50 values (~2 Mbp) and, consistently, yielded the highest numbers of high-quality MAGs.
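Since N50 is used throughout as the contiguity measure, a minimal sketch of its calculation may be useful: contigs are sorted from longest to shortest, and N50 is the length of the contig at which the running total first reaches half of the assembly size. The function name and the contig lengths below are illustrative only.

```python
def n50(contig_lengths):
    """Length of the contig at which the cumulative size reaches half the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("contig_lengths must not be empty")


# Hypothetical assemblies of the same total size (4 Mbp).
fragmented = [400_000] * 10
contiguous = [2_000_000, 800_000, 500_000, 400_000, 300_000]
print(n50(fragmented), n50(contiguous))
```

Two assemblies of identical total length can therefore differ widely in N50, which is why the value is read here as a proxy for how few contigs are needed to cover each genome.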
This higher yield of high-quality MAGs from long-read data is consistent with previous HiFi- and Nanopore-based genome-resolved metagenomic studies, which showed that long and accurate reads substantially increase the number and quality of recovered MAGs compared to short-read assemblies (37). Furthermore, whereas Nanopore with both assemblers and PacBio with metaFlye missed MAGs for some expected species, metaMDBG on PacBio retrieved MAGs for all the mock community species. Thus, the Illumina-metaSPAdes and PacBio-metaMDBG combinations capture more biodiversity than the others by distinguishing congeneric species. By contrast, both Illumina-megaHIT and Nanopore-metaMDBG fail to discriminate congeneric genomes. In terms of assembly contiguity, the combination of PacBio and metaMDBG yielded the highest number of reference genomes covered by the lowest number of contigs (12), followed by Nanopore and metaFlye (8) (Fig. 2). This indicates that the repeat graph algorithm of metaFlye can compensate for the lower read quality and coverage of Nanopore data, while the sparse de Bruijn graph of metaMDBG performs better with long, high-quality PacBio reads. In line with earlier platform comparisons, third-generation sequencing generally improves assembly contiguity compared to Illumina short reads (55). Notably, when the obtained MAGs were compared with their reference genomes using sketch-based distances, higher MAG quality corresponded to lower measured distances. Dereplication of the obtained MAGs and ATCC reference genomes through ANI estimation produced intriguing results, with 13 out of 20 representative genomes chosen among long-read MAGs (Fig. 3, Supplementary Table S4). This further indicates that the retrieved MAGs, particularly those from the PacBio-metaMDBG combination, are more accurate than their reference genome counterparts. MAG sizes were also compared with those of the reference genomes (Fig. 4).
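The sketch-based distances mentioned above follow the Mash formulation, in which the Jaccard index j of two k-mer sets is converted into a distance D = -(1/k) ln(2j / (1 + j)). The example below applies that formula to exact k-mer sets of two short strings. It is a simplification: Mash estimates j from MinHash sketches of canonical k-mers rather than from full sets, and the sequences and the small k used here are purely illustrative.

```python
import math


def kmers(sequence, k):
    """Set of all k-mers in a sequence (forward strand only, a simplification)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def mash_distance(seq_a, seq_b, k=21):
    """Mash distance computed from exact k-mer sets instead of MinHash sketches."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    union = a | b
    if not union:
        raise ValueError("sequences must be at least k bases long")
    j = len(a & b) / len(union)
    if j == 0:
        return 1.0  # no shared k-mers: maximum distance
    # Clamp to [0, 1]; also avoids returning -0.0 for identical inputs.
    return min(1.0, max(0.0, -math.log(2 * j / (1 + j)) / k))


# Illustrative toy sequences sharing a single 5-mer.
d = mash_distance("AAAAACCCCC", "AAAAAGGGGG", k=5)
print(round(d, 4))
```

Because the distance is a monotonic function of the shared k-mer fraction, fragmented or incomplete MAGs that lack part of the reference k-mer content appear more distant, which is consistent with the relationship between MAG quality and distance noted above.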
Overall, Illumina data tended to estimate genome sizes incorrectly, while Nanopore assemblies were influenced by the assembler applied. PacBio data were more consistent with the reference genomes and less affected by the assembly approach. Regardless of sequencing technology and assembly approach, D. radiodurans and R. sphaeroides genome sizes were overestimated. This finding could be due to the overall quality of the reference genomes, which are both fragmented. In fact, in Fig. 3, two PacBio-metaFlye MAGs were chosen as centroids by dRep dereplication for these two species.

The key novelty of this work lies in the comprehensive evaluation of how different sequencing technologies and assembly strategies influence genome annotation, specifically in terms of the number of annotated genes, the length of predicted proteins, and the number and types of inferred functions. First, regarding gene prediction, MAGs from short reads and from Nanopore coupled with metaMDBG tend to underestimate CDS and non-coding genes (Fig. 5), mainly because of MAG incompleteness and gene redundancy. By contrast, all copies of redundant genes, such as alanine tRNA genes, can be annotated in PacBio MAGs as well as in Nanopore-metaFlye MAGs. Second, regarding protein length distribution, Fig. 4 intriguingly shows a statistically significant underestimation of protein length for MAGs from long reads. In particular, 14, 10, 4, 1 and 1 MAGs showed a protein length underestimation for the Nanopore-metaMDBG, Nanopore-metaFlye, PacBio-metaMDBG, PacBio-metaFlye and Illumina-metaSPAdes combinations, respectively, highlighting the impact of MAG completeness on functional annotation (Fig. 5). It is worth noting that Liu et al. applied five rounds of genome polishing to retrieve highly accurate MAGs when assembling ONT reads with metaFlye and binning with metaWRAP (60). Polishing is a bioinformatic procedure that aims to reduce misassemblies by comparing contigs with raw reads.
Indeed, our results confirm what has already been reported in the literature: ONT assemblies are less accurate than PacBio ones (58). To reduce the impact of misassemblies in genomic projects, several rounds of polishing are suggested, including with short reads (61). As shown by Liu et al., applying polishing to metagenomic data is computationally expensive. For this reason we did not include this step in our analysis, although its absence may explain the lower accuracy of gene annotation in ONT data compared to the other technologies. Finally, regarding the comparison of inferred functions between reference genomes and MAGs, the Nanopore-metaMDBG combination performed worst, followed by Illumina, Nanopore-metaFlye and PacBio-metaMDBG, which performed similarly. By contrast, PacBio-metaFlye showed the best performance, with perfect concordance for 3 MAGs.

5. CONCLUSIONS

High-throughput sequencing techniques have revolutionized the field of microbial ecology. Constant improvements in shotgun metagenomics protocols and assemblers hold great potential for advancing our understanding, thanks to the rise of long-read sequencing approaches (3, 4, 14, 53). However, assembling the genomes of all bacteria within a single microbial niche remains challenging (14). Incomplete reference genomes with gaps, horizontal gene transfer between species, and the obstacles posed by low-biomass samples, such as tissue microbiomes, are all challenges that still need to be overcome to achieve genome-resolved metagenomics (9, 62). The goal of the scientific community is to obtain comprehensive insights into microbial diversity, genomic function and interactions among microorganisms. From this perspective, this benchmark study offers a comparative framework of the sequencing strategies currently available in the metagenomic field and of their limitations.
Our findings may contribute to future methodological advancements aimed at improving strain-resolved metagenome assemblies for microbial characterization, even in highly complex environments.

Abbreviations

HTS: high-throughput sequencing technologies
MAGs: metagenome-assembled genomes
ANI: Average Nucleotide Identity
CDS: protein-coding DNA sequence
tRNA: transfer RNA
rRNA: ribosomal RNA
tmRNA: transfer-messenger RNA
ncRNA: non-coding RNA

Declarations

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Availability of data and materials
The datasets generated during the current study are available in the ENA repository, reference number BioProject PRJEB89875.

Competing interests
The authors declare that they have no competing interests.

Funding
This work was supported by projects: Life Science Hub Regione Puglia (LSH-Puglia, T4-AN-01 H93C22000560003), INNOVA - Italian network of excellence for advanced diagnosis (PNC-EJ-2022-23683266 PNC-HLS-DA), DARE - DigitAl lifelong pRevEntion initiative (PNC-I.1 "Research initiatives for innovative technologies and pathways in the health and welfare sector" D.D. 931 of 06/06/2022, code PNC0000002, CUP: B53C22006420001), and by ELIXIR-IT through the PNRR Project ELIXIRxNextGenIT - ELIXIR x NextGenerationIT: consolidation of the Italian Infrastructure for Omics Data and Bioinformatics (Grant Code IR0000010, CUP: B53C22000690005).

Authors' contributions
Conceptualization, G.V., E.N., B.F., M.M. and G.P.; methodology, G.V., E.N., G.D., M.F.C., B.F. and M.M.; validation, G.V., E.N., G.D., B.F. and M.M.; formal analysis, G.V., E.N., G.D., B.F. and M.M.; investigation, G.V., E.N., G.D., B.F. and M.M.; resources, G.P.; data curation, G.V., E.N., G.D., B.F. and M.M.; writing (original draft preparation), G.V., E.N., G.D., B.F. and M.M.; writing (review and editing), G.V., E.N., G.D., M.F.C., B.F., M.M. and G.P.; visualization, E.N., G.V. and G.D.; supervision, G.P.; project administration, B.F.
and M.M.; funding acquisition, G.P. All authors have read and agreed to the published version of the manuscript.

Acknowledgements
Not applicable.

References

1. Staley, J. T. & Konopka, A. Measurement of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats. Annu. Rev. Microbiol. 39 (1), 321–346 (1985).
2. Pérez-Cobas, A. E., Gomez-Valero, L. & Buchrieser, C. Metagenomic approaches in microbial ecology: an update on whole-genome and marker gene sequencing analyses. Microb. Genomics 6 (8) (2020).
3. Bharti, R. & Grimm, D. G. Current challenges and best-practice protocols for microbiome analysis. Brief. Bioinform. 22 (1), 178–193 (2021).
4. Purushothaman, S., Meola, M. & Egli, A. Combination of Whole Genome Sequencing and Metagenomics for Microbiological Diagnostics. IJMS 23 (17), 9834 (2022).
5. Notario, E. et al. Amplicon-Based Microbiome Profiling: From Second- to Third-Generation Sequencing for Higher Taxonomic Resolution. Genes 14 (8), 1567 (2023).
6. Smillie, C. S. et al. Ecology drives a global network of gene exchange connecting the human microbiome. Nature 480 (7376), 241–244 (2011).
7. Li, C., Chen, J. & Li, S. C. Understanding Horizontal Gene Transfer network in human gut microbiota. Gut Pathog. 12 (1), 33 (2020).
8. Jiang, Y. et al. GutMetaNet: an integrated database for exploring horizontal gene transfer and functional redundancy in the human gut microbiome. Nucleic Acids Res. 53 (D1), D772–D782 (2025).
9. Carabeo-Pérez, A., Guerra-Rivera, G., Ramos-Leal, M. & Jiménez-Hernández, J. Metagenomic approaches: effective tools for monitoring the structure and functionality of microbiomes in anaerobic digestion systems. Appl. Microbiol. Biotechnol. 103 (23–24), 9379–9390 (2019).
10. Silverstein, M. R., Segrè, D. & Bhatnagar, J. M. Environmental microbiome engineering for the mitigation of climate change. Glob. Change Biol. 29 (8), 2050–2066 (2023).
11. Anyansi, C., Straub, T. J., Manson, A. L., Earl, A. M. & Abeel, T.
Computational Methods for Strain-Level Microbial Detection in Colony and Metagenome Sequencing Data. Front. Microbiol. 11, 1925 (2020).
12. Lapidus, A. L. & Korobeynikov, A. I. Metagenomic Data Assembly – The Way of Decoding Unknown Microorganisms. Front. Microbiol. 12, 613791 (2021).
13. Pinto, Y. & Bhatt, A. S. Sequencing-based analysis of microbiomes. Nat. Rev. Genet. 25 (12), 829–845 (2024).
14. Kim, N. et al. Genome-resolved metagenomics: a game changer for microbiome medicine. Exp. Mol. Med. 56 (7), 1501–1512 (2024).
15. The Genome Standards Consortium et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 35 (8), 725–731 (2017).
16. Defazio, G., Tangaro, M. A., Pesole, G. & Fosso, B. kMetaShot: a fast and reliable taxonomy classifier for metagenome-assembled genomes. Brief. Bioinform. 26 (1), bbae680 (2024).
17. Sabina, J. & Leamon, J. H. Bias in Whole Genome Amplification: Causes and Considerations. In: Kroneis, T. (ed.) Whole Genome Amplification. Methods in Molecular Biology, vol. 1347, 15–41 (Springer New York, New York, NY, 2015).
18. Nelson, M. T. et al. Human and Extracellular DNA Depletion for Metagenomic Analysis of Complex Clinical Infection Samples Yields Optimized Viable Microbiome Profiles. Cell Rep. 26 (8), 2227–2240.e5 (2019).
19. Pereira-Marques, J. et al. Impact of Host DNA and Sequencing Depth on the Taxonomic Resolution of Whole Metagenome Sequencing for Microbiome Analysis. Front. Microbiol. 10, 1277 (2019).
20. McArdle, A. J. & Kaforou, M. Sensitivity of shotgun metagenomics to host DNA: abundance estimates depend on bioinformatic tools and contamination is the main issue. Access Microbiol. 2 (4) (2020).
21. Latorre-Pérez, A., Pascual, J., Porcar, M. & Vilanova, C. A lab in the field: applications of real-time, in situ metagenomic sequencing. Biology Methods Protocols 5 (1), bpaa016 (2020).
22. Kim, H. J. et al.
Microbial profiling of peri-implantitis compared to the periodontal microbiota in health and disease using 16S rRNA sequencing. J. Periodontal Implant Sci. 53 (1), 69 (2023).
23. Arredondo, A. et al. Comparative 16S rRNA gene sequencing study of subgingival microbiota of healthy subjects and patients with periodontitis from four different countries. J. Clin. Periodontology 50 (9), 1176–1187 (2023).
24. Marzano, M. et al. Farnesoid X receptor activation by the novel agonist TC-100 (3α, 7α, 11β-Trihydroxy-6α-ethyl-5β-cholan-24-oic Acid) preserves the intestinal barrier integrity and promotes intestinal microbial reshaping in a mouse model of obstructed bile acid flow. Biomed. Pharmacother. 153, 113380 (2022).
25. Tumolo, M. et al. Linking feed, biodiversity, and filtration performance in a Self-Forming Dynamic Membrane BioReactor (SFD MBR) treating canning wastewater. J. Water Process. Eng. 66, 106031 (2024).
26. Roy, G., Prifti, E., Belda, E. & Zucker, J. D. Deep learning methods in metagenomics: a review. Microb. Genomics 10 (4) (2024).
27. Ben Khedher, M., Ghedira, K., Rolain, J. M., Ruimy, R. & Croce, O. Application and Challenge of 3rd Generation Sequencing for Clinical Bacterial Studies. IJMS 23 (3), 1395 (2022).
28. Govender, K. N. & Eyre, D. W. Benchmarking taxonomic classifiers with Illumina and Nanopore sequence data for clinical metagenomic diagnostic applications. Microb. Genomics 8 (10) (2022).
29. Meslier, V. et al. Benchmarking second and third-generation sequencing platforms for microbial metagenomics. Sci. Data 9 (1), 694 (2022).
30. Bokulich, N. A., Ziemski, M., Robeson, M. S. & Kaehler, B. D. Measuring the microbiome: Best practices for developing and benchmarking microbiomics methods. Comput. Struct. Biotechnol. J. 18, 4048–4062 (2020).
31. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30 (15), 2114–2120 (2014).
32. Li, D., Liu, C. M., Luo, R., Sadakane, K. & Lam, T. W.
MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31 (10), 1674–1676 (2015).
33. Nurk, S., Meleshko, D., Korobeynikov, A. & Pevzner, P. A. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 27 (5), 824–834 (2017).
34. Leger, A. & Leonardi, T. pycoQC, interactive quality control for Oxford Nanopore Sequencing. JOSS 4 (34), 1236 (2019).
35. Bonenfant, Q., Noé, L. & Touzet, H. Porechop_ABI: discovering unknown adapters in Oxford Nanopore Technology sequencing reads for downstream trimming. Bioinformatics Advances 3 (1), vbac085 (2023).
36. Kolmogorov, M. et al. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat. Methods 17 (11), 1103–1110 (2020).
37. Benoit, G. et al. High-quality metagenome assembly from long accurate reads with metaMDBG. Nat. Biotechnol. 42 (9), 1378–1383 (2024).
38. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet j. 17 (1), 10 (2011).
39. Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29 (8), 1072–1075 (2013).
40. Shen, W., Le, S., Li, Y. & Hu, F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS ONE 11 (10), e0163962 (2016).
41. Uritskiy, G. V., DiRuggiero, J. & Taylor, J. MetaWRAP - a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome 6 (1), 158 (2018).
42. Kang, D. D. et al. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, e7359 (2019).
43. Wu, Y. W., Simmons, B. A. & Singer, S. W. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics 32 (4), 605–607 (2016).
44. Alneberg, J. et al. CONCOCT: Clustering cONtigs on COverage and ComposiTion (arXiv, 2013).
45. Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W.
CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25 (7), 1043–1055 (2015).
46. Ondov, B. D. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17 (1), 132 (2016).
47. Chaumeil, P. A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. GTDB-Tk v2: memory friendly classification with the genome taxonomy database. Bioinformatics 38 (23), 5315–5316 (2022).
48. Olm, M. R., Brown, C. T., Brooks, B. & Banfield, J. F. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J. 11 (12), 2864–2868 (2017).
49. Ward, J. H. Hierarchical Grouping to Optimize an Objective Function. J. Am. Stat. Assoc. 58 (301), 236–244 (1963).
50. Schwengers, O. et al. Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microb. Genomics 7 (11) (2021).
51. Hördt, A. et al. Analysis of 1,000+ Type-Strain Genomes Substantially Improves Taxonomic Classification of Alphaproteobacteria. Front. Microbiol. 11, 468 (2020).
52. Nouioui, I. et al. Genome-Based Taxonomic Classification of the Phylum Actinobacteria. Front. Microbiol. 9, 2007 (2018).
53. Filardo, S., Di Pietro, M. & Sessa, R. Current progresses and challenges for microbiome research in human health: a perspective. Front. Cell. Infect. Microbiol. 14, 1377012 (2024).
54. Van Rossum, T., Ferretti, P., Maistrenko, O. M. & Bork, P. Diversity within species: interpreting strains in microbiomes. Nat. Rev. Microbiol. 18 (9), 491–506 (2020).
55. Amarasinghe, S. L. et al. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 21 (1), 30 (2020).
56. Bein, B. et al. Long-read sequencing and genome assembly of natural history collection samples and challenging specimens. Genome Biol. 26 (1), 25 (2025).
57. Compeau, P. E. C., Pevzner, P. A.
& Tesler, G. How to apply de Bruijn graphs to genome assembly. Nat. Biotechnol. 29 (11), 987–991 (2011).
58. Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37 (5), 540–546 (2019).
59. Bickhart, D. M. et al. Generating lineage-resolved, complete metagenome-assembled genomes from complex microbial communities. Nat. Biotechnol. 40 (5), 711–719 (2022).
60. Liu, L., Yang, Y., Deng, Y. & Zhang, T. Nanopore long-read-only metagenomics enables complete and high-quality genome reconstruction from mock and complex metagenomes. Microbiome 10 (1), 209 (2022).
61. Luan, T. et al. Benchmarking short and long read polishing tools for nanopore assemblies: achieving near-perfect genomes for outbreak isolates. BMC Genom. 25 (1), 679 (2024).
62. Hoang, M. T. V., Irinyi, L., Hu, Y., Schwessinger, B. & Meyer, W. Long-Reads-Based Metagenomics in Clinical Diagnosis With a Special Focus on Fungal Infections. Front. Microbiol. 12, 708550 (2022).

Additional Declarations
No competing interests reported.

Supplementary Files
Supplementaryfilescaption.docx
SupplementaryTableS4.xlsx
SupplementaryTableS3.xlsx
SupplementaryTableS5.xlsx
SupplementaryTableS2.xlsx
SupplementaryTableS6.xlsx
SupplementaryTableS1.xlsx
SupplementaryFigureS1.jpg
SupplementaryFigureS2.png
SupplementaryFigureS3.png

Status: Under Review
Editorial decision: Revision requested, 16 Jan, 2026
Reviews received at journal, 07 Jan, 2026
Reviews received at journal, 30 Dec, 2025
Reviewers agreed at journal, 10 Dec, 2025
Reviewers agreed at journal, 10 Dec, 2025
Reviewers invited by journal, 08 Dec, 2025
Submission checks completed at journal, 05 Dec, 2025
First submitted to journal, 04 Dec, 2025
© Research Square 2026 | ISSN 2693-5015 (online) {"props":{"pageProps":{"initialData":{"identity":"rs-7581938","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":581652977,"identity":"2b6b59f0-958c-4f88-b542-db04ba7dbd73","order_by":0,"name":"Grazia Visci","email":"","orcid":"","institution":"University of Bari Aldo Moro","correspondingAuthor":false,"prefix":"","firstName":"Grazia","middleName":"","lastName":"Visci","suffix":""},{"id":581652978,"identity":"87dbabb0-d862-4936-8190-952b02ea5d4a","order_by":1,"name":"Elisabetta Notario","email":"","orcid":"","institution":"National Research Council","correspondingAuthor":false,"prefix":"","firstName":"Elisabetta","middleName":"","lastName":"Notario","suffix":""},{"id":581652979,"identity":"d3a6432a-1f14-408a-829f-5243034f9009","order_by":2,"name":"Giuseppe Defazio","email":"","orcid":"","institution":"University of Bari Aldo Moro","correspondingAuthor":false,"prefix":"","firstName":"Giuseppe","middleName":"","lastName":"Defazio","suffix":""},{"id":581652980,"identity":"b3fd0f4c-ab9a-4193-8967-9d09f42231f3","order_by":3,"name":"Mariano Francesco Caratozzolo","email":"","orcid":"","institution":"National Research 
Council","correspondingAuthor":false,"prefix":"","firstName":"Mariano","middleName":"Francesco","lastName":"Caratozzolo","suffix":""},{"id":581652981,"identity":"bf5dc8b7-d655-4c66-aba0-821dbf78367f","order_by":4,"name":"Bruno Fosso","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAz0lEQVRIiWNgGAWjYBACAwglByIYHzAwSCQA6QZitBiDCGYDqJZGvHqQtbBJAIkEBkLWmEufMXvwgcFATr7/8LFq3h0WefwNzO0P8Gmx7MsxN5zBYGBscCMt7TbvGYliiQOEHHaGx0yah+FP4gYJHrPbvG0SiQ1EajFInN9/xqwYpGU+0VoaDuSYMYO0bCCkxbKHrUxyhgHYL8mSc9skig0PMzbOwKfFnId5m8SHCnCIHfzwtq0uT+54OzAICQIDZA4zYfWjYBSMglEwCggAAP5TQt8HcpbsAAAAAElFTkSuQmCC","orcid":"","institution":"University of Bari Aldo Moro","correspondingAuthor":true,"prefix":"","firstName":"Bruno","middleName":"","lastName":"Fosso","suffix":""},{"id":581652982,"identity":"bcb515be-cb56-4c27-8d1f-4f678acc717a","order_by":5,"name":"Marinella Marzano","email":"","orcid":"","institution":"National Research Council","correspondingAuthor":false,"prefix":"","firstName":"Marinella","middleName":"","lastName":"Marzano","suffix":""},{"id":581652983,"identity":"d6af9c14-8eb0-47db-bca2-31bd95e2118f","order_by":6,"name":"Graziano Pesole","email":"","orcid":"","institution":"University of Bari Aldo Moro","correspondingAuthor":false,"prefix":"","firstName":"Graziano","middleName":"","lastName":"Pesole","suffix":""}],"badges":[],"createdAt":"2025-09-10 10:38:26","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-7581938/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-7581938/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":101458745,"identity":"a393b731-d53f-467a-a44d-19c6fe032863","added_by":"auto","created_at":"2026-01-30 01:01:26","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":254193,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003e\u003cstrong\u003eDistributions of 
sequencing depths for contigs with length of at least 100 bp obtained with Illumina, Nanopore and PacBio sequencing platforms.\u003c/strong\u003e\u003c/em\u003e\u003cem\u003e \u003c/em\u003e\u003cem\u003e\u003cstrong\u003eEach dot represents a contig and is colored according to its coverage.\u003c/strong\u003e\u003c/em\u003e\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-7581938/v1/bcac60479335fb79c1b0c203.png"},{"id":101458730,"identity":"2ceb3c7b-9e79-4d37-9c3f-6bc72e8faa0a","added_by":"auto","created_at":"2026-01-30 01:01:17","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":479494,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003e\u003cstrong\u003eFor each reference genome, coverage breadth was calculated as the proportion of the genome covered at a minimum depth of 20x, 30x, 40x, and 50x.\u003c/strong\u003e\u003c/em\u003e\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-7581938/v1/1fd83e482aeafbcccaace044.png"},{"id":101458727,"identity":"3e976ae8-4ac2-4211-874c-951e6538a9cd","added_by":"auto","created_at":"2026-01-30 01:01:17","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":414304,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003e\u003cstrong\u003eObserved coverage (color) and number of fragments (x axis) covering the reference genomes (y axis) for combinations of sequencing platforms and assemblers (panels).\u003c/strong\u003e\u003c/em\u003e\u003c/p\u003e","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-7581938/v1/6137daf41dd9bd9f2e3e753f.png"},{"id":101458731,"identity":"e4ed0f8a-66a0-4748-8bad-ec8f894d7b09","added_by":"auto","created_at":"2026-01-30 01:01:18","extension":"png","order_by":4,"title":"Figure 
4","display":"","copyAsset":false,"role":"figure","size":554495,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003e\u003cstrong\u003eTree representing the phylogenetic relationship among the observed Hybrid MAGs and ATCC reference genomes.\u003c/strong\u003e\u003c/em\u003e\u003cem\u003e Circular heatmaps correspond to MAGs quality (Quality), the combination of sequencing approach and assembly method (Technology) and species scientific name (Species). dRep representative genomes are marked on tree leaves with *.\u003c/em\u003e\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-7581938/v1/927f527d0e4e9e0c040213fa.png"},{"id":101751872,"identity":"94374b64-d2d4-40de-8107-cf0a1418dd4c","added_by":"auto","created_at":"2026-02-03 10:24:05","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":507879,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003e\u003cstrong\u003eComparison of the MAGs size to the expected genomes length for each combination of sequencing technology and assembly approach\u003c/strong\u003e\u003c/em\u003e\u003cem\u003e. 
Each dot represents the longest contig and is colored according to coverage value.\u003c/em\u003e\u003c/p\u003e","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-7581938/v1/e65ac3c5391eda2d4a715d3d.png"},{"id":101458733,"identity":"b835a44b-54eb-4155-be55-a443a71d89cc","added_by":"auto","created_at":"2026-01-30 01:01:18","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":497577,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003e\u003cstrong\u003eNumber of annotated genes for each type (cds, tRNA, rRNA, ncRNA and tmRNA) per each genome.\u003c/strong\u003e\u003c/em\u003e\u003c/p\u003e","description":"","filename":"floatimage6.png","url":"https://assets-eu.researchsquare.com/files/rs-7581938/v1/0311d68a92a029f221d7d140.png"},{"id":101458738,"identity":"146f0995-5e6f-46e1-940d-49727c9aaf0b","added_by":"auto","created_at":"2026-01-30 01:01:19","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":550270,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003e\u003cstrong\u003eBoxplot showing the proteins length distribution profile. \u003c/strong\u003e\u003c/em\u003e\u003cem\u003eA panel per Mock species is drawn. Pairwise comparisons were performed by using the Wilcoxon test and protein length profile from the corresponding ATCC genome as reference. 
Relevant results are shown by using stars (* p-value ≤ 0.05, ** p- value ≤ 0.01, *** p- value ≤ 0.001, **** p- value ≤ 0.0001).\u003c/em\u003e\u003c/p\u003e","description":"","filename":"floatimage7.png","url":"https://assets-eu.researchsquare.com/files/rs-7581938/v1/048fcfb959ec250a8618c02f.png"},{"id":101751678,"identity":"ddfd43e7-aa82-41c3-bef7-1bac73298c72","added_by":"auto","created_at":"2026-02-03 10:22:17","extension":"png","order_by":8,"title":"Figure 8","display":"","copyAsset":false,"role":"figure","size":710927,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cem\u003e\u003cstrong\u003eStacked bar-plot showing the number of protein functions annotated in ATCC reference genome and the obtained MAGs. \u003c/strong\u003e\u003c/em\u003e\u003cem\u003eEach panel corresponds to a combination of sequencing approach and assembly method. Per each species the number of shared functions among the reference genome and the corresponding MAGs are shown in khaki, those exclusively found in reference genome in dark green and those peculiar to the exploited methodology in the color listed in the legend. In each bar is also shown the measured Jaccard distance among the reference genome and MAGs annotation (in percentage). 
The largest is the value the larger are the discrepancy in protein annotations.\u003c/em\u003e\u003c/p\u003e","description":"","filename":"floatimage8.png","url":"https://assets-eu.researchsquare.com/files/rs-7581938/v1/5344348c68034c1fb53499c6.png"},{"id":101756212,"identity":"c854d8ee-d51c-44b9-b295-e2a5a8eeea70","added_by":"auto","created_at":"2026-02-03 10:57:01","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":5992461,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7581938/v1/8cf1aa64-63fe-40d8-bf4b-3d915c2c18c2.pdf"},{"id":101458741,"identity":"89d68f16-9d71-48b9-89f2-2a7f9dac14b3","added_by":"auto","created_at":"2026-01-30 01:01:19","extension":"docx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":19102,"visible":true,"origin":"","legend":"","description":"","filename":"Supplementaryfilescaption.docx","url":"https://assets-eu.researchsquare.com/files/rs-7581938/v1/7d529945e7c428de56786804.docx"},{"id":101751896,"identity":"f6580b5a-f512-47c3-94f6-af1988325e1c","added_by":"auto","created_at":"2026-02-03 10:24:19","extension":"xlsx","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":15295,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryTableS4.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-7581938/v1/5d496ea91b1c20c5f8783441.xlsx"},{"id":101458728,"identity":"212196bd-e7f9-46e5-8db3-ff63423d7b4a","added_by":"auto","created_at":"2026-01-30 
01:01:17","extension":"xlsx","order_by":3,"title":"","display":"","copyAsset":false,"role":"supplement","size":14359,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryTableS3.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-7581938/v1/c8df0f80005da3d7d641a06a.xlsx"},{"id":101458732,"identity":"3fd1fe66-2a82-45c4-9bbe-193628ddfc78","added_by":"auto","created_at":"2026-01-30 01:01:18","extension":"xlsx","order_by":4,"title":"","display":"","copyAsset":false,"role":"supplement","size":23408,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryTableS5.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-7581938/v1/67d7d176695c229e8cf78b72.xlsx"},{"id":101458742,"identity":"3aaac3e1-ea0f-4f39-8efe-4231e2313e2c","added_by":"auto","created_at":"2026-01-30 01:01:19","extension":"xlsx","order_by":5,"title":"","display":"","copyAsset":false,"role":"supplement","size":26273,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryTableS2.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-7581938/v1/f8befb306fb9b91ddc9a3950.xlsx"},{"id":101751895,"identity":"34969f13-bd20-4355-81b9-c5e1462d56ed","added_by":"auto","created_at":"2026-02-03 10:24:19","extension":"xlsx","order_by":6,"title":"","display":"","copyAsset":false,"role":"supplement","size":26294,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryTableS6.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-7581938/v1/76a7e5fdf98302549c1fff77.xlsx"},{"id":101458735,"identity":"9d3995a0-09d9-4008-8ff6-96fa0e00611e","added_by":"auto","created_at":"2026-01-30 
01:01:18","extension":"xlsx","order_by":7,"title":"","display":"","copyAsset":false,"role":"supplement","size":10760,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryTableS1.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-7581938/v1/b65e990b82686b9fdd940a89.xlsx"},{"id":101458736,"identity":"8d1548d7-4b78-489a-8158-b3884af0d4f7","added_by":"auto","created_at":"2026-01-30 01:01:18","extension":"jpg","order_by":8,"title":"","display":"","copyAsset":false,"role":"supplement","size":499873,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryFigureS1.jpg","url":"https://assets-eu.researchsquare.com/files/rs-7581938/v1/9f3b6bceba3404877f570f75.jpg"},{"id":101751980,"identity":"709c8d3a-0a64-4e74-b42e-de07fbbf76c7","added_by":"auto","created_at":"2026-02-03 10:24:38","extension":"png","order_by":9,"title":"","display":"","copyAsset":false,"role":"supplement","size":565598,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryFigureS2.png","url":"https://assets-eu.researchsquare.com/files/rs-7581938/v1/6d0e8e9917e580ed95e2b3cd.png"},{"id":101458740,"identity":"73bb8d6b-67f0-48e2-bd67-64c242f67aee","added_by":"auto","created_at":"2026-01-30 01:01:19","extension":"png","order_by":10,"title":"","display":"","copyAsset":false,"role":"supplement","size":386584,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryFigureS3.png","url":"https://assets-eu.researchsquare.com/files/rs-7581938/v1/dc0f6b43b57f0df880f11698.png"}],"financialInterests":"No competing interests reported.","formattedTitle":"Shotgun metagenomics: a deep insight into the composition and function of the complex microbial world","fulltext":[{"header":"1. INTRODUCTION","content":"\u003cp\u003eExploring the taxonomic and functional biodiversity of microbial communities is essential for understanding ecosystem complexity, considering both the organisms and their roles. 
Microbial communities largely populate environmental or host-related niches and include bacteria, archaea, fungi, protists and viruses. As traditional approaches, relying on isolation in culture of microorganisms, principally prokaryotes, may uncover only about 1% of microbial biodiversity (\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e), DNA-sequencing based technologies have represented a revolutionary breakthrough. In the last two decades, high throughput sequencing technologies (HTS) have significantly enhanced our understanding of microbial communities and their essential roles in ecosystems as well as in human, animal, and plant health (\u003cspan additionalcitationids=\"CR3\" citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e), paving the way to the so-called metagenomics approaches, such as amplicon-based (or DNA-metabarcoding) and shotgun metagenomics. Amplicon-based metagenomics relies on the selective amplification and sequencing of specific target genes (i.e. 16S or 18S rRNA genes, ITS) to obtain the taxonomic profile of microbial communities, while shotgun metagenomics allows for the random sequencing of the entire genetic content of these communities providing not only taxonomic but also functional information (\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e, \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e). Both methods are valuable for studying and characterizing microbiomes, each offering distinct advantages and being chosen based on the specific research question as well as cost considerations (\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e, \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e). 
However, while DNA barcodes range from 100 to 1,600 bp in length, a prokaryotic genome has an average size of around 5 Mbp, which intuitively makes shotgun metagenomics the more informative approach. Moreover, findings of shotgun metagenomics studies suggest that various microbiome interactions, such as horizontal gene transfer, genetic content networks or microbiota-dependent metabolites, can have significant implications for the host-microbiome relationship (\u003cspan additionalcitationids=\"CR7\" citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e). The ability to explore these interactions, in addition to taxonomic assignment, sheds light on the human microbiome in both health and disease, unveiling the molecular drivers of disease, the spread of antibiotic resistance, disease-associated genetic elements, and individual health and resilience (\u003cspan additionalcitationids=\"CR7\" citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e). Equally relevant are the implications in environmental contexts, where this approach helps to explain why some ecosystem functions are more susceptible than others to successful modification and sustainability (\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e, \u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e). 
Shotgun metagenomics enables higher-resolution profiling of a community, including taxonomic assignment down to the strain level, identification of unknown species through \u003cem\u003ede novo\u003c/em\u003e assembly, and the study of gene content, function, and genomic plasticity (\u003cspan additionalcitationids=\"CR12\" citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e). Nevertheless, genome assembly poses significant technical challenges, due to the complexity of assembling individual bacterial genomes from mixed sequences (\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e). The reconstruction of nearly complete genomes as metagenome-assembled genomes (MAGs) through assembly and binning (\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e) could enable the identification of new taxa, genes and metabolic pathways. However, genomes from different bacteria may share highly similar regions, and MAGs become comparable to genomes obtained from isolated pure cultures, allowing classification down to the strain level, only when they are fully assembled into ungapped, circularized genomes (\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e). An additional challenge arises when dealing with uncultured and uncultivable organisms that have not been sequenced and are therefore absent from reference genome databases. This leads to an increased proportion of reads that cannot be aligned and are consequently classified as unassigned (\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e). From an experimental point of view, the shotgun metagenomics approach also faces limitations. 
The low DNA concentration of some metagenomic samples may require amplification protocols, which increase experimental bias (\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e, \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e). Additionally, host-DNA interference can reduce the sensitivity in detecting low-abundance bacterial species. To mitigate this issue, higher sequencing depth is required, which in turn increases overall sequencing costs to achieve adequate microbial genome coverage (\u003cspan additionalcitationids=\"CR19\" citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e). Indeed, a more in-depth characterization of microbial communities requires HTS platforms, including both short- and long-read sequencing technologies, that are able to produce large amounts of data. Short-read sequencing has dominated microbiome studies until now, thanks to its high-quality reads, low-input protocols and high coverage (\u003cspan additionalcitationids=\"CR22 CR23 CR24\" citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e), although it requires suitable fragmentation and amplification steps (\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e). Long-read sequencing technologies, on the other hand, can yield long and ultra-long reads directly from single DNA molecules, at the cost of lower per-base accuracy and a higher DNA input requirement (\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e, \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e, \u003cspan additionalcitationids=\"CR28\" citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e). 
Several benchmark studies have been conducted over time, comparing second- and third-generation sequencing platforms (\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e, \u003cspan additionalcitationids=\"CR29\" citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e). These studies serve as essential resources for researchers to better understand the advantages and limitations of each technology. Recently, in metagenomics, the focus has been on taxonomic profiling and meta-assembly (\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eIn this study, we applied the shotgun metagenomics approach to a microbial community with known composition (mock community), leveraging three distinct sequencing platforms to evaluate their performance and limitations. Specifically, we adopted Illumina NovaSeq 6000 for short-read sequencing, alongside PacBio Sequel System IIe and Nanopore GridION for long-read sequencing. Using a commercially available prokaryotic mock community, we established a controlled benchmarking framework to assess sequencing accuracy, coverage and assembly efficiency. Beyond evaluating the limitations of individual sequencing protocols and the ability of different assemblers to reconstruct high-quality MAGs, a key added value of this study is the investigation of microbial taxonomic assignment and gene annotation derived from the recovered genomes. Here, we provide a test case using standards characterized by intra- and inter-species diversity, which not only outlines the strengths and weaknesses of shotgun metagenomics, but also demonstrates how current technologies can be leveraged to infer multiple aspects of the microbiome's \u0026ldquo;dark matter\u0026rdquo;.\u003c/p\u003e"},{"header":"2. 
MATERIALS AND METHODS","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003e2.1. Mock community sample\u003c/h2\u003e \u003cp\u003eThe commercial mock microbial community, ATCC\u0026reg; 20 Strain Even Mix Genomic Material (MSA-1002\u0026trade;, ATCC\u0026reg;, USA, \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.atcc.org/products/msa-1002\u003c/span\u003e\u003cspan address=\"https://www.atcc.org/products/msa-1002\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e), was used as a benchmark for the shotgun metagenomic study. It is composed of a mix of genomic DNA belonging to 20 fully sequenced, characterized, and authenticated ATCC Genuine Cultures (5% for each strain) (\u003cb\u003eSupplementary Table \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e\u003c/b\u003e). Fluorometric quantification and genome quality were assessed with the Qubit dsDNA HS assay (Thermo Fisher Scientific, Waltham, MA, USA) and the Genomic DNA 165 kb Kit for the Femto Pulse System (Agilent, Santa Clara, CA, USA) (\u003cb\u003eSupplementary Fig.\u0026nbsp;1A\u003c/b\u003e), respectively. The total yield of a single commercially purchased sample was about 200 ng. Thus, three genomic DNA mixes from the same production batch (Lot 70001383) were used for the different applications, as specified below.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e2.2. WGS library preparation and sequencing\u003c/h2\u003e \u003cp\u003eDifferent shotgun metagenomic protocols and sequencing platforms were used in this study, as described below. The mock community DNA was used as input for library preparation and sequenced on the NovaSeq 6000 (Illumina, San Diego, California, USA), GridION (Oxford Nanopore, Oxford, UK) and Sequel IIe System (PacBio, Menlo Park, California, USA). We used the same input DNA amount as a normalization factor for cross-platform comparison. 
Moreover, one run unit was assigned per platform\u0026mdash;one flow cell for Illumina and ONT, and one SMRT Cell for PacBio. The mock sample was either multiplexed with other samples or run individually to maximize flow cell capacity. For each platform, the sequencing output and the number of samples multiplexed per FlowCell/SMRT cell are specified below.\u003c/p\u003e \u003cp\u003e \u003cem\u003eIllumina Sequencing\u003c/em\u003e \u003c/p\u003e \u003cp\u003eIllumina DNA Prep kit was used, starting from 200 ng DNA of the mock community, following the protocol instruction (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://support.illumina.com/content/dam/illumina-support/documents/documentation/chemistry_documentation/illumina_prep/illumina-dna-prep-reference-guide-1000000025416-09.pdf\u003c/span\u003e\u003cspan address=\"https://support.illumina.com/content/dam/illumina-support/documents/documentation/chemistry_documentation/illumina_prep/illumina-dna-prep-reference-guide-1000000025416-09.pdf\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e). The protocol uses bead-linked transposases to tagment DNA, generating an insert size of ~\u0026thinsp;350bp, and then includes a step of amplification of the tagmented DNA. All the libraries were quality checked through High Sensitivity DNA Assay for 2100 Agilent Bioanalyzer (Agilent, Santa Clara, CA, USA) and quantified using the Qubit dsDNA HS assay (Thermo Fisher Scientific, Waltham, MA, USA). The library was sequenced on the Novaseq 6000 Illumina platform with the 2 \u0026times; 150 bp paired-end sequencing layout (NovaSeq 6000 S4 Reagent Kit v1.5\u0026ndash;300 cycles). The mock sample was loaded in multiplexing with 60 other samples in order to maximize the sequencing capacity of the single S4 flow cell (maximum flow cell output 3Tb). 
Approximately 16.3 Gb were produced from the mock sample.\u003c/p\u003e \u003cp\u003e \u003cem\u003eNanopore Sequencing\u003c/em\u003e \u003c/p\u003e \u003cp\u003eAbout 200 ng of DNA were used as input for Genomic DNA Ligation Sequencing kit (ONT SQK-LSK114) (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://nanoporetech.com/document/genomic-dna-by-ligation-sqk-lsk114?device=GridION\u003c/span\u003e\u003cspan address=\"https://nanoporetech.com/document/genomic-dna-by-ligation-sqk-lsk114?device=GridION\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) and sequenced on GridION platform. This protocol allows DNA sequencing without fragmentation and amplification steps. The sample was loaded individually on a single MinION Flow Cell (R10.4.1, maximum flow cell output 50Gb). Nanopore sequencing produced 2.8 Gb for the mock sample.\u003c/p\u003e \u003cp\u003e \u003cem\u003ePacBio Sequencing\u003c/em\u003e \u003c/p\u003e \u003cp\u003eLibrary preparation was performed following PacBio procedure and checklist: \u0026ldquo;Preparing whole genome and metagenome libraries using SMRTbell\u0026reg; prep kit 3.0\u0026rdquo; (PN 102-166-600 - APR2022) starting from about 200 ng of fragmented DNA. According to the manufacturer's instructions, the DNA of the mock community was sheared with 35 speeds by the Megaruptor\u0026reg;3 (Hologic, Inc). The genomic profile of fragmented DNA was assessed with Genomic DNA 165 kb Kit for Femto Pulse System (Agilent, Santa Clara, CA, USA) (\u003cb\u003eSupplementary Fig.\u0026nbsp;1B\u003c/b\u003e). Then, Binding kit 2.2, Internal control 1.0 and Sequel II Sequencing Kit 2.0 were used for sequencing on the PacBio Sequel IIe System. The mock sample was sequenced with 5 multiplexed samples on a single SMRT\u0026reg; Cell 8M (maximum SMRT cell output 30Gb). 
PacBio sequencing produced approximately 0.8 Gb from the mock sample.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e2.3. Raw data trimming, assembly, and mapping on reference genomes\u003c/h2\u003e \u003cp\u003e \u003cem\u003eIllumina data analysis and assembly\u003c/em\u003e \u003c/p\u003e \u003cp\u003eIllumina raw sequencing data were initially quality checked by using FastQC (v0.11.9), and low-quality reads were trimmed by using Trimmomatic (v0.39, PE ILLUMINACLIP LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50) (\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e). Trimmed data were assembled by using two alternative approaches: MEGAHIT (v1.2.9, --k-list 21,29,39,59,79,99,119,141 --k-step 10 --min_count 2) (\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e) and metaSPAdes (v3.15.5, --meta -k 21,29,39,59,79,99,119 -m 500 --phred-offset 33) (\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003cem\u003eNanopore data analysis and assembly\u003c/em\u003e \u003c/p\u003e \u003cp\u003eRaw Nanopore sequencing data were initially quality checked by using pycoQC (\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e). Porechop_ABI (v0.5.0, --ab_initio --format fastq.gz) (\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e) was used to identify and trim adapter sequences. 
Trimmed data were assembled by using metaFlye (v2.9.2-b1786, --nano-raw --meta -i 5) (\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e) and metaMDBG (v1.0, asm --in-ont) (\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003cem\u003ePacBio data analysis and assembly\u003c/em\u003e \u003c/p\u003e \u003cp\u003ePacBio HiFi data were initially quality checked by using FastQC (v0.11.9). Then, cutadapt (v4.5, --overlap 35 -e 0.1 --discard -j 5 --revcomp) (\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e) was applied to check the HiFi reads for adapter presence. HiFi reads containing adapters were discarded and excluded from subsequent analysis. The remaining reads were assembled by using metaFlye (v2.9.2-b1786, --pacbio-hifi --meta -i 5) (\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e) and metaMDBG (v1.0, asm --in-hifi) (\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003e2.4 Mapping on reference genomes and reference coverage\u003c/h2\u003e \u003cp\u003eSequencing data were mapped on the 20 prokaryotic strain genomes by using minimap2 (v2.26-r1175). The following presets were applied: Illumina (-ax sr), Nanopore (-ax map-ont -L), and PacBio (-ax map-hifi -L). Using samtools (v1.3.1), SAM files were converted to BAM format and sorted. 
Finally, sorted BAM files were used to measure genome coverage through the samtools coverage function (-ff 1284, to exclude unmapped reads, secondary alignments and reads flagged as duplicates; -d 0, to remove the limit on coverage depth).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003e2.5 Assembly evaluation, binning and bin refinement\u003c/h2\u003e \u003cp\u003eThe obtained assemblies were evaluated by using metaQUAST (v5.2.0, default parameters) (\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e) with the -r option to map contigs and reads to the reference genomes. SeqKit (v2.8.2, stats -j10 -t -a) (\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e) was used to compute summary statistics for the obtained contigs.\u003c/p\u003e \u003cp\u003eRegardless of the sequencing and assembly approach, the obtained contigs were binned and the resulting bins were refined by using metaWRAP (\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e). Initial binning was performed by using metaBAT2 (v2.12.1, minimum contig length 1500) (\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e), MaxBin2 (v2.2.4, minimum contig length 1000) (\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e) and CONCOCT (v1.0.0, minimum contig length 1000) (\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e). 
During bin refinement, the inferred MAGs were quality checked by using CheckM (v1.0.18) (\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e): genomes with completeness\u0026thinsp;\u0026ge;\u0026thinsp;90% and contamination\u0026thinsp;\u0026le;\u0026thinsp;5% were marked as high quality, those with completeness\u0026thinsp;\u0026ge;\u0026thinsp;50% and contamination\u0026thinsp;\u0026le;\u0026thinsp;10% as medium quality, and all others as low quality (\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003e2.6 MAGs comparison to reference genomes\u003c/h2\u003e \u003cp\u003eThe obtained MAGs were compared to the reference genomes by using Mash (v2.3, sketch -k 21 -s 15000) (\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e). Both reference genomes and MAGs were sketched with 15,000 min-hashes each, and \u0026ldquo;all versus all\u0026rdquo; comparisons were performed. Moreover, a phylogenetic comparison of the obtained MAGs was performed by using GTDB-Tk (v2.1.1) (\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e). Finally, MAGs were taxonomically classified by using kMetaShot (v2.0, default options) (\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec9\" class=\"Section2\"\u003e \u003ch2\u003e2.7 MAGs Dereplication\u003c/h2\u003e \u003cp\u003eThe obtained MAGs with at least medium overall quality and the ATCC reference genomes were dereplicated by using the dRep (v3.5.0, dereplicate --ignoreGenomeQuality --genomeInfo) tool (\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e). 
Considering the presence of two pairs of congeneric species in the mock community, the Ward algorithm (\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e) for hierarchical clustering was applied to minimize the within-cluster variance.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003e2.8 MAGs Genes Annotation\u003c/h2\u003e \u003cp\u003eAnnotation of both the inferred MAGs with at least medium overall quality and the reference genomes downloaded from ATCC (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.atcc.org/\u003c/span\u003e\u003cspan address=\"https://www.atcc.org/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e, accessed on 1 March 2021) was performed by using Bakta (v1.4.0, --min-contig-length 200) (\u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e50\u003c/span\u003e). The protein length profiles inferred for the MAGs were compared with those obtained for the ATCC reference genomes by performing pairwise Wilcoxon tests. Proteins labelled as hypothetical were excluded from these comparisons. Annotated protein products were compared between reference genomes and obtained MAGs both qualitatively, by counting the predicted protein functions shared by the two and those unique to either the reference or the MAG, and quantitatively, by measuring the Jaccard distance. 
The Jaccard distance was measured by using an \u003cem\u003ein-house\u003c/em\u003e developed Python script.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003e2.9 MAGs quantification\u003c/h2\u003e \u003cp\u003eConsidering the variability in genome size of the species in the mock community (\u003cb\u003eSupplementary Table \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e\u003c/b\u003e) and the fact that an equal amount of genomic DNA was added to the mix, the number of expected genome copies was estimated to infer the expected relative abundances. We estimated the mass of each genome in ng (nGM\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e), considering that the average molar mass of a base pair in dsDNA is 607.4 g/mol.\u003cdiv id=\"Equa\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equa\" name=\"EquationSource\"\u003e\n$$\\:{nGM}_{i}=\\:\\frac{{GenomeLength}_{i}*607.4\\:\\left(\\frac{g}{mol}\\right)}{6.022*{10}^{23}\\:\\left({mol}^{-1}\\right)}*{10}^{9}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eFinally, we estimated the Genome Copy Number for each species \u003cem\u003ei\u003c/em\u003e (GCN\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e), considering that the same amount of genomic DNA was added to the mixture (i.e. 
10 ng):\u003cdiv id=\"Equb\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equb\" name=\"EquationSource\"\u003e\n$$\\:{GCN}_{i}=\\frac{10\\:}{{nGM}_{i}}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eThe estimated genome copies for each species and the corresponding relative abundances are shown in \u003cb\u003eSupplementary Table \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e\u003c/b\u003e.\u003c/p\u003e \u003cp\u003eSubsequently, trimmed reads were mapped to the obtained MAGs by using minimap2 (using the same options listed in section \u003cspan refid=\"Sec6\" class=\"InternalRef\"\u003e2.4\u003c/span\u003e). MAG coverage was estimated by using the samtools coverage function (-ff 1284, to exclude unmapped reads, secondary alignments and reads flagged as duplicates; -d 0, to remove the limit on coverage depth).\u003c/p\u003e \u003c/div\u003e"},{"header":"3. RESULTS","content":"\u003cp\u003e \u003cb\u003e3.1 Sequencing throughput\u003c/b\u003e \u003c/p\u003e \u003cp\u003eThe sequencing data obtained for each technology (Illumina, Nanopore and PacBio), before and after adapter trimming, are shown in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003e\u003cb\u003eStatistics regarding raw and trimmed sequencing data.\u003c/b\u003e For each sequencing technology the following data are shown: i) N. 
of reads: number of reads produced; ii) Yield: total throughput in bases; iii) Min len: minimum read length in bp; iv) Avg len: average read length in bp; v) Median len: median read length in bp; vi) Max len: maximum read length in bp.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"7\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSeq technologies\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eN. 
of reads\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eYield\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eMin len\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eAvg len\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eMedian len\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eMax len\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eIllumina\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e\u0026nbsp;\u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eraw data\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e115,746,688\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e16,353,634,835\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e35\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e141.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e151\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e151\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003etrimmed data\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e108,028,131\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e15,066,924,022\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e50\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e139.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e151\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e151\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eNanopore\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eraw data\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e2,301,340\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e2,806,148,731\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e1,219.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e489\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e918,116\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e 
\u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003etrimmed data\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e2,299,453\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e2,724,939,449\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e1,185\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e455\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e918,116\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003ePacBio\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eraw data\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e111,629\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e805,182,430\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e258\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e7,213\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e6,737\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c7\"\u003e \u003cp\u003e24,210\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003etrimmed data\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e111,626\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e805,164,864\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e258\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e7,213.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e6,737\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e24,210\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eBoth Nanopore and PacBio produced longer reads than Illumina, with average lengths of about 1.2 kb and 7.2 kb, respectively. Nanopore produced the longest read, spanning about 0.9 Mb. In terms of retained reads/bases, 93.33%/92.13% passed the trimming step for Illumina sequencing, compared with 99.91%/97.10% for Nanopore and 99.99%/99.99% for PacBio.\u003c/p\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003e3.2 Genome coverage\u003c/h2\u003e \u003cp\u003eBefore performing the assembly, trimmed reads were mapped to the reference genomes to evaluate the average coverage and sequencing depth. Initially, we estimated the mean coverage of the reference genomes (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). 
The three sequencing technologies yielded markedly different sequencing depths: Illumina produced the highest depth (median 441.01, IQR\u0026thinsp;=\u0026thinsp;152.17, mean 1280.96), one to two orders of magnitude higher than Nanopore (median 29.91, IQR\u0026thinsp;=\u0026thinsp;106.71, mean 88.56) and PacBio (median 12.84, IQR\u0026thinsp;=\u0026thinsp;5.7, mean 18.43). Moreover, all the mock genomes were completely covered by Illumina (median 100, IQR\u0026thinsp;=\u0026thinsp;0, mean 94.43) and Nanopore (median 100, IQR\u0026thinsp;=\u0026thinsp;0, mean 94.43) (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). PacBio (median 100, IQR\u0026thinsp;=\u0026thinsp;0, mean 93.43) fully covered 19 out of 20 genomes (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). Specifically, \u003cem\u003eSchaalia odontolytica\u003c/em\u003e reached a coverage breadth of about 92.4% and a depth of about 2.7X with PacBio sequencing, in contrast to depths of 384X and 28X obtained with Illumina and Nanopore sequencing, respectively (\u003cb\u003eSupplementary Table \u003cspan refid=\"MOESM2\" class=\"InternalRef\"\u003eS2\u003c/span\u003e\u003c/b\u003e).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eWe also evaluated the coverage breadth by measuring the proportion of each reference genome covered at 20X, 30X, 40X and 50X (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). With Illumina sequencing, all the genomes were completely covered at 50X. For long-read sequencing, only \u003cem\u003eEscherichia coli\u003c/em\u003e with Nanopore was completely covered at 40X. 
Overall, for 4 species (namely \u003cem\u003eStreptococcus agalactiae\u003c/em\u003e, \u003cem\u003eStreptococcus mutans\u003c/em\u003e, \u003cem\u003ePhocaeicola vulgatus\u003c/em\u003e and \u003cem\u003eDeinococcus radiodurans\u003c/em\u003e) we observed that less than 50% of the genome was covered at 20X. Finally, with PacBio sequencing none of the genomes was completely covered at 20X.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003e3.3 Assembly Evaluation\u003c/h2\u003e \u003cp\u003eTwo different assembly algorithms were used for each sequencing technology: megaHIT and metaSPAdes for short reads, metaFlye and metaMDBG for long reads. The assembly summary statistics are shown in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003e\u003cb\u003eAssembly statistics for megaHIT, metaSPAdes, metaFlye and metaMDBG\u003c/b\u003e: i) Contig: number of produced contigs; ii) Tot Len: assembly total length in bp; iii) % Expected Tot Len: assembly total length as a percentage of the summed lengths of the reference mock genomes; iv) Min len: shortest contig length in bp; v) Avg len: average contig length in bp; vi) Max len: maximum contig length in bp; vii) N50; viii) %GC.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"9\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" 
colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c9\" colnum=\"9\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAssembly\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eContig\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eTot Len\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003e% Expected Tot Len\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eMin len\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eAvg len\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eMax len\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\"\u003e \u003cp\u003eN50\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c9\"\u003e \u003cp\u003e%GC\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003ctr\u003e \u003cth align=\"left\" colspan=\"9\" nameend=\"c9\" namest=\"c1\"\u003e \u003cp\u003eIllumina\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003emegaHIT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1,135\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e65,990,405\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e98.43%\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e773\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e58,141\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e955,220\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e174,256\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e47.10\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003emetaSPAdes\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e2,144\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e66,587,045\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e99.32%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e120\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e31,057\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e1,405,032\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e232,008\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e47.10\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"9\" nameend=\"c9\" namest=\"c1\"\u003e \u003cp\u003e\u003cb\u003eNanopore\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003emetaFlye\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e113\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e67,286,575\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e100.37%\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"left\" colname=\"c5\"\u003e \u003cp\u003e1,002\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e595,456.40\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e6,374,455\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e2,227,176\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e47.18\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003emetaMDBG\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e4,838\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e80,148,321\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e119.55%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e267\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e16,566.40\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e4,642,452\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e41,003\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e47.19\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"9\" nameend=\"c9\" namest=\"c1\"\u003e \u003cp\u003e\u003cb\u003ePacBio\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003emetaFlye\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e358\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e60,826,078\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e90.73%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c5\"\u003e \u003cp\u003e3,125\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e169,905.20\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e6,374,538\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e1,841,921\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e47.09\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003emetaMDBG\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e581\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e65,931,520\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e98.34%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e1,097\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e113,479.40\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e6,374,527\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e2,032,857\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e47.06\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eIn terms of assembly contiguity, the Illumina assemblies reached N50 values on the order of hundreds of kilobases, regardless of the assembler used. Between the two, metaSPAdes gave the wider contig length range, from 120 bp to 1.4 Mbp (Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). For long reads, the same assembly algorithms behaved differently depending on the data analysed. With Nanopore data, metaFlye achieved an N50 of about 2.2 Mbp, far longer than the 41 kbp obtained with metaMDBG. 
Moreover, with Nanopore data metaMDBG produced the largest number of contigs (4,838), including very short ones (minimum 267 bp) (Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). However, for PacBio data the N50 values obtained with metaFlye and metaMDBG were similar (about 1.8 and 2.0 Mbp, respectively), as were the numbers of contigs. In this case, all contigs produced by both assemblers were longer than 1 kbp (Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). The obtained contigs were evaluated by using metaQUAST, which compares metagenome assemblies based on alignments to the closest reference genome. The number of contigs covering each genome and the observed coverage are shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e. Overall, regardless of the assembly algorithm applied, all the reference genomes were broadly covered with Illumina data, although the assemblies tended to be more fragmented. The lowest coverage (92.6%) was observed for \u003cem\u003ePorphyromonas gingivalis\u003c/em\u003e using megaHIT, which produced 104 contigs. In contrast, \u003cem\u003ePseudomonas aeruginosa\u003c/em\u003e achieved the highest coverage (99.6%) with just 27 contigs using metaSPAdes (\u003cb\u003eSupplementary Table \u003cspan refid=\"MOESM3\" class=\"InternalRef\"\u003eS3\u003c/span\u003e\u003c/b\u003e). Notably, \u003cem\u003eCutibacterium acnes\u003c/em\u003e was assembled at 99.4% coverage with the lowest number of contigs (\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e) using both assemblers. For Nanopore data, metaFlye produced the most contiguous assemblies, with 4 out of 20 genomes assembled at 100% coverage. Specifically, the genome of \u003cem\u003eEscherichia coli\u003c/em\u003e was assembled into a single contig with complete coverage. 
Among the remaining genomes, 7 were assembled with \u0026gt;\u0026thinsp;99.9% coverage, including 3 single-contig assemblies, and 9 were assembled with \u0026gt;\u0026thinsp;98.2% coverage, each comprising at least 3 contigs. By contrast, metaMDBG produced the most fragmented assemblies from Nanopore data (\u003cb\u003eSupplementary Table \u003cspan refid=\"MOESM3\" class=\"InternalRef\"\u003eS3\u003c/span\u003e\u003c/b\u003e). With PacBio sequencing data, 4 out of 20 genomes were assembled at 100% by metaFlye, including the genome of \u003cem\u003eE. coli\u003c/em\u003e in a single contig. In this case, only 3 genomes were reconstructed at \u0026ge;\u0026thinsp;99.9%, while 10 genomes were assembled at \u0026gt;\u0026thinsp;90%. \u003cem\u003eBifidobacterium adolescentis\u003c/em\u003e, \u003cem\u003eBacillus pacificus\u003c/em\u003e and \u003cem\u003eSchaalia odontolytica\u003c/em\u003e had the least complete assemblies, at 60.1%, 48.9% and 14.1%, respectively. metaMDBG produced different results with PacBio data. In particular, 6 out of 20 genomes were covered at 100%, including \u003cem\u003eCutibacterium acnes, Escherichia coli, Helicobacter pylori, Pseudomonas aeruginosa\u003c/em\u003e and \u003cem\u003eStreptococcus mutans\u003c/em\u003e, each assembled in a single contig. Of the remaining genomes, 4 were assembled at \u0026gt;\u0026thinsp;99.9%, and 8 at \u0026gt;\u0026thinsp;95.6% with at least 3 contigs. Finally, compared to the metaFlye assemblies, the assembled genome coverage of \u003cem\u003eB. pacificus\u003c/em\u003e and \u003cem\u003eS. odontolytica\u003c/em\u003e increased to 92.5% and 76.7%, respectively (\u003cb\u003eSupplementary Table \u003cspan refid=\"MOESM3\" class=\"InternalRef\"\u003eS3\u003c/span\u003e\u003c/b\u003e). 
The largest contigs obtained with both long-read sequencing approaches, Nanopore and PacBio, closely matched the length of the \u003cem\u003ePseudomonas aeruginosa\u003c/em\u003e genome (6,374,461 bp).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003e3.4 MAGs evaluation\u003c/h2\u003e \u003cp\u003eContig binning and refinement were performed by using metaWRAP, which relies on MetaBAT2, CONCOCT and MaxBin2 and produces MAGs by integrating the bins obtained with each approach. MAG quality was then evaluated by using CheckM: genomes with completeness\u0026thinsp;\u0026ge;\u0026thinsp;90% and contamination\u0026thinsp;\u0026le;\u0026thinsp;5% were marked as high quality, those with completeness\u0026thinsp;\u0026ge;\u0026thinsp;50% and contamination\u0026thinsp;\u0026le;\u0026thinsp;10% as medium quality, and all others as low quality. The results obtained are summarized in Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e. 
MAGs were taxonomically annotated by using kMetaShot and a phylogenetic tree including reference ATCC genomes was built through GTDBtk.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003e\u003cb\u003eSummary of the number and quality of obtained MAGs for each assembly approach.\u003c/b\u003e High quality Genomes: completeness\u0026thinsp;\u0026ge;\u0026thinsp;90% and contamination\u0026thinsp;\u0026le;\u0026thinsp;5%; Medium quality Genomes: completeness\u0026thinsp;\u0026ge;\u0026thinsp;50% and contamination\u0026thinsp;\u0026le;\u0026thinsp;10%; Low quality Genomes: completeness\u0026thinsp;\u0026lt;\u0026thinsp;50% or contamination\u0026thinsp;\u0026gt;\u0026thinsp;10%\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"3\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAssembly\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003eNumber of MAGs\u003c/em\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cem\u003eQuality of MAGs\u003c/em\u003e\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eIllumina\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e 
\u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003emegaHIT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e18\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e16 high; 2 medium\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003emetaSPAdes\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e20\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e15 high; 5 medium\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eNanopore\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003emetaFlye\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e18\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e18 high\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003emetaMDBG\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e24\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e4 high; 13 medium; 7 Low\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003ePacBio\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e 
\u003cp\u003emetaFlye\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e19\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e17 high; 1 medium; 1 Low\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003emetaMDBG\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e20\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e19 high; 1 medium\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003ekMetaShot was able to classify all the obtained MAGs, regardless of their quality, and the obtained classification corresponded to the expected species. Nonetheless, it is worth noting that \u003cem\u003eRhodobacter sphaeroides\u003c/em\u003e and \u003cem\u003ePropionibacterium acnes\u003c/em\u003e have been renamed as \u003cem\u003eCereibacter sphaeroides\u003c/em\u003e (\u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e51\u003c/span\u003e) and \u003cem\u003eCutibacterium acnes\u003c/em\u003e (\u003cspan citationid=\"CR52\" class=\"CitationRef\"\u003e52\u003c/span\u003e), respectively. Finally, \u003cem\u003eBacillus pacificus ATCC 10987\u003c/em\u003e is annotated in the NCBI taxonomy as \u003cem\u003eBacillus cereus ATCC 10987\u003c/em\u003e, and was labelled accordingly by kMetaShot.\u003c/p\u003e \u003cp\u003eAll the expected genomes were correctly recovered only when assembling Illumina data with metaSPAdes and PacBio data with metaMDBG (Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e). 
In contrast, when using megaHIT we were unable to retrieve genomes of \u003cem\u003eStreptococcus agalactiae\u003c/em\u003e and \u003cem\u003eStaphylococcus epidermidis\u003c/em\u003e, both corresponding to genera represented by two species in the mock community (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e). With Nanopore data processed by metaFlye we obtained 18 high quality MAGs, but the two \u003cem\u003eStaphylococcus spp.\u003c/em\u003e were missing. All the expected species were retrieved by using metaMDBG contigs, although according to kMetaShot classification two MAGs were observed for each of \u003cem\u003eCutibacterium acnes\u003c/em\u003e (high and low), \u003cem\u003eHelicobacter pylori\u003c/em\u003e (medium, low), \u003cem\u003ePorphyromonas gingivalis\u003c/em\u003e (both low) and \u003cem\u003eStaphylococcus epidermidis\u003c/em\u003e (medium, low). This inflated the number of MAGs from the expected 20 to 24 (Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e). 
Regarding PacBio data, by using metaFlye the only missing species was \u003cem\u003eSchaalia odontolytica\u003c/em\u003e. When assembling PacBio data with metaMDBG, all the expected genomes were retrieved, with 19 high quality MAGs and 1 medium quality MAG, the latter represented by \u003cem\u003eSchaalia odontolytica\u003c/em\u003e (Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFurthermore, because both kMetaShot classification and GTDBtk phylogeny are supervised approaches that rely on reference genome collections and taxonomy, an unsupervised clustering and dereplication based on ANI (Average Nucleotide Identity) was performed with dRep. It relies on a two-step approach: a rapid, sketch-based ANI estimate with MASH defines primary clusters (ANI\u0026thinsp;\u0026ge;\u0026thinsp;90%), followed by secondary clustering based on a more precise ANI estimate by fastANI (ANI\u0026thinsp;\u0026ge;\u0026thinsp;95%). Finally, once secondary clustering was complete, a reference genome was defined for each cluster by taking into account genomic features (completeness, contamination, size, and strain heterogeneity) together with assembly and clustering quality metrics (N50 and centrality). Dereplication results excluding low quality MAGs are shown in \u003cb\u003eSupplementary Table \u003cspan refid=\"MOESM4\" class=\"InternalRef\"\u003eS4\u003c/span\u003e\u003c/b\u003e and secondary clustering representative genomes are labelled in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e. dRep identified 20 clusters, one per mock species, and a complete correspondence between clusters and kMetaShot taxonomic classification was observed. 
Regarding the choice of the reference genome for each secondary cluster, 7 out of 20 were chosen among the ATCC reference genomes, while the other 13 were chosen among the MAGs obtained through the assembly of PacBio and Nanopore data (2 Nanopore\u0026thinsp;+\u0026thinsp;metaFlye, 6 PacBio\u0026thinsp;+\u0026thinsp;metaMDBG, and 5 PacBio\u0026thinsp;+\u0026thinsp;metaFlye).\u003c/p\u003e \u003cp\u003eThe genome sizes of the obtained MAGs were compared with those of the reference genomes (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e), using the kMetaShot classification as a guide. Regardless of sequencing technology and assembly method, the genome sizes of \u003cem\u003eRhodobacter sphaeroides\u003c/em\u003e and \u003cem\u003eDeinococcus radiodurans\u003c/em\u003e were overestimated compared to the reference genomes.\u003c/p\u003e \u003cp\u003eFor short reads, \u003cem\u003eBifidobacterium adolescentis\u003c/em\u003e genome size was overestimated with both metaSPAdes and megaHIT, and \u003cem\u003eBacillus pacificus\u003c/em\u003e displayed the same trend when using megaHIT. The genome sizes of \u003cem\u003eBacillus pacificus\u003c/em\u003e, \u003cem\u003eBifidobacterium adolescentis\u003c/em\u003e, \u003cem\u003eHelicobacter pylori\u003c/em\u003e, and \u003cem\u003ePorphyromonas gingivalis\u003c/em\u003e assembled with metaFlye exceeded the expected size. By contrast, \u003cem\u003eAcinetobacter baumannii\u003c/em\u003e, \u003cem\u003ePhocaeicola vulgatus, Staphylococcus aureus\u003c/em\u003e subsp. 
\u003cem\u003eaureus\u003c/em\u003e and \u003cem\u003eStaphylococcus epidermidis\u003c/em\u003e MAGs showed a similar trend in PacBio assemblies regardless of the assembler applied (\u003cb\u003eSupplementary Table \u003cspan refid=\"MOESM5\" class=\"InternalRef\"\u003eS5\u003c/span\u003e\u003c/b\u003e).\u003c/p\u003e \u003cp\u003eShort-read sequencing allowed complete reconstruction, with a 100% match to the reference, of only 2 MAGs, whereas long-read sequencing approaches enabled the binning of 8 complete MAGs with a perfect match using Nanopore data (metaFlye) and 8 complete MAGs using PacBio data (metaMDBG) (\u003cb\u003eSupplementary Table \u003cspan refid=\"MOESM5\" class=\"InternalRef\"\u003eS5\u003c/span\u003e\u003c/b\u003e).\u003c/p\u003e \u003cp\u003eFinally, considering the mock bacterial species belonging to the same genus, we observed that none of the MAGs obtained by assembling Illumina data using megaHIT were identified as \u003cem\u003eStaphylococcus epidermidis\u003c/em\u003e or \u003cem\u003eStaphylococcus aureus\u003c/em\u003e. By contrast, \u003cem\u003eStreptococcus agalactiae\u003c/em\u003e and \u003cem\u003eStreptococcus mutans\u003c/em\u003e were taxonomically identified, although only the latter had a genome size matching 100% of the reference genome. A different result was obtained when using Illumina data assembled with metaSPAdes. In this case, we identified both \u003cem\u003eStaphylococcus\u003c/em\u003e and \u003cem\u003eStreptococcus\u003c/em\u003e species, but with a lower match to their respective reference genomes (\u003cem\u003eS. epidermidis\u003c/em\u003e 55.2% and \u003cem\u003eS. aureus\u003c/em\u003e 40.4%, \u003cem\u003eS. agalactiae\u003c/em\u003e 82.7% and \u003cem\u003eS. mutans\u003c/em\u003e 58.7%) (\u003cb\u003eSupplementary Table \u003cspan refid=\"MOESM5\" class=\"InternalRef\"\u003eS5\u003c/span\u003e\u003c/b\u003e). 
Regarding long-read sequencing approaches, the results obtained with Nanopore data were similar to those observed with short reads. Specifically, no MAGs derived from Nanopore data assembled with metaFlye were identified as \u003cem\u003eS. epidermidis\u003c/em\u003e or \u003cem\u003eS. aureus\u003c/em\u003e. Conversely, the MAGs classified as \u003cem\u003eS. agalactiae\u003c/em\u003e and \u003cem\u003eS. mutans\u003c/em\u003e matched the reference genomes at 99.3% and 98.8%, respectively. However, the analysis of Nanopore sequencing data using metaMDBG allowed the reconstruction and classification of MAGs for both \u003cem\u003eStaphylococcus\u003c/em\u003e and \u003cem\u003eStreptococcus\u003c/em\u003e species, despite a lower correspondence to their reference genomes (\u003cb\u003eSupplementary Table \u003cspan refid=\"MOESM5\" class=\"InternalRef\"\u003eS5\u003c/span\u003e\u003c/b\u003e). In the case of PacBio data assembled with both metaFlye and metaMDBG, \u003cem\u003eStaphylococcus epidermidis\u003c/em\u003e and \u003cem\u003eStaphylococcus aureus\u003c/em\u003e MAGs were taxonomically classified, although with an overestimation in genome size compared to the reference. Meanwhile, \u003cem\u003eStreptococcus agalactiae\u003c/em\u003e MAGs matched the reference genome at 99.7% (metaFlye) and 99.8% (metaMDBG), whereas \u003cem\u003eStreptococcus mutans\u003c/em\u003e matched at 100% (\u003cb\u003eSupplementary Table \u003cspan refid=\"MOESM5\" class=\"InternalRef\"\u003eS5\u003c/span\u003e\u003c/b\u003e).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFurthermore, we used MASH to measure the distance between the obtained MAGs and their corresponding reference genomes (Supplementary Fig.\u0026nbsp;2). Regardless of the sequencing technology or assembly approach applied, high quality MAGs showed a distance from reference genomes below 1%. 
For medium quality MAGs, MASH distances within 2% were observed, with the only exception of \u003cem\u003eStaphylococcus aureus\u003c/em\u003e. In this case, the two MAGs assembled from Illumina data (i.e. MASH distance of 1.9% with megaHIT and 8.7% with metaSPAdes) were the least similar to the reference genome, even when compared to the low quality MAG assembled from Nanopore reads using metaMDBG (1.4%).\u003c/p\u003e \u003cp\u003eFinally, we compared the expected species abundances (\u003cb\u003eSupplementary Table\u0026nbsp;1\u003c/b\u003e) with those inferred from MAGs coverage and relative abundances (\u003cb\u003eSupplementary Fig.\u0026nbsp;3\u003c/b\u003e) across each combination of sequencing technologies and assembly approaches. Regardless of the sequencing strategy and assembly method, substantial discrepancies were observed between observed and expected abundances.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003e3.5 MAGs Genes Annotation\u003c/h2\u003e \u003cp\u003eMedium and high quality MAGs were functionally annotated by using Bakta. To avoid discrepancies due to different annotation pipelines, the ATCC reference genomes were also re-annotated with the same tool. Initially, the numbers of annotated genes of each type (i.e. CDS, tRNA, rRNA, ncRNA and tmRNA) were compared (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e). Overall, a comparable number of CDS was obtained in MAGs and ATCC reference genomes, with few differences. An underestimation of CDS was observed in 8 out of 20 species (namely \u003cem\u003eH. pylori\u003c/em\u003e, \u003cem\u003eL. gasseri\u003c/em\u003e, \u003cem\u003eN. meningitidis\u003c/em\u003e, \u003cem\u003eP. gingivalis\u003c/em\u003e, \u003cem\u003eS. aureus\u003c/em\u003e, \u003cem\u003eS. epidermidis\u003c/em\u003e, \u003cem\u003eS. agalactiae\u003c/em\u003e, and \u003cem\u003eS. 
mutans\u003c/em\u003e) when MAGs obtained from short reads are considered, regardless of the assembly method (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e). Regarding long reads, the number of predicted CDS was not influenced by the assembly method in PacBio data, while for Nanopore data metaMDBG tended to produce less accurate annotation compared to metaFlye (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eUsing MAGs retrieved from long reads, both Nanopore and PacBio, we observed an overall tendency to predict a number of ncRNA genes similar to that observed in reference genomes. Moreover, the number of predicted ncRNA genes in long-read derived MAGs was influenced by the assembly quality and reference genome coverage (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e). For instance, for MAGs obtained by assembling Nanopore data through metaMDBG and assigned to \u003cem\u003eC. beijerinckii\u003c/em\u003e and \u003cem\u003eN. meningitidis\u003c/em\u003e (both medium quality and with a reference genome coverage\u0026thinsp;\u0026le;\u0026thinsp;90%) an underestimation of both rRNA and tRNA was observed. Similarly, \u003cem\u003eS. odontolytica\u003c/em\u003e MAGs obtained with PacBio data were the least accurate in terms of ncRNA gene annotation because of lower MAG completeness (73.12%, \u003cb\u003eSupplementary Table \u003cspan refid=\"MOESM5\" class=\"InternalRef\"\u003eS5\u003c/span\u003e\u003c/b\u003e). An underestimation of annotated ncRNA genes was observed in MAGs obtained from short reads, regardless of the assembly method. 
The extent of the underestimation was associated with both MAG completeness and gene redundancy (\u003cb\u003eSupplementary Table \u003cspan refid=\"MOESM5\" class=\"InternalRef\"\u003eS5\u003c/span\u003e\u003c/b\u003e). Indeed, for both Illumina-derived MAGs classified as \u003cem\u003eL. gasseri\u003c/em\u003e (both medium quality) the number of tRNA genes was underestimated compared to the reference genome and no rRNA genes were identified at all (\u003cb\u003eSupplementary Table \u003cspan refid=\"MOESM6\" class=\"InternalRef\"\u003eS6\u003c/span\u003e\u003c/b\u003e). Regarding gene redundancy, the number of genes for Alanine and Isoleucine tRNAs was underestimated in 14 out of 20 species (\u003cb\u003eSupplementary Table \u003cspan refid=\"MOESM6\" class=\"InternalRef\"\u003eS6\u003c/span\u003e\u003c/b\u003e). Although both \u003cem\u003eA. baumannii\u003c/em\u003e MAGs retrieved from short reads were classified as high quality, each contained just 1 out of 7 expected Alanine tRNA genes (\u003cb\u003eSupplementary Table \u003cspan refid=\"MOESM6\" class=\"InternalRef\"\u003eS6\u003c/span\u003e\u003c/b\u003e). A comparison of these seven genes showed that 6 were identical while one was unique, sharing 71% similarity with the others. The Alanine tRNA gene retrieved in both MAGs corresponded to the unique one in the reference genome (\u003cb\u003eSupplementary Table \u003cspan refid=\"MOESM6\" class=\"InternalRef\"\u003eS6\u003c/span\u003e\u003c/b\u003e).\u003c/p\u003e \u003cp\u003eFurthermore, we also evaluated the quality of protein coding gene annotations by comparing the protein length profiles between MAGs and ATCC reference genomes (Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e). Overall, no relevant differences were observed in the length profiles of MAGs retrieved from short reads compared to those of reference genomes, with the only exception of \u003cem\u003eL. 
gasseri\u003c/em\u003e (medium quality MAGs). Considering MAGs obtained by binning metaMDBG and metaFlye contigs from Nanopore reads, we observed statistically significant differences for 14 and 10 out of 20 species, respectively, with 8 in common (namely \u003cem\u003eC. sphaeroides\u003c/em\u003e, \u003cem\u003eD. radiodurans\u003c/em\u003e, \u003cem\u003eH. pylori\u003c/em\u003e, \u003cem\u003eN. meningitidis\u003c/em\u003e, \u003cem\u003eP. vulgatus, S. odontolytica\u003c/em\u003e, \u003cem\u003eS. agalactiae\u003c/em\u003e, and \u003cem\u003eS. mutans\u003c/em\u003e). Finally, regarding MAGs retrieved from PacBio we observed statistically significant differences in protein length profiles in 4 out of 20 species (\u003cem\u003eB. cereus group\u003c/em\u003e, \u003cem\u003eB. adolescentis\u003c/em\u003e, \u003cem\u003eC. beijerinckii\u003c/em\u003e, and \u003cem\u003eS. odontolytica\u003c/em\u003e). Specifically, for all these 4 species metaMDBG MAGs were evaluated as medium quality, while the only MAG obtained with metaFlye (\u003cem\u003eC. beijerinckii\u003c/em\u003e) reached a high-quality classification.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eFinally, we evaluated the annotated protein genes by comparing the predicted functions in ATCC reference genomes to those in the obtained MAGs. A qualitative representation of the obtained results is available in Fig.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003e. PacBio-derived MAGs were the closest to the reference genomes (metaFlye Jaccard: mean 4.02%, median 0.65%, metaMDBG Jaccard: mean 4.39%, median 1.10%) with only three species showing relevant differences: \u003cem\u003eS. odontolytica\u003c/em\u003e (metaMDBG, medium quality genome), \u003cem\u003eB. pacificus\u003c/em\u003e (B. cereus group, metaMDBG) and \u003cem\u003eB. adolescentis\u003c/em\u003e (metaFlye). 
These data are consistent with the observed reference genome coverage and assembly quality (Supplementary Table\u0026nbsp;4). Furthermore, three metaFlye assembled MAGs, classified as \u003cem\u003eS. mutans, P. aeruginosa\u003c/em\u003e and \u003cem\u003eE. coli\u003c/em\u003e, were the only ones to achieve a Jaccard distance of 0 from the reference.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eAcceptable results were also obtained with Illumina-derived MAGs (metaSPAdes Jaccard mean: 12.03% median: 4.22%; megaHIT Jaccard mean: 6.42% median: 3.70%). The species with the largest distance from the reference genome were \u003cem\u003eStaphylococcus spp.\u003c/em\u003e, \u003cem\u003eStreptococcus spp.\u003c/em\u003e and \u003cem\u003eL. gasseri\u003c/em\u003e. Nanopore data assembled through metaFlye gave results comparable to those obtained with short reads (Jaccard mean: 8.73% median: 6.75%). \u003cem\u003eH. pylori\u003c/em\u003e MAGs showed the largest dissimilarity from the reference genome (41.2%). Finally, Nanopore MAGs obtained through metaMDBG assembly were the furthest from the reference genomes (Jaccard mean: 24.00% median: 26.05%).\u003c/p\u003e \u003c/div\u003e"},{"header":"4. DISCUSSION","content":"\u003cp\u003eOur understanding of the complex network between microorganisms and the surrounding environment, including the diversity, structure and dynamics of microbial communities, is still incomplete, due to the challenges that metagenomic studies face during library preparation, sequencing and analysis steps (\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e, \u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e53\u003c/span\u003e). Shotgun metagenomics is an informative approach to rapidly obtain a compositional profiling of the investigated microbial community (i.e. reference based) or to retrieve nearly complete microorganism genomes (i.e. assembly based) or MAGs. 
The latter is gaining ever-growing interest in the research community, also due to decreasing sequencing costs (\u003cspan citationid=\"CR54\" class=\"CitationRef\"\u003e54\u003c/span\u003e). In this benchmark study the impact of different sequencing technologies and metagenome assemblers on high-quality MAG retrieval has been thoroughly evaluated. Cutting-edge sequencing platforms, NovaSeq 6000 (Illumina), GridION (Nanopore) and Sequel IIe (PacBio), have been employed in this work. Moreover, functional annotation and concordance between reference and retrieved genomes have been investigated to find the best performing sequencer-assembler combination. The aspects discussed here provide an updated point of view with respect to previous works (\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e, \u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e). A commercial mock composed of 20 bacterial species belonging to 18 genera was used, with a total DNA yield of 200 ng. Limited total DNA yield and the presence of phylogenetically related species (i.e. co-generic) reflect conditions that may occur naturally in biological samples. Moreover, this mock community comprises a limited number of microorganisms, which in turn allows a more thorough investigation of technical aspects. For short-read sequencing (Illumina), a method involving chemical DNA fragmentation and a PCR-based amplification step during library preparation was used. As a result, Illumina generated the highest data output, reaching hundreds of millions of reads, a coverage of 100X and the highest sequencing depth for all mock genomes. 
At the same time, we selected the available amplification-free protocols for long-read sequencing to align with the goal of single-molecule sequencing, producing reads longer than 10 kb and removing amplification bias while preserving base modifications (\u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e55\u003c/span\u003e). The ligation sequencing protocol without a fragmentation step was chosen for Nanopore, whereas for PacBio we sequenced medium-sized fragments obtained by mechanical fragmentation (peak size of about 12 kbp). In both cases, the input amount used did not completely meet protocol requirements. Nonetheless, we obtained an adequate number of reads to perform bioinformatic analysis, with Nanopore producing a higher yield and longer reads compared to PacBio. PacBio HiFi sequencing generated fewer reads overall, but the trimming procedures did not affect their number, reflecting the inherently high base accuracy of HiFi data. This observation is consistent with recent literature showing that, although Nanopore can produce longer and more abundant reads, PacBio HiFi delivers highly accurate long reads that ultimately support superior genome reconstruction and higher-quality metagenome assemblies (\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e, \u003cspan citationid=\"CR56\" class=\"CitationRef\"\u003e56\u003c/span\u003e). The reduced number of reads obtained with PacBio sequencing probably affected the coverage and sequencing depth exclusively for the species \u003cem\u003eS. odontolytica\u003c/em\u003e, compared to the other technologies. 
The strong impact of sequencing depth and uneven coverage on MAG recovery that we observed for this low-abundance taxon is consistent with previous studies highlighting coverage as a major limiting factor for metagenome assembly and genome-resolved metagenomics (\u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e55\u003c/span\u003e, \u003cspan citationid=\"CR59\" class=\"CitationRef\"\u003e59\u003c/span\u003e). The effort invested in library preparation and in maximizing sequencing yield was crucial to avoid insufficient sequencing depth and coverage, which is a critical aspect of metagenome reconstruction (\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e) and can influence the performance of meta-assembly tools (\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e). In fact, metagenome assembly is the crucial step to retrieve near complete and high-quality MAGs. Two main algorithmic approaches have been used in this benchmark: de Bruijn graphs (\u003cspan citationid=\"CR57\" class=\"CitationRef\"\u003e57\u003c/span\u003e) and repeat graphs (\u003cspan citationid=\"CR58\" class=\"CitationRef\"\u003e58\u003c/span\u003e). The first is principally used for short-read assembly, but applications to third generation sequencing technologies are also available (\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e). The second is particularly suitable for long-read assembly. Here, two tools for each sequencing technology have been employed to consider the impact of the assembly step. For Illumina short reads, the de Bruijn graph-based tools megaHIT and metaSPAdes have been used, which differ in terms of algorithm heuristics and optimizations. Overall, metaSPAdes produced assemblies with both a larger N50 and longer contigs than megaHIT. The metaSPAdes assembly allowed the retrieval of all the expected species, with a higher quality than megaHIT. 
In general, metaSPAdes performed slightly better and more precisely than megaHIT, which also missed MAGs for two expected species (i.e. \u003cem\u003eS. agalactiae\u003c/em\u003e, \u003cem\u003eS. epidermidis\u003c/em\u003e). For long-read data the repeat graph-based tool metaFlye and the de Bruijn graph-based tool metaMDBG were employed. We chose two alternative approaches because it is well known that string-graph based approaches are poorly able to capture low-abundance microorganisms and strain heterogeneity, which negatively impacts MAG quality (\u003cspan citationid=\"CR59\" class=\"CitationRef\"\u003e59\u003c/span\u003e). On the other hand, the de Bruijn graph relies on exact k-mer matching, which is affected by the lower accuracy of long reads compared to short ones. MetaFlye on Nanopore and metaMDBG on PacBio assemblies reached the highest N50 value (~\u0026thinsp;2 Mbp) and, consistently, obtained the highest number of high-quality MAGs. This higher yield of high-quality MAGs from long-read data is consistent with previous HiFi- and Nanopore-based genome-resolved metagenomic studies, which showed that long and accurate reads substantially increase the number and quality of recovered MAGs compared to short-read assemblies (\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e). Furthermore, while Nanopore with both assemblers and PacBio with metaFlye missed MAGs for some expected species, metaMDBG on PacBio retrieved MAGs for all the mock community species. Thus, the Illumina-metaSPAdes and PacBio-metaMDBG combinations capture more biodiversity than the others by distinguishing co-generic species. By contrast, both Illumina-megaHIT and Nanopore-metaMDBG fail to discriminate co-generic genomes. Regarding assembly contiguity, the combination of PacBio and metaMDBG obtained the highest number of reference genomes (i.e. 12) covered by the lowest number of contigs, followed by Nanopore and metaFlye (i.e. 
8) (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). This suggests that the repeat graph algorithm of metaFlye can compensate for the lower read quality and coverage of Nanopore data, while the sparse de Bruijn graph of metaMDBG performs better with long high-quality reads from PacBio. In line with earlier platform comparisons, third-generation sequencing generally improves assembly contiguity compared to Illumina short reads (\u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e55\u003c/span\u003e). Notably, when comparing the obtained MAGs with their reference genomes using sketch-based distances, higher MAG quality corresponds to lower measured distances. Dereplication of obtained MAGs and reference ATCC genomes through ANI estimation produced intriguing results, with 13 out of 20 reference genomes chosen among MAGs generated from long reads (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, \u003cb\u003eSupplementary Table \u003cspan refid=\"MOESM4\" class=\"InternalRef\"\u003eS4\u003c/span\u003e\u003c/b\u003e). This further suggests that the retrieved MAGs, particularly those from the PacBio-metaMDBG combination, can be more accurate than their reference genome counterparts. In addition, MAG sizes were compared with those of the reference genomes (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e). Overall, Illumina data tended to misestimate genome sizes, while Nanopore assemblies were influenced by the applied assembler. PacBio data were more consistent with the reference genomes and less affected by the assembly approach. Regardless of sequencing technology and assembly approach, \u003cem\u003eD. radiodurans\u003c/em\u003e and \u003cem\u003eR. sphaeroides\u003c/em\u003e genome sizes were overestimated. This finding could be due to the overall quality of the reference genomes, which are both fragmented. 
In fact, in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, two PacBio-metaFlye MAGs were chosen as centroids by dRep dereplication for these two species. The key novelty of this work lies in the comprehensive evaluation of how different sequencing technologies and combined assembly strategies influence genome annotation, specifically in terms of the number of annotated genes, the length of predicted proteins, and the number and types of inferred functions. First, regarding gene prediction, MAGs from short reads and from Nanopore coupled with metaMDBG tend to underestimate CDS and non-coding genes (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e), principally due to MAG incompleteness and gene redundancy. Nonetheless, all copies of redundant genes, such as alanine tRNA genes, can be annotated from PacBio MAGs as well as from Nanopore-metaFlye MAGs. Second, regarding protein length distribution, Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e intriguingly shows a statistically significant underestimation of protein length for MAGs from long reads. In particular, 14, 10, 4, 1 and 1 MAGs showed protein length underestimation for the Nanopore-metaMDBG, Nanopore-metaFlye, PacBio-metaMDBG, PacBio-metaFlye and Illumina-metaSPAdes combinations, respectively, highlighting the impact of MAG completeness on functional annotation (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e). It is worth noting that Liu et al. applied five rounds of genome polishing to retrieve highly accurate MAGs when assembling ONT reads with metaFlye and binning with metaWRAP (\u003cspan citationid=\"CR60\" class=\"CitationRef\"\u003e60\u003c/span\u003e). Polishing is a bioinformatic procedure that aims to reduce misassemblies by comparing contigs with raw reads. 
Indeed, our results confirm what has already been reported in the literature, namely that ONT assemblies are less accurate than PacBio ones (\u003cspan citationid=\"CR58\" class=\"CitationRef\"\u003e58\u003c/span\u003e). To reduce the impact of misassemblies in genomic projects, several rounds of polishing are suggested, also employing short reads (\u003cspan citationid=\"CR61\" class=\"CitationRef\"\u003e61\u003c/span\u003e). As demonstrated by Liu et al., polishing metagenomic data is computationally expensive; for this reason we decided not to include this step in our analysis, although its absence may explain the lower accuracy of gene annotation in ONT data compared to other technologies. Finally, regarding the comparison of inferred functions from reference genomes \u003cem\u003evs\u003c/em\u003e MAGs, the Nanopore-metaMDBG combination performed worst, followed by Illumina, Nanopore-metaFlye and PacBio-metaMDBG with similar performances. In contrast, PacBio-metaFlye showed the best performance and perfect concordance for 3 MAGs.\u003c/p\u003e"},{"header":"5. CONCLUSIONS","content":"\u003cp\u003eHigh-throughput sequencing techniques have revolutionized the field of microbial ecology. Constant improvements in shotgun metagenomics protocols and assemblers hold great potential for advancing our understanding, thanks to the rise of long-read sequencing approaches (\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e, \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e, \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e, \u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e53\u003c/span\u003e). However, assembling all genomes of all bacteria within a single microbial niche remains challenging (\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e). 
Incomplete reference genomes with gaps, horizontal gene transfer between species, and the obstacles posed by low-biomass samples, such as tissue microbiomes, are all challenges that still need to be overcome to achieve genome-resolved metagenomics (\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e, \u003cspan citationid=\"CR62\" class=\"CitationRef\"\u003e62\u003c/span\u003e). The goal of the scientific community is to obtain comprehensive insights into microbial diversity, genomic function and interactions among microorganisms. From this perspective, this benchmark study offers a comparative framework of currently available sequencing strategies in the metagenomic field and their limitations. Our findings may contribute to future methodological advancements aimed at improving strain-resolved metagenome assemblies for microbial characterization, even in highly complex environments.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cp\u003eHTS \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;high-throughput sequencing technologies\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eMAGs \u0026nbsp; \u0026nbsp; \u0026nbsp; Metagenome-assembled Genomes\u003c/p\u003e\n\u003cp\u003eANI \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; Average Nucleotide Identity\u003c/p\u003e\n\u003cp\u003eCDS \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;protein-coding DNA sequence\u0026nbsp;\u003c/p\u003e\n\u003cp\u003etRNA \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;transfer RNA\u003c/p\u003e\n\u003cp\u003erRNA \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;ribosomal RNA\u003c/p\u003e\n\u003cp\u003etmRNA \u0026nbsp; \u0026nbsp; transfer-messenger RNA\u003c/p\u003e\n\u003cp\u003encRNA \u0026nbsp; \u0026nbsp; \u0026nbsp;non-coding RNA\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot 
applicable.\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003cstrong\u003eConsent for publication\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e\u003cstrong\u003eAvailability of data and materials\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe datasets generated during the current study are available in the ENA repository, reference number BioProject PRJEB89875.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare that they have no competing interests.\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis work was supported by projects: Life Science Hub Regione Puglia (LSH-Puglia, T4-AN-01 H93C22000560003), INNOVA - Italian network of excellence for advanced diagnosis (PNC-EJ-2022-23683266 PNC-HLS-DA), DARE - DigitAl lifelong pRevEntion initiative (PNC-I.1 \u0026quot;Research initiatives for innovative technologies and pathways in the health and welfare sector\u0026rdquo; D.D. 931 of 06/06/2022, \u0026nbsp;code PNC0000002, CUP: B53C22006420001), and by ELIXIR-IT through the PNRR Project ELIXIRxNextGenIT - ELIXIR x NextGenerationIT: consolidation of the Italian Infrastructure for Omics Data and Bioinformatics (Grant Code IR0000010, CUP:B53C22000690005).\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003cstrong\u003eAuthors\u0026apos; contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eConceptualization, G.V., E.N., B.F., M.M. and G.P.; methodology, G.V., E.N., G. D, M.F.C., B.F. and M.M.; validation, G.V., E.N., G. D, \u0026nbsp;B.F. and M.M.; formal analysis, G.V., E.N., G. D, \u0026nbsp; B.F. and M.M.; investigation, G.V., E.N., G. D, \u0026nbsp;B.F. and M.M.; resources, G.P.; data curation, G.V., E.N., G. D, \u0026nbsp;B.F. and M.M.; writing\u0026mdash;original draft preparation, G.V., E.N., G. 
D, \u0026nbsp;B.F. and M.M. writing\u0026mdash;review and editing, G.V., E.N., G. D, M.F.C., B.F., M.M. and G.P.; visualization, E.N., G.V. and G.D.; supervision, G.P.; project administration, B.F. and M.M.; funding acquisition, G.P. All authors have read and agreed to the published version of the manuscript.\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable\u0026nbsp;\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eStaley, J. T. \u0026amp; Konopka, A. measurement of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats. \u003cem\u003eAnnu. Rev. Microbiol.\u003c/em\u003e \u003cb\u003e39\u003c/b\u003e (1), 321\u0026ndash;346 (1985).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eP\u0026eacute;rez-Cobas, A. E., Gomez-Valero, L. \u0026amp; Buchrieser, C. Metagenomic approaches in microbial ecology: an update on whole-genome and marker gene sequencing analyses. \u003cem\u003eMicrob. Genomics\u003c/em\u003e ;\u003cb\u003e6\u003c/b\u003e(8). (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBharti, R. \u0026amp; Grimm, D. G. Current challenges and best-practice protocols for microbiome analysis. \u003cem\u003eBrief. Bioinform.\u003c/em\u003e \u003cb\u003e22\u003c/b\u003e (1), 178\u0026ndash;193 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePurushothaman, S., Meola, M. \u0026amp; Egli, A. Combination of Whole Genome Sequencing and Metagenomics for Microbiological Diagnostics. \u003cem\u003eIJMS\u003c/em\u003e \u003cb\u003e23\u003c/b\u003e (17), 9834 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNotario, E. et al. Amplicon-Based Microbiome Profiling: From Second- to Third-Generation Sequencing for Higher Taxonomic Resolution. 
\u003cem\u003eGenes\u003c/em\u003e \u003cb\u003e14\u003c/b\u003e (8), 1567 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSmillie, C. S. et al. Ecology drives a global network of gene exchange connecting the human microbiome. \u003cem\u003eNature\u003c/em\u003e \u003cb\u003e480\u003c/b\u003e (7376), 241\u0026ndash;244 (2011).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi, C., Chen, J. \u0026amp; Li, S. C. Understanding Horizontal Gene Transfer network in human gut microbiota. \u003cem\u003eGut Pathog\u003c/em\u003e. \u003cb\u003e12\u003c/b\u003e (1), 33 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJiang, Y. et al. GutMetaNet: an integrated database for exploring horizontal gene transfer and functional redundancy in the human gut microbiome. \u003cem\u003eNucleic Acids Res.\u003c/em\u003e \u003cb\u003e53\u003c/b\u003e (D1), D772\u0026ndash;D782 (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCarabeo-P\u0026eacute;rez, A., Guerra-Rivera, G., Ramos-Leal, M. \u0026amp; Jim\u0026eacute;nez-Hern\u0026aacute;ndez, J. Metagenomic approaches: effective tools for monitoring the structure and functionality of microbiomes in anaerobic digestion systems. \u003cem\u003eAppl. Microbiol. Biotechnol.\u003c/em\u003e \u003cb\u003e103\u003c/b\u003e (23\u0026ndash;24), 9379\u0026ndash;9390 (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSilverstein, M. R., Segr\u0026egrave;, D. \u0026amp; Bhatnagar, J. M. Environmental microbiome engineering for the mitigation of climate change. \u003cem\u003eGlob. Change Biol.\u003c/em\u003e \u003cb\u003e29\u003c/b\u003e (8), 2050\u0026ndash;2066 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAnyansi, C., Straub, T. J., Manson, A. L., Earl, A. M. \u0026amp; Abeel, T. Computational Methods for Strain-Level Microbial Detection in Colony and Metagenome Sequencing Data. \u003cem\u003eFront. 
Microbiol.\u003c/em\u003e \u003cb\u003e11\u003c/b\u003e, 1925 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLapidus, A. L. \u0026amp; Korobeynikov, A. I. Metagenomic Data Assembly \u0026ndash; The Way of Decoding Unknown Microorganisms. \u003cem\u003eFront. Microbiol.\u003c/em\u003e \u003cb\u003e12\u003c/b\u003e, 613791 (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePinto, Y. \u0026amp; Bhatt, A. S. Sequencing-based analysis of microbiomes. \u003cem\u003eNat. Rev. Genet.\u003c/em\u003e \u003cb\u003e25\u003c/b\u003e (12), 829\u0026ndash;845 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKim, N. et al. Genome-resolved metagenomics: a game changer for microbiome medicine. \u003cem\u003eExp. Mol. Med.\u003c/em\u003e \u003cb\u003e56\u003c/b\u003e (7), 1501\u0026ndash;1512 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eThe Genome Standards Consortium et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. \u003cem\u003eNat. Biotechnol.\u003c/em\u003e \u003cb\u003e35\u003c/b\u003e (8), 725\u0026ndash;731 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDefazio, G., Tangaro, M. A., Pesole, G. \u0026amp; Fosso, B. kMetaShot: a fast and reliable taxonomy classifier for metagenome-assembled genomes. \u003cem\u003eBrief. Bioinform.\u003c/em\u003e \u003cb\u003e26\u003c/b\u003e (1), bbae680 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSabina, J. \u0026amp; Leamon, J. H. Bias in Whole Genome Amplification: Causes and Considerations. In: (ed Kroneis, T.) Whole Genome Amplification [. New York, NY: Springer New York; 15\u0026ndash;41. (Methods in Molecular Biology; vol. 1347). (2015).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNelson, M. T. et al. 
Human and Extracellular DNA Depletion for Metagenomic Analysis of Complex Clinical Infection Samples Yields Optimized Viable Microbiome Profiles. \u003cem\u003eCell. Rep.\u003c/em\u003e \u003cb\u003e26\u003c/b\u003e (8), 2227\u0026ndash;2240e5 (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePereira-Marques, J. et al. Impact of Host DNA and Sequencing Depth on the Taxonomic Resolution of Whole Metagenome Sequencing for Microbiome Analysis. \u003cem\u003eFront. Microbiol.\u003c/em\u003e \u003cb\u003e10\u003c/b\u003e, 1277 (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMcArdle, A. J. \u0026amp; Kaforou, M. Sensitivity of shotgun metagenomics to host DNA: abundance estimates depend on bioinformatic tools and contamination is the main issue. \u003cem\u003eAccess. Microbiol.\u003c/em\u003e ;\u003cb\u003e2\u003c/b\u003e(4). (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLatorre-P\u0026eacute;rez, A., Pascual, J., Porcar, M. \u0026amp; Vilanova, C. A lab in the field: applications of real-time, in situ metagenomic sequencing. \u003cem\u003eBiology Methods Protocols\u003c/em\u003e. \u003cb\u003e5\u003c/b\u003e (1), bpaa016 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKim, H. J. et al. Microbial profiling of peri-implantitis compared to the periodontal microbiota in health and disease using 16S rRNA sequencing. \u003cem\u003eJ. Periodontal Implant Sci.\u003c/em\u003e \u003cb\u003e53\u003c/b\u003e (1), 69 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eArredondo, A. et al. Comparative 16S rRNA gene sequencing study of subgingival microbiota of healthy subjects and patients with periodontitis from four different countries. \u003cem\u003eJ. Clin. Periodontology\u003c/em\u003e. \u003cb\u003e50\u003c/b\u003e (9), 1176\u0026ndash;1187 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMarzano, M. et al. 
Farnesoid X receptor activation by the novel agonist TC-100 (3α, 7α, 11β-Trihydroxy-6α-ethyl-5β-cholan-24-oic Acid) preserves the intestinal barrier integrity and promotes intestinal microbial reshaping in a mouse model of obstructed bile acid flow. \u003cem\u003eBiomed. Pharmacother.\u003c/em\u003e \u003cb\u003e153\u003c/b\u003e, 113380 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTumolo, M. et al. Linking feed, biodiversity, and filtration performance in a Self-Forming Dynamic Membrane BioReactor (SFD MBR) treating canning wastewater. \u003cem\u003eJ. Water Process. Eng.\u003c/em\u003e \u003cb\u003e66\u003c/b\u003e, 106031 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRoy, G., Prifti, E., Belda, E. \u0026amp; Zucker, J. D. Deep learning methods in metagenomics: a review. \u003cem\u003eMicrob. Genomics\u003c/em\u003e ;\u003cb\u003e10\u003c/b\u003e(4). (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBen Khedher, M., Ghedira, K., Rolain, J. M., Ruimy, R. \u0026amp; Croce, O. Application and Challenge of 3rd Generation Sequencing for Clinical Bacterial Studies. \u003cem\u003eIJMS\u003c/em\u003e \u003cb\u003e23\u003c/b\u003e (3), 1395 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGovender, K. N. \u0026amp; Eyre, D. W. Benchmarking taxonomic classifiers with Illumina and Nanopore sequence data for clinical metagenomic diagnostic applications. \u003cem\u003eMicrob. Genomics\u003c/em\u003e ;\u003cb\u003e8\u003c/b\u003e(10). (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMeslier, V. et al. Benchmarking second and third-generation sequencing platforms for microbial metagenomics. \u003cem\u003eSci. Data\u003c/em\u003e. \u003cb\u003e9\u003c/b\u003e (1), 694 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBokulich, N. A., Ziemski, M., Robeson, M. S. \u0026amp; Kaehler, B. D. 
Measuring the microbiome: Best practices for developing and benchmarking microbiomics methods. \u003cem\u003eComput. Struct. Biotechnol. J.\u003c/em\u003e \u003cb\u003e18\u003c/b\u003e, 4048\u0026ndash;4062 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBolger, A. M., Lohse, M. \u0026amp; Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. \u003cem\u003eBioinformatics\u003c/em\u003e \u003cb\u003e30\u003c/b\u003e (15), 2114\u0026ndash;2120 (2014).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi, D., Liu, C. M., Luo, R., Sadakane, K. \u0026amp; Lam, T. W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct \u003cem\u003ede Bruijn\u003c/em\u003e graph. \u003cem\u003eBioinformatics\u003c/em\u003e \u003cb\u003e31\u003c/b\u003e (10), 1674\u0026ndash;1676 (2015).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNurk, S., Meleshko, D., Korobeynikov, A. \u0026amp; Pevzner, P. A. metaSPAdes: a new versatile metagenomic assembler. \u003cem\u003eGenome Res.\u003c/em\u003e \u003cb\u003e27\u003c/b\u003e (5), 824\u0026ndash;834 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLeger, A. \u0026amp; Leonardi, T. pycoQC, interactive quality control for Oxford Nanopore Sequencing. \u003cem\u003eJOSS\u003c/em\u003e \u003cb\u003e4\u003c/b\u003e (34), 1236 (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBonenfant, Q., No\u0026eacute;, L. \u0026amp; Touzet, H. Porechop_ABI: discovering unknown adapters in Oxford Nanopore Technology sequencing reads for downstream trimming. Zhang Z, editor. Bioinformatics Advances. ;3(1):vbac085. (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKolmogorov, M. et al. metaFlye: scalable long-read metagenome assembly using repeat graphs. \u003cem\u003eNat. Methods\u003c/em\u003e. 
\u003cb\u003e17\u003c/b\u003e (11), 1103\u0026ndash;1110 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBenoit, G. et al. High-quality metagenome assembly from long accurate reads with metaMDBG. \u003cem\u003eNat. Biotechnol.\u003c/em\u003e \u003cb\u003e42\u003c/b\u003e (9), 1378\u0026ndash;1383 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMartin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. \u003cem\u003eEMBnet j.\u003c/em\u003e \u003cb\u003e17\u003c/b\u003e (1), 10 (2011).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGurevich, A., Saveliev, V., Vyahhi, N. \u0026amp; Tesler, G. QUAST: quality assessment tool for genome assemblies. \u003cem\u003eBioinformatics\u003c/em\u003e \u003cb\u003e29\u003c/b\u003e (8), 1072\u0026ndash;1075 (2013).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShen, W., Le, S., Li, Y. \u0026amp; Hu, F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. \u003cem\u003ePLoS ONE\u003c/em\u003e \u003cb\u003e11\u003c/b\u003e (10), e0163962 (2016).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eUritskiy, G. V., DiRuggiero, J. \u0026amp; Taylor, J. MetaWRAP\u0026mdash;a flexible pipeline for genome-resolved metagenomic data analysis. \u003cem\u003eMicrobiome\u003c/em\u003e \u003cb\u003e6\u003c/b\u003e (1), 158 (2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKang, D. D. et al. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. \u003cem\u003ePeerJ\u003c/em\u003e \u003cb\u003e7\u003c/b\u003e, e7359 (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWu, Y. W., Simmons, B. A. \u0026amp; Singer, S. W. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. 
\u003cem\u003eBioinformatics\u003c/em\u003e \u003cb\u003e32\u003c/b\u003e (4), 605\u0026ndash;607 (2016).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAlneberg, J. et al. \u003cem\u003eCONCOCT: Clustering cONtigs on COverage and ComposiTion\u003c/em\u003e (arXiv, 2013).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eParks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. \u0026amp; Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. \u003cem\u003eGenome Res.\u003c/em\u003e \u003cb\u003e25\u003c/b\u003e (7), 1043\u0026ndash;1055 (2015).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOndov, B. D. et al. Mash: fast genome and metagenome distance estimation using MinHash. \u003cem\u003eGenome Biol.\u003c/em\u003e \u003cb\u003e17\u003c/b\u003e (1), 132 (2016).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChaumeil, P. A., Mussig, A. J., Hugenholtz, P. \u0026amp; Parks, D. H. GTDB-Tk v2: memory friendly classification with the genome taxonomy database. Borgwardt K, editor. Bioinformatics. ;38(23):5315\u0026ndash;6. (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOlm, M. R., Brown, C. T., Brooks, B. \u0026amp; Banfield, J. F. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. \u003cem\u003eISME J.\u003c/em\u003e \u003cb\u003e11\u003c/b\u003e (12), 2864\u0026ndash;2868 (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWard, J. H. Hierarchical Grouping to Optimize an Objective Function. \u003cem\u003eJ. Am. Stat. Assoc.\u003c/em\u003e \u003cb\u003e58\u003c/b\u003e (301), 236\u0026ndash;244 (1963).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSchwengers, O. et al. 
Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. \u003cem\u003eMicrob. Genomics\u003c/em\u003e \u003cb\u003e7\u003c/b\u003e (11) (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eH\u0026ouml;rdt, A. et al. Analysis of 1,000\u0026thinsp;+\u0026thinsp;Type-Strain Genomes Substantially Improves Taxonomic Classification of Alphaproteobacteria. \u003cem\u003eFront. Microbiol.\u003c/em\u003e \u003cb\u003e11\u003c/b\u003e, 468 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNouioui, I. et al. Genome-Based Taxonomic Classification of the Phylum Actinobacteria. \u003cem\u003eFront. Microbiol.\u003c/em\u003e \u003cb\u003e9\u003c/b\u003e, 2007 (2018).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFilardo, S., Di Pietro, M. \u0026amp; Sessa, R. Current progresses and challenges for microbiome research in human health: a perspective. \u003cem\u003eFront. Cell. Infect. Microbiol.\u003c/em\u003e \u003cb\u003e14\u003c/b\u003e, 1377012 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVan Rossum, T., Ferretti, P., Maistrenko, O. M. \u0026amp; Bork, P. Diversity within species: interpreting strains in microbiomes. \u003cem\u003eNat. Rev. Microbiol.\u003c/em\u003e \u003cb\u003e18\u003c/b\u003e (9), 491\u0026ndash;506 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAmarasinghe, S. L. et al. Opportunities and challenges in long-read sequencing data analysis. \u003cem\u003eGenome Biol.\u003c/em\u003e \u003cb\u003e21\u003c/b\u003e (1), 30 (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBein, B. et al. Long-read sequencing and genome assembly of natural history collection samples and challenging specimens. 
\u003cem\u003eGenome Biol.\u003c/em\u003e \u003cb\u003e26\u003c/b\u003e (1), 25 (2025).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCompeau, P. E. C., Pevzner, P. A. \u0026amp; Tesler, G. How to apply de Bruijn graphs to genome assembly. \u003cem\u003eNat. Biotechnol.\u003c/em\u003e \u003cb\u003e29\u003c/b\u003e (11), 987\u0026ndash;991 (2011).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKolmogorov, M., Yuan, J., Lin, Y. \u0026amp; Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. \u003cem\u003eNat. Biotechnol.\u003c/em\u003e \u003cb\u003e37\u003c/b\u003e (5), 540\u0026ndash;546 (2019).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBickhart, D. M. et al. Generating lineage-resolved, complete metagenome-assembled genomes from complex microbial communities. \u003cem\u003eNat. Biotechnol.\u003c/em\u003e \u003cb\u003e40\u003c/b\u003e (5), 711\u0026ndash;719 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu, L., Yang, Y., Deng, Y. \u0026amp; Zhang, T. Nanopore long-read-only metagenomics enables complete and high-quality genome reconstruction from mock and complex metagenomes. \u003cem\u003eMicrobiome\u003c/em\u003e \u003cb\u003e10\u003c/b\u003e (1), 209 (2022).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLuan, T. et al. Benchmarking short and long read polishing tools for nanopore assemblies: achieving near-perfect genomes for outbreak isolates. \u003cem\u003eBMC Genom.\u003c/em\u003e \u003cb\u003e25\u003c/b\u003e (1), 679 (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHoang, M. T. V., Irinyi, L., Hu, Y., Schwessinger, B. \u0026amp; Meyer, W. Long-Reads-Based Metagenomics in Clinical Diagnosis With a Special Focus on Fungal Infections. \u003cem\u003eFront. 
Microbiol.\u003c/em\u003e \u003cb\u003e12\u003c/b\u003e, 708550 (2022).\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Shotgun metagenomics, microbiome, next-generation sequencing, third-generation sequencing, MAGs, functional analysis, mock analysis","lastPublishedDoi":"10.21203/rs.3.rs-7581938/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7581938/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e \u003cp\u003eTwo culture-independent methods, amplicon-based sequencing and shotgun metagenomics, have significantly advanced the study of microbial communities. To date, short-read sequencing technologies have enabled high accuracy and deep coverage, while long-read sequencing approaches are increasingly being applied to improve genome assembly, despite challenges related to sequencing accuracy and nucleic acids input requirements. 
In this benchmark study, we compared the shotgun metagenomics approach across three sequencing technologies, Illumina (short reads), PacBio and Nanopore (long reads), using a commercial microbial community consisting of 20 known species. Specifically, we evaluated the effectiveness of the data generated by each platform in reconstructing and identifying specific known taxa, as well as in understanding their genetic and functional potential, considering annotated genes, length of predicted proteins and number/types of inferred functions.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eIllumina sequencing provided high-throughput and high-quality data, but its limited read length precluded complete genome assembly. This affected the functional analysis, leading to an underestimation of the coding and non-coding genes. Nanopore sequencing yielded the longest reads, resulting in more contiguous assemblies, although it was impacted by higher error rates and the choice of assembly method. PacBio offered the best balance between read length and base accuracy, but with a lower number of reads. This affected genome coverage for a few specific taxa, influencing the quality of their assemblies, the completeness of MAGs (Metagenome Assembled Genomes), and the accuracy of functional annotation. Nevertheless, PacBio successfully retrieved MAGs for all mock community species, and the genomes annotation was consistent with the reference.\u003c/p\u003e\u003ch2\u003eConclusions\u003c/h2\u003e \u003cp\u003eThis study offers a valuable framework to guide the selection of sequencing strategies in metagenomic research. 
Understanding the strengths and limitations of each step of metagenomic workflows, from library preparation to bioinformatic analysis, is crucial for driving its ongoing optimization.\u003c/p\u003e","manuscriptTitle":"Shotgun metagenomics: a deep insight into the composition and function of the complex microbial world","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-01-30 01:01:08","doi":"10.21203/rs.3.rs-7581938/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2026-01-16T05:24:34+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-01-07T08:50:11+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-12-30T09:54:22+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"197444914511652828616302824916949333114","date":"2025-12-10T10:45:59+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"232128720252427274136390347213703561933","date":"2025-12-10T09:28:32+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2025-12-09T00:27:58+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2025-12-05T11:50:41+00:00","index":"","fulltext":""},{"type":"submitted","content":"Scientific Reports","date":"2025-12-04T15:36:30+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific 
Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"ebd17067-0b7a-4015-8c90-3eaa7f3eb200","owner":[],"postedDate":"January 30th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[],"tags":[],"updatedAt":"2026-04-16T08:08:18+00:00","versionOfRecord":[],"versionCreatedAt":"2026-01-30 01:01:08","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7581938","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7581938","identity":"rs-7581938","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
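
The Discussion compares assemblies by their N50 value (about 2 Mbp for the best long-read combinations). As a reminder of what that statistic measures, here is a minimal sketch, assuming only a list of contig lengths as input. The function name `n50` and the toy lengths are illustrative and are not taken from the paper; assembly evaluation tools such as QUAST report this statistic directly.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assemblies with the same total size (2,000 bp):
fragmented = [100] * 20          # many short contigs
contiguous = [1500, 300, 200]    # a few long contigs

print(n50(fragmented))   # 100
print(n50(contiguous))   # 1500
```

Both toy assemblies contain the same number of bases, yet the contiguous one has a far higher N50, which is why the statistic is used as a proxy for contiguity rather than for completeness or base accuracy.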
