Unraveling the Impact of Genome Assembly on Bacterial Typing: A One Health Perspective

doi:10.21203/rs.3.rs-4692225/v1

2024 · doi:10.21203/rs.3.rs-4692225/v1

Research Article, Research Square

Déborah Merda, Meryl Vila-Nova, Mathilde Bonis, Anne-Laure Boutigny, and 8 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-4692225/v1
This work is licensed under a CC BY 4.0 License.
Status: Published. The journal publication appeared on 08 Nov, 2024 in BMC Genomics. This text is Version 1, the latest preprint version.

Abstract

Background
In the context of pathogen surveillance, it is crucial to ensure interoperability and harmonized data. Several surveillance systems are designed to compare bacteria and identify outbreak clusters based on core genome MultiLocus Sequence Typing (cgMLST). Among the different approaches available to generate bacterial cgMLST, our research used an assembly-based approach (chewBBACA tool).
Methods
Simulations of short-read sequencing were conducted for 5 genomes of 27 pathogens of interest in animal, plant, and human health to evaluate the repeatability and reproducibility of cgMLST. Various quality parameters, such as read quality and depth of sequencing, were applied, and several read simulations and genome assemblies were repeated using three tools: SPAdes, Unicycler and Shovill. In vitro sequencing was also used to evaluate the impact of assembly on cgMLST results for 6 bacterial species: Bacillus thuringiensis, Listeria monocytogenes, Salmonella enterica, Staphylococcus aureus, Vibrio parahaemolyticus, and Xylella fastidiosa.

Results
The results highlighted variability in cgMLST, which appears unrelated to the assembly tools, but rather induced by the intrinsic composition of the genomes themselves. This variability observed in simulated sequencing was further validated with real data for five of the bacterial pathogens studied.

Conclusion
This highlights that the intrinsic genome composition affects assembly and the resulting cgMLST profiles, and that variability in bioinformatics tools can induce a bias in cgMLST profiles. In conclusion, we propose that the completeness of cgMLST schemes should be considered when clustering strains.

1. Introduction
In a One Health perspective, it is essential to maintain a global system of surveillance to better perceive and understand transmission events between animals, humans, and the environment. These surveillance systems need to be harmonized and to ensure interoperability between all the data generated so that they may be shared among all surveillance players, such as public health authorities, research institutions, and laboratories. These systems also involve several scientific domains, such as plant pathology or veterinary, medical, and food safety.
The importance of such sharing of data has recently been proven for real-time monitoring of outbreaks or pandemics, as highlighted during the SARS-CoV-2 pandemic or other recent virus outbreaks ( 1 ). Such systems are already used in bacteria monitoring systems to identify the origins and transmission routes of antimicrobial resistance ( 2 , 3 ), or to monitor food-associated pathogens. Recommendations have thus been proposed to facilitate collaboration around data ( 4 ), EFSA (2022). These recommendations suggested in particular (i) defining quality criteria so as to ensure data trustworthiness, and (ii) providing guidelines and reference analytical tools for data processing while limiting the impact of their storage. To implement these recommendations, current systems for bacteria surveillance are primarily based on typing results ( 5 ). The reference method for bacterial typing is multi-locus sequence typing (MLST), based on seven housekeeping genes. It was developed for the first time in 1998 with Neisseria meningitidis and since then, the number of schemes available in the pubMLST database has steadily increased to over 130, demonstrating the ongoing growth and diversification of this typing method over time ( 6 ). In the last few decades, the development of whole genome sequencing (WGS) has opened the path to gene-by-gene approaches to extend the MLST concept to all genes composing the core genome (cg) of bacterial species. This method, called cgMLST, is more discriminating than MLST due to its higher genome coverage level. Zoonotic and foodborne pathogen surveillance is increasingly based on these new approaches, and most of the surveillance initiative tools published recently recommend using cgMLST outputs for comparing bacterial strains and identifying clusters of genetically-related strains (PulseNet USA ( 7 ), GenoSalmSurv ( 8 ), EFSA (2022)). 
Recently, an outbreak caused by Listeria monocytogenes ST1247 was investigated in five European countries (Denmark, Estonia, Finland, France, and Sweden), using the cgMLST approach ( 9 ). In this study, only three allelic differences were found out of the 1744 loci detected from the 1748-loci cgMLST scheme (10). Likewise, this method was used to investigate the global outbreak caused by Salmonella Typhimurium ST34 in chocolate-based products between 2021 and 2022. Cases were reported in 12 European Union countries, the UK, Switzerland, USA, and Canada ( 11 ). Unlike methods based on read mapping, a variant that requires a reference genome to which reads are aligned, the gene-by-gene approach is reference-free, enabling better consideration of genetic variability among bacterial strains. Moreover, cgMLST appears to be less affected by homologous recombination than SNP analysis, and can be used to investigate outbreaks from highly recombinant pathogens like Pseudomonas aeruginosa ( 12 ), Salmonella enterica ( 13 ) or Xylella fastidiosa ( 14 ). Furthermore, it is straightforward to establish nomenclature systems that can be shared among multiple institutes and/or analyses, facilitating the creation of a global monitoring system. These schemes and sequence variants are publicly available in several databases, e.g., PubMLST ( https://pubmlst.org/ ), BIGSdb-Pasteur ( https://bigsdb.pasteur.fr ), EnteroBase ( https://enterobase.warwick.ac.uk/ ), cgmlst.org ( https://cgmlst.org/ncs ) from Ridom SeqSphere and Chewie-NS ( https://chewie-ns.readthedocs.io/en/latest/ ) ( 15 ). There are different approaches to calling alleles and obtaining cgMLST profiles. One of them maps direct reads to a scheme to call genes, as implemented in Mentalist ( 16 ). A second approach, implemented in ChewBBACA ( 17 ), is assembly-based, and requires genome assembly before calling cgMLST profiles. Various systems use it, like INNUENDO ( 18 ). 
ChewBBACA is also implemented in an interoperable system shared by the European Food Safety Authority (EFSA) and the European Centre for Disease Prevention and Control (ECDC), which was set up in 2019 to analyze foodborne outbreaks caused by Salmonella enterica , Listeria monocytogenes , and Escherichia coli ( 19 ). De novo assembly is a crucial step after sequencing to reconstruct the genomes of pathogens. Several pipelines designed to harmonize genome assembly have been published based on specific pathogens or institutes. These pipelines use de novo assembly tools like SPAdes ( 20 ), Shovill ( 21 ) or Unicycler ( 22 ), and short reads as the data input. One of the significant challenges in bacterial genome assembly is the use of short reads produced by next generation sequencing (NGS). Indeed, NGS tools can be easily impacted by genome composition, for example the occurrence of repeated sequences such as insertion sequences (IS), variable number tandem repeats (VNTRs), or homopolymers, which are very difficult to assemble. In addition, regions that vary greatly in GC composition have a poor sequencing coverage, leading to genome fragmentation ( 23 ). The aim of this study was to evaluate the impact of assembly tools on bacteria to highlight the need for pipeline harmonization and to share cgMLST profiles with the EFSA/ECDC system, where cgMLST analyses are performed with ChewBBACA. Twenty-seven bacterial species corresponding to significant pathogens from a One Health perspective were examined in this study. These species encompass foodborne, plant, and animal pathogens. We compared the three tools most frequently used for assembly purposes: SPAdes ( 20 ), Unicycler ( 22 ) and Shovill ( 21 ). The effect of the quality and depth of sequenced reads was evaluated on cgMLST results. The repeatability and reproducibility of analyses were also tested using both in silico and in vitro sequencing. 
We observed a major bioinformatic variability in the cgMLST profiles obtained, and therefore proposed recommendations to enhance interoperability between genomic results and to decrease the risk of excluding strains linked to each other in epidemic clusters.

2. Material and Methods

2.1 Experimental scheme
The genomes of 27 bacterial pathogen species (Bacillus cereus, Bacillus thuringiensis, Bacillus cytotoxicus, Brucella melitensis, Burkholderia mallei, Campylobacter spp., Citrobacter spp., Clostridium botulinum, Clostridium difficile, Clostridium perfringens, Escherichia coli, Klebsiella aerogenes, Leptospira interrogans, Listeria monocytogenes, Mycobacterium bovis, Mycobacterium tuberculosis, Neisseria meningitidis, Pseudomonas aeruginosa, Ralstonia solanacearum, Salmonella enterica, Staphylococcus argenteus, Staphylococcus aureus, Taylorella equigenitalis, Vibrio cholerae, Vibrio parahaemolyticus, Xylella fastidiosa, and Yersinia enterocolitica) were used to perform these analyses (Table S1). The species were chosen according to the interest in these pathogens for public health, and their risk in food safety. A minimum of five circularized genomes per species were randomly chosen from the public NCBI database, resulting in 140 genomes being analyzed. All strain accession numbers are available in the supplementary data (Table S1). The experimental design is presented in Fig. 1a.

Paired-end short reads of 150 bp were simulated using ART v. 2.3.7 (24) to mimic Illumina sequencing. Phred quality scores (Q) for Illumina sequencing are guaranteed to be at least 95% above Q30 for all platforms, such as MiSeq, HiSeq and NextSeq. Two quality levels were then simulated: greater than Q40 to simulate high-quality reads and less than Q40 to estimate the impact of low-quality reads. The depth of sequencing can also differ depending on the multiplexing and sequencing platforms chosen.
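As a reminder of what the Q30 and Q40 thresholds above mean, the standard Phred definition relates a quality score Q to a base-call error probability p by Q = -10 log10(p). The short sketch below only illustrates that conversion; the function names are hypothetical and it is not part of the ART simulation settings used in the study.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score Q into a base-call error probability."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert a base-call error probability into a Phred quality score."""
    return -10 * math.log10(p)


# Error rates implied by the two thresholds mentioned in the text.
for q in (30, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.4%} (about 1 error in {round(1 / p):,} bases)")
```

Under this definition, moving from Q30 to Q40 corresponds to a tenfold drop in the expected per-base error rate.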
Because sequencing depth can affect genome assembly results, five different depths were simulated: 25x, 50x, 75x, 100x and 150x. The reproducibility of assembly, tested by comparing assemblies and cgMLST typing following independent read simulations, was evaluated for three different simulated datasets of high-quality reads. Thus, a total of 2800 read sets were simulated (140 genomes, each undergoing 20 simulations). Read simulations were verified using fastp v. 0.20.1 (25).

2.2 Real dataset
In vitro sequencing data were used to validate simulation results for six bacterial species. The experimental design is presented in Fig. 1b. We used 28 different strains: five for Bacillus thuringiensis, five for Listeria monocytogenes, five for Salmonella enterica, five for Staphylococcus aureus, four for Vibrio parahaemolyticus, and four for Xylella fastidiosa (Table S2). DNA was extracted from all these strains and sequenced independently twice. Quality was assessed and reads were trimmed using fastp v. 0.20.1 (25). Finally, a total of 56 sequencing results were analyzed.

2.3 Assembly
In order to evaluate the impact of assembly tools on cgMLST typing, three tools were selected and run with default settings: SPAdes v.3.14.1 (20), Shovill v.1.0.9 (21), and Unicycler v.0.4.8 (22). All the simulated and real sequenced reads were assembled with these three tools. To validate the repeatability of genome assembly by comparing assemblies obtained with the same tool and the same dataset simulation, each tool was used independently three times on high-quality simulated reads with a Phred score above 40 and a depth exceeding 75x. Real sequenced reads were also assembled independently three times. In all, 12,156 assemblies were generated for simulated data and 1296 assemblies for in vitro data.

2.4 Typing
All assemblies listed in Table S1 (n = 140) were analyzed to generate the corresponding cgMLST profiles using chewBBACA v. 2.8.5, as recommended by the EFSA/ECDC system.
Whenever possible, we used publicly available schemes from cgmlst.org or Big-SDB (Table S3). For Taylorella equigenitalis and Xylella fastidiosa, unpublished schemes were used to obtain cgMLST profiles with chewBBACA. The EFSA/ECDC system recommends using chewBBACA v. 2.8.5 or more recent versions (19). In our study, cgMLST profiles were computed using chewBBACA v. 2.8.5 tools after assembly annotation using Prodigal (17).

2.5 Assembly quality parameters and visualization of cgMLST results
In order to compare assembly quality, four parameters from Quast results were analyzed (26). To evaluate genome fragmentation, we compared contig numbers and largest contig sizes in all the assemblies. To assess assembly accuracy, we used the number of misassemblies, detected by comparison with the initial genome, and NGA50. For each strain of all 27 species, assembly results were aligned to the initial reference genome used for read simulation with minimap2 (Li, 2018), as implemented in Quast. The alignment was used to visualize contig fragmentation and evaluate assembly reproducibility and repeatability. The Python libraries seaborn v. 0.11.2 (27) and Circos v. 0.1.3 (28) were used for all visualizations.

The cgMLST profiles of simulated datasets were compared by computing the allelic differences between genomes from NCBI and assembly results with GrapeTree v. 2.1 (29) after normalization. To obtain a completeness percentage for each scheme, this normalization step focused on the gene number in the scheme for each species analyzed (Table S3). The completeness was calculated on the basis of genes found by cgMLST analysis compared with the total number of genes in each scheme. The cgMLST results from real data were analyzed using the minimum spanning tree calculated with GrapeTree (29) and the MSTreeV2 method. These trees were visualized using the GrapeTree web application (achtman-lab.github.io/GrapeTree/MSTree_holder.html).
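The completeness and allelic-difference calculations described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the profile representation (a dictionary from locus name to allele identifier, with `None` for a locus that was not called), the function names, and the choice to count a locus called in only one of the two profiles as a difference are all assumptions made for illustration.

```python
from typing import Dict, Optional

# Hypothetical representation: locus name -> allele ID, None if the locus was not called.
Profile = Dict[str, Optional[int]]


def completeness(profile: Profile, scheme_size: int) -> float:
    """Percentage of scheme loci for which an allele was called."""
    found = sum(1 for allele in profile.values() if allele is not None)
    return 100.0 * found / scheme_size


def allelic_differences(reference: Profile, assembly: Profile) -> int:
    """Count loci whose call differs between two profiles.

    Assumption: a locus called in one profile and missing in the other
    counts as a difference; a locus missing in both does not.
    """
    loci = set(reference) | set(assembly)
    return sum(1 for locus in loci if reference.get(locus) != assembly.get(locus))


def difference_rate(reference: Profile, assembly: Profile, scheme_size: int) -> float:
    """Allelic differences normalized by the number of loci in the scheme (percent)."""
    return 100.0 * allelic_differences(reference, assembly) / scheme_size


# Toy five-locus scheme: one locus is lost in the assembly, one has a different allele.
reference = {"locus1": 1, "locus2": 4, "locus3": 2, "locus4": 7, "locus5": 3}
assembly = {"locus1": 1, "locus2": 4, "locus3": None, "locus4": 8, "locus5": 3}

print(completeness(assembly, scheme_size=5))
print(difference_rate(reference, assembly, scheme_size=5))
```

Normalizing by scheme size is what makes rates comparable across species whose schemes contain different numbers of loci.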
Results

Evaluation of assembly reproducibility according to sequencing quality using simulated data
A key requirement for sharing data between interoperable surveillance systems is to evaluate the repeatability and reproducibility of analysis and to propose quality criteria for data inclusion. The assembly tools chosen (SPAdes, Unicycler and Shovill) were selected because they have been frequently used in recently published workflows dedicated to bacterial WGS. We evaluated the impact of read quality on sequencing simulations for 27 bacterial species, and observed that poor data quality (Q < 40) decreases the quality of assembly: assemblies were impossible to draft with Shovill, because the tool did not accept the input data, or were shorter and more fragmented with SPAdes and Unicycler (Supplementary data S1). For Vibrio parahaemolyticus, the maximum number of contigs was 80 with high-quality data (Q > 40) but increased to 120 with poor-quality data. For some species, such as Bacillus cereus, Clostridium perfringens, Taylorella equigenitalis, Mycobacterium tuberculosis, and Ralstonia solanacearum, some genome parts were even missing from the final assembly obtained with a poor read quality (Supplementary data S2), in position 0 Mb for Bacillus cereus, 0.1 Mb for Clostridium perfringens, 4.0 Mb for Mycobacterium tuberculosis, and 2.8 Mb for Ralstonia solanacearum. Furthermore, the poor quality of reads also increased genome misassemblies compared with results obtained with a high read quality. Indeed, in Klebsiella aerogenes, at a depth of 75x, the maximum percentage of misassemblies was 40% with poor-quality reads whereas the figure for high-quality reads could drop as far as 0%. For example, in Mycobacterium bovis, there were 20% of misassemblies with poor-quality reads vs. 0% with high-quality reads; in Neisseria meningitidis these figures were 40% (poor quality) vs. 20% (high quality); in Staphylococcus argenteus they were 20% (poor quality) vs.
0%; and in Bacillus cereus, 20% (poor quality) vs. 7%. For Clostridium perfringens, the rate of misassemblies obtained with a poor read quality could reach 60% in some assemblies. For other species such as Campylobacter spp., Listeria monocytogenes, Escherichia coli or Vibrio cholerae, assembly results appeared to be less affected by a poor read quality (Supplementary data S1).

When we compared the impact of various sequencing depths, we observed an optimal threshold at 75x. At this value, the parameters indicate the best assembly quality, i.e., the number of contigs and misassemblies decrease, and both N50 and total length increase. Mann-Whitney tests comparing the distributions of the four parameters obtained at different sequencing depths were significant (Table S4). Results with 150x and 100x were identical. Comparing 25x with 100x, contig number distributions were significantly different for 10/27 species, N50 distributions significantly different for 21/27 species, misassemblies for 25/27 species, and largest contig for 16/27 species. For 50x, no difference was observed in contig number, N50 and largest contig, while misassembly distributions were different for 10/27 species. For 75x, no difference was observed in contig number, N50, and largest contig, while misassembly distributions were different for 6/27 species. Therefore, for the subsequent analyses, we present results derived from high-quality reads at a depth of 75x (Supplementary data S3).

Comparison of assembly tools with a high read quality and sufficient depth using simulated data
To determine which tool performs better in genome assembly, SPAdes, Shovill and Unicycler were compared using simulated sequencing data with a high quality and mean depth of 75x. Our results indicated that assembly repeatability does not depend on the tools used but instead appears to be genome-dependent.
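For readers less familiar with the contiguity metric used above, the following sketch computes a plain N50 from a list of contig lengths: the length of the contig at which the cumulative sum of contigs, sorted from longest to shortest, first reaches half of the total assembly length. It is an illustration only; the study took its values from Quast, and NGA50 additionally requires alignment to the reference, which is not reproduced here. The contig lengths are made up.

```python
from typing import Sequence


def n50(contig_lengths: Sequence[int]) -> int:
    """Return the N50 of an assembly given its contig lengths."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half of the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Two hypothetical assemblies of the same total length (290 kb, lengths in kb):
contiguous = [150, 80, 60]
fragmented = [80, 70, 50, 40, 30, 20]

print(n50(contiguous), n50(fragmented))
```

For the same total length, the more fragmented assembly has the lower N50, which is why N50 rising together with a falling contig count is read as improved assembly quality.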
An alignment of the generated assemblies to the reference used for the sequencing simulation revealed that both Shovill and Unicycler performed better for Listeria monocytogenes and Ralstonia solanacearum than for most of the 27 bacterial species (Fig. 2A). Interestingly, these tools fragmented the genome into similar genomic regions, which seem to correlate with variations in GC content across the genome. However, assembling the genome of Mycobacterium bovis and Xylella fastidiosa with the same assembly tool led to different results (Fig. 2B). Specifically, assembly replicates obtained from the same simulated dataset produced identical contigs, as was observed for all studied genomes in our dataset, but for these two species the assembly differed for each sequencing simulation dataset (i.e., read simulations obtained from the same genome).

Impact of assembly tools on cgMLST profiles using simulated data
Once the optimum quality criteria for sequencing were determined, the impact of assembly on cgMLST analyses was evaluated for the 21 species for which a cgMLST scheme was available. The cgMLST profiles obtained from high-quality sequencing (i.e., Q > 40) with sufficient depth (i.e., depth = 75x) classified bacterial species into two categories based on the allelic difference rates observed between the reference genome and the assemblies obtained (Fig. 3). Results from SPAdes consistently exhibited higher assembly fragmentation and misassemblies than those obtained with Shovill and Unicycler, and are therefore not presented here. The first category (group 1) comprised 14 out of 21 bacterial species that had less than 5% of errors between the reference and the assembly obtained. For group 1, results suggested that the choice of assembler should vary according to the species studied (Fig. 3A). Indeed, for Escherichia coli, Mycobacterium tuberculosis, Vibrio cholerae, and Taylorella equigenitalis, a significant difference (p-value < 5% for the Mann-Whitney test) was observed between Shovill and Unicycler results, suggesting that Shovill gave cgMLST profiles closest to the reference. However, for Neisseria meningitidis and Leptospira interrogans, the allelic profiles were closest to the reference when Unicycler was used, although no significant difference was observed when checked with the Mann-Whitney test. The second category (group 2) comprised 7 out of 21 bacterial species for which the number of allelic differences between the reference and the assembly obtained was greater than 5% (Fig. 3B), with a maximum of 30% for Salmonella enterica. Within group 2, few differences were observed between the results obtained from Shovill and Unicycler assemblies, suggesting that the choice of assembly tool may be negligible compared with the intrinsic genome composition, except for Campylobacter spp., for which a significant difference was observed between the result distributions of the two tools.

Comparison of cgMLST profiles obtained with different sequencing depths using simulated data
Related strains were identified by clustering cgMLST profiles obtained with different data quality and depth combinations. In open-source surveillance systems or applications, data of various qualities can be shared with a science community that has diverse internal sequencing capacities and/or quality thresholds. To evaluate the impact of various sequencing depths on cgMLST results, we compared simulated sequencing data associated with mean depths of 25x, 50x, and 75x. The number of allelic differences between reference cgMLST profiles and the cgMLST profiles obtained significantly increased for assemblies with a sequencing depth less than 75x for all species belonging to group 1 (Fig. 4a).
Only four out of 21 bacterial species, all belonging to group 2 previously described (i.e., greater than 5%), appeared not to be impacted by the quality of sequenced data: Bacillus cereus, Bacillus cytotoxicus, Bacillus thuringiensis , and Vibrio parahaemolyticus (Fig. 4 b), as no significant difference was observed. However, for other species—regardless of whether they belong to the first or second group previously described—the number of allelic differences was significantly higher with poor depth (Q < 40) using simulated sequencing data. These results underscored the importance of performing genomic typing on harmonized, high-quality data with a sufficient sequencing depth to investigate outbreaks. Confirmation of reproducibility and repeatability when sequencing real data To confirm the poor repeatability and reproducibility of cgMLST results obtained using simulated sequencing data and evaluate the impact on real data, we analyzed biological replicates of bacterial strains from six species. The cgMLST profiles were computed for each biological replicate to evaluate reproducibility, and bioinformatics analyses were performed in triplicate to investigate repeatability. The cgMLST profiles obtained using real data showed that the results were repeatable between analyses, as also observed with simulated sequencing. Indeed, the cgMLST profiles resulting from SPAdes and Unicycler assemblies were comparable between each replicate, indicating 100% repeatability, as no distance was observed between assemblies obtained from the same raw data (Fig. 5 ). However, poor reproducibility was observed between the biological replicates, with distances observed between the same strains for which raw data were provided from two independent extractions. This finding suggests that the wet lab part has a major impact on cgMLST profiles, despite using the same DNA extraction protocol for Salmonella enterica , Staphylococcus aureus , and Xylella fastidiosa . 
Indeed, only four out of 28 strains had identical profile results with Unicycler. With Shovill, repeatability seemed to be dependent on the species. For instance, for Listeria monocytogenes all analyses were 100% identical, whereas for Staphylococcus aureus, Vibrio parahaemolyticus , and Xylella fastidiosa the strains had different cgMLST profiles resulting from distinct assemblies. For Salmonella enterica and Bacillus thuringiensis , one and two strains, respectively, gave different cgMLST profiles between analyses, but only one gene was systematically affected. The cgMLST profiles for biological replicates were found to be identical for eight out of 28 analyzed strains (Fig. 5 ). These eight strains belong to Bacillus thuringiensis (two out of five strains), Listeria monocytogenes (four out of five strains), Vibrio parahaemolyticus (one out of four strains), and Salmonella enterica (one out of five strains). This level of reproducibility was mainly observed for the results generated by SPAdes and Unicycler, although only the Unicycler results maximized the completeness of the cgMLST scheme, i.e., more genes in the cgMLST scheme were found after Unicycler assembly. Conversely, with Shovill, only five strains had the same cgMLST profiles for biological replicates (one Bacillus thuringiensis , and four Listeria monocytogenes ), and only four strains gave profiles that were identical to the Unicycler results (one Bacillus thuringiensis and three Listeria monocytogenes ). The number of allelic differences between biological replicates was found to be elevated (22 allelic differences between two Listeria monocytogenes replicates or 184 between two Staphylococcus aureus replicates), suggesting potential ambiguity for closely-related strains (Fig. 5 ). 
Depending on the species and assembly tools used, the number of allelic differences between biological replicates varied significantly, ranging from 10 allelic differences for Bacillus thuringiensis , to 138 for Salmonella enterica with Unicycler. Results obtained for two closely-related strains of Xylella fastidiosa subsp. multiplex , both belonging to ST6 based on the MLST of seven housekeeping genes (Amandine Cunty, personal communication), were mixed for cgMLST results, whereas they were found to be distinguishable in SNP analyses (data not shown). These results suggested that for outbreak investigations using this method, it may be challenging to discriminate the strain responsible for the outbreak and consequently determine its source. Discussion cgMLST typing is one of the most widely used genomic methods for surveillance of bacterial pathogens. Our study aimed to investigate how the assembly step influences cgMLST profiles. Our results indicated that assembly-based cgMLST analyses, considering the entire scheme, may vary depending on the assembly method used. This represents a significant limitation for the gene-by-gene approach in interoperable systems, which aggregate data from various analytical pipelines. However, the observed differences, often referred to as false negatives, primarily involve genes that are missing rather than allelic differences potentially resulting in different allelic combinations. The results obtained in this study highlight an impact of assembly on cgMLST profiles that is greater for particular bacterial species. Indeed, genomic composition may influence assembly quality, leading to possible contig fragmentation within a cgMLST gene. Repeat sequences such as insertion sequences (IS) or VNTRs can influence assembly quality, among other factors. A previous study demonstrated that the number of contigs obtained after assembly was correlated with the number of repeat elements in genomes ( 30 ). 
The variability in GC content can also lead to non-reproducible analyses (31) due to biases introduced during sequencing, which alter sequencing depth in these regions (23). Moreover, increased variability in a genome leads to a higher degree of bias observed during sequencing. This bias affects all assembly methods using short reads, since the corresponding tools are not capable of effectively handling inconsistent sequencing depths. Although Unicycler showed better performance in reducing misassemblies than SPAdes (22) and Shovill, all three tools produced similar results in terms of genome contig fragmentation. The ability of a pathogen to capture external DNA by homologous recombination can directly impact GC content in recombination hotspots (32). Thus, the difficulty in assembling genomes could be more pronounced for bacterial species with more frequent homologous recombination.

Our results revealed two distinct groups with less than or more than 5% of allelic differences, respectively. Group 1, for which an allelic variation lower than 5% was described, included Listeria monocytogenes, Staphylococcus aureus, and Brucella melitensis, among others. For these species, mutations were identified as the primary evolutionary force responsible for polymorphism (33–35). In contrast, within the second group, exemplified by Xylella fastidiosa and Salmonella enterica, strains had cgMLST results that were significantly different from those of the reference, indicating that recombination was the main evolutionary force (14, 36).

In addition to intrinsic genomic composition, our results showed that sequencing quality affected cgMLST typing. A recent study conducted with four food pathogens (Campylobacter spp., Listeria monocytogenes, Salmonella enterica, and Escherichia coli) demonstrated variability induced by the wet lab part of WGS analyses (37). In our study, we observed that bioinformatics analyses could also introduce variability in results.
In a previous study based on read simulations, the authors proposed a depth threshold at 50x based on analyses carried out on the food pathogens Escherichia coli, Listeria monocytogenes, and Salmonella enterica [38]. It should be noted that the analyses were conducted on a single strain per species, using a single tool (SPAdes) to compare typing results. However, by increasing the number of strains and the diversity of species investigated, our results showed that the quality of assembly obtained from 50x affected the typing result, and this bias decreased with depths equal to or greater than 75x. In global monitoring systems, the diversity analyzed is even greater, and it is essential to evaluate these criteria for several distinct genomes per species. For this reason, we extended the study to 27 pathogens and included several genomes per species, allowing us to evaluate both the intra- and interspecies variability. This is why we proposed a minimum depth threshold of 75x for all pathogens.

Our results also showed that wet lab and bioinformatic variabilities can artificially increase the distance between related strains and thus impact outbreak investigations, potentially resulting in false negatives, with related strains appearing unrelated. Indeed, when analyzing an epidemiological cluster, it is crucial to identify both the strains within the cluster and those excluded. This is based on a computation of allelic distance between strains (i.e., the number of differences between two profiles). Below a specific threshold, strains are considered related (38, 39). Thresholds for cgMLST clustering have been proposed for several bacterial species, including Listeria monocytogenes (38), Escherichia coli (40), Staphylococcus aureus (41), and Pseudomonas aeruginosa (42), and several methods to estimate them have been developed based on modeling (38) or nonparametric statistics (39).
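To make the threshold-based notion of relatedness concrete, here is a minimal sketch that groups strains whose pairwise allelic distance is at or below a threshold. Several choices are assumptions made for illustration and are not taken from the study or from any surveillance system: single-linkage grouping, the "at or below" comparison, ignoring loci missing in either profile, and the toy profiles themselves.

```python
from itertools import combinations
from typing import Dict, List, Optional

Profile = Dict[str, Optional[int]]  # locus -> allele ID, None if not called (hypothetical format)


def allelic_distance(a: Profile, b: Profile) -> int:
    """Number of loci called in both profiles that carry different alleles.

    Assumption: loci missing in either profile are ignored.
    """
    return sum(
        1
        for locus, allele in a.items()
        if allele is not None and b.get(locus) is not None and allele != b[locus]
    )


def single_linkage_clusters(profiles: Dict[str, Profile], threshold: int) -> List[List[str]]:
    """Group strains connected by pairwise distances at or below the threshold."""
    parent = {name: name for name in profiles}

    def find(name: str) -> str:
        # Follow parent links to the cluster representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for x, y in combinations(profiles, 2):
        if allelic_distance(profiles[x], profiles[y]) <= threshold:
            parent[find(x)] = find(y)

    clusters: Dict[str, List[str]] = {}
    for name in profiles:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


profiles = {
    "strainA": {"l1": 1, "l2": 1, "l3": 1, "l4": 1},
    "strainB": {"l1": 1, "l2": 1, "l3": 1, "l4": 2},
    "strainC": {"l1": 5, "l2": 6, "l3": 7, "l4": 8},
}
print(single_linkage_clusters(profiles, threshold=1))
```

Because loci missing from either profile contribute nothing to the distance in this sketch, a poorly assembled genome can look closer to its neighbours than it really is, or a spurious allele call can push it out of a cluster.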
However, in monitoring systems such as Chewie-NS or GenoSalmSurv, the thresholds are applied exclusively to allelic differences, and the number of loci that were not found is frequently not taken into consideration. Yet, as we have shown in this study, genome quality can strongly affect the completeness of cgMLST results (i.e., the number of genes found during analysis). When fewer loci are compared, each allelic difference carries more weight. For example, the established threshold for Staphylococcus aureus is 24 different alleles to define a cluster of related strains [42], with a complete cgMLST scheme comprising 1861 genes. However, our results were obtained using only 1005 genes. Based on this reduction in the scheme's completeness, the threshold should be reduced proportionally to 13 different alleles for this specific clustering analysis (24 × 1005/1861 ≈ 13). Consequently, for outbreak investigations, it may be beneficial to include the value of scheme completeness (as defined by Palma et al. (2022)), and to propose quality criteria that maximize this value in monitoring systems. Other parameters, such as homologous recombination and GC content, could be taken into account by a gene-by-gene approach to scheme definition, as GC bias could lead to major genome fragmentation in assembly analyses. However, these propositions should be balanced against the need to consider some of the evolutionary history of outbreaks, given that GC variation and recombination reflect horizontal gene transfers (HGTs). These transfers are very important for the evolution of virulence among bacteria, as shown for Yersinia enterocolitica (43). As recently proposed by Duval et al., these thresholds should not be defined by species but rather either by outbreak, taking into account evolutionary parameters (such as mutation, duration, etc.) specific to outbreaks (38), or by specific lineages that could have a specific evolutionary mechanism (such as being highly clonal) compared with other lineages.
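The Staphylococcus aureus example above can be reproduced with a short Python sketch. The text gives only the numbers (24 alleles, 1861 genes in the full scheme, 1005 genes used, 13 alleles after adjustment); proportional rescaling by completeness is our reading of those numbers, not a formula stated by the authors, and the function name and rounding rule are assumptions.

```python
def scaled_threshold(full_threshold: int, loci_used: int, loci_in_scheme: int) -> int:
    """Rescale a clustering threshold by scheme completeness.

    Assumption: the threshold shrinks in proportion to the fraction of the
    cgMLST scheme that was actually used, rounded to the nearest integer.
    """
    if not 0 < loci_used <= loci_in_scheme:
        raise ValueError("loci_used must be between 1 and loci_in_scheme")
    completeness = loci_used / loci_in_scheme
    return round(full_threshold * completeness)


# Values quoted in the text for Staphylococcus aureus.
completeness = 1005 / 1861
print(f"completeness: {completeness:.1%}")  # completeness: 54.0%
print(scaled_threshold(24, 1005, 1861))     # 13
```

With the full scheme (1861 of 1861 genes), the function returns the original threshold of 24, so the adjustment only applies when completeness drops.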
Furthermore, the development of assembly-free methods, such as SNP approaches at the pangenome level, could facilitate outbreak investigations using the pangenome graph method.

Conclusion

Our study assessed the bioinformatic variability induced in bacterial typing analyses using the cgMLST method. By including foodborne and clinical pathogens, and by using simulated and real data, our findings led us to propose new practices for implementing this method in surveillance systems, such as integrating the notion of completeness into outbreak investigations and establishing minimum quality criteria for sequencing.

Declarations

Competing Interests

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Funding

This work was supported by the SPAAD unit’s internal resources.

Author Contribution

V.C. conceived and designed the experiments. D.M. designed the analytical strategy and performed analyses. M.V.N. participated in the analytical strategy and revised the paper. M.B., A.L.B., T.B., M.C., A.C., A.R., M.S., N.V., and C.Y. collected the samples and extracted the DNA for whole genome sequencing. D.M. and V.C. wrote and revised the paper. All the authors read and approved the final manuscript.

Acknowledgement

We thank Laurent Guillier for discussion about analyses, and we thank Delphine Libby-Claybrough for English editing.

Data Availability

Sequence data that support the findings of this study have been deposited in the NCBI with the primary accession code PRJNA1129992.

References

Oude Munnink BB, Sikkema RS, Nieuwenhuijse DF, Molenaar RJ, Munger E, Molenkamp R, et al. Transmission of SARS-CoV-2 on mink farms between humans and mink and back to humans. Science. 8 Jan 2021;371(6525):172–7.
Chakraborty T, Barbuddhe S. Enabling One Health solutions through genomics. Indian J Med Res. 2021;153(3):273.
Wheeler NE, Price V, Cunningham-Oakes E, Tsang KK, Nunn JG, Midega JT, et al. Innovations in genomic antimicrobial resistance surveillance. Lancet Microbe. 1 Dec 2023;4(12):e1063–70.
Timme RE, Wolfgang WJ, Balkey M, Venkata SLG, Randolph R, Allard M, et al. Optimizing open data to support one health: best practices to ensure interoperability of genomic data from bacterial pathogens. One Health Outlook. 2020;2(1):20.
Gerner-Smidt P, Hise K, Kincaid J, Hunter S, Rolando S, Hyytiä-Trees E, et al. PulseNet USA: A Five-Year Update. Foodborne Pathog Dis. Mar 2006;3(1):9–19.
Maiden MCJ, Bygraves JA, Fell E, Morelli G, Russel JE, Urwin R, et al. Multilocus sequence typing: a portable approach to the identification of clones within populations of pathogenic microorganisms. Proc Natl Acad Sci U S A. 1998;95:3140–5.
Scharff RL, Besser J, Sharp DJ, Jones TF, Peter GS, Hedberg CW. An Economic Evaluation of PulseNet: A Network for Foodborne Disease Surveillance. Am J Prev Med. May 2016;50(5 Suppl 1):S66–73.
Uelze L, Becker N, Borowiak M, Busch U, Dangel A, Deneke C, et al. Toward an Integrated Genome-Based Surveillance of Salmonella enterica in Germany. Front Microbiol [Internet]. 2021;12. doi:10.3389/fmicb.2021.626941. Available from: https://www.frontiersin.org/journals/microbiology/articles/
Mäesaar M, Mamede R, Elias T, Roasto M. Retrospective Use of Whole-Genome Sequencing Expands the Multicountry Outbreak Cluster of Listeria monocytogenes ST1247. Int J Genomics. 1 Apr 2021;2021:1–5.
Moura A, Tourdjman M, Leclercq A, Hamelin E, Laurent E, Fredriksen N, et al. Real-Time Whole-Genome Sequencing for Surveillance of Listeria monocytogenes, France. Emerg Infect Dis. Sep 2017;23(9):1462–70.
EFSA. Multi-country outbreak of monophasic Salmonella Typhimurium sequence type 34 linked to chocolate products – first update – 18 May 2022. EFSA Support Publ. Jun 2022;19(6).
Blanc DS, Magalhães B, Koenig I, Senn L, Grandbastien B.
Comparison of Whole Genome (wg-) and Core Genome (cg-) MLST (BioNumerics™) Versus SNP Variant Calling for Epidemiological Investigation of Pseudomonas aeruginosa. Front Microbiol. 22 Jul 2020;11.
Didelot X, Bowden R, Street T, Golubchik T, Spencer C, McVean G, et al. Recombination and population structure in Salmonella enterica. PLoS Genet. Jul 2011;7(7):e1002191.
Vanhove M, Retchless AC, Sicard A, Rieux A, Coletta-Filho HD, De La Fuente L, et al. Genomic Diversity and Recombination among Xylella fastidiosa Subspecies. Appl Environ Microbiol. Jul 2019;85(13).
Mamede R, Vila-Cerqueira P, Silva M, Carriço JA, Ramirez M. Chewie Nomenclature Server (chewie-NS): a deployable nomenclature server for easy sharing of core and whole genome MLST schemas. Nucleic Acids Res. 8 Jan 2021;49(D1):D660–6.
Feijao P, Yao HT, Fornika D, Gardy J, Hsiao W, Chauve C, et al. MentaLiST – A fast MLST caller for large MLST schemes. Microb Genomics. 1 Feb 2018;4(2).
Silva M, Machado MP, Silva DN, Rossi M, Moran-Gilad J, Santos S, et al. chewBBACA: A complete suite for gene-by-gene schema creation and strain identification. Microb Genomics. 1 Mar 2018;4(3).
Llarena AK, Ribeiro-Gonçalves BF, Nuno Silva D, Halkilahti J, Machado MP, Da Silva MS, et al. INNUENDO: A cross-sectoral platform for the integration of genomics in the surveillance of food-borne pathogens. EFSA Support Publ. 2018;15(11):1498E.
Costa G, Di Piazza G, Koevoets P, Iacono G, Liebana E, Pasinato L, et al. Guidelines for reporting Whole Genome Sequencing-based typing data through the EFSA One Health WGS System. EFSA Support Publ. Jun 2022;19(6).
Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. J Comput Biol. May 2012;19(5):455–77.
Seemann T. Shovill: faster SPAdes assembly of Illumina reads. 2017.
Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler:
Resolving bacterial genome assemblies from short and long sequencing reads. PLOS Comput Biol. 8 Jun 2017;13(6):e1005595.
Chen YC, Liu T, Yu CH, Chiang TY, Hwang CC. Effects of GC Bias in Next-Generation-Sequencing Data on De Novo Genome Assembly. PLoS ONE. 29 Apr 2013;8(4):e62856.
Huang W, Li L, Myers JR, Marth GT. ART: a next-generation sequencing read simulator. Bioinformatics. 15 Feb 2012;28(4):593–4.
Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 1 Sep 2018;34(17):i884–90.
Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 15 Apr 2013;29(8):1072–5.
Waskom M. seaborn: statistical data visualization. J Open Source Softw. 6 Apr 2021;6(60):3021.
Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R, Horsman D, et al. Circos: an information esthetic for comparative genomics. Genome Res. 2009;19(604):1639–45.
Zhou Z, Alikhan NF, Sergeant MJ, Luhmann N, Vaz C, Francisco AP, et al. GrapeTree: visualization of core genomic relationships among 100,000 bacterial pathogens. Genome Res. Sep 2018;28(9):1395–404.
Acuña-Amador L, Primot A, Cadieu E, Roulet A, Barloy-Hubler F. Genomic repeats, misassembly and reannotation: a case study with long-read resequencing of Porphyromonas gingivalis reference strains. BMC Genomics. 16 Dec 2018;19(1):54.
Mavromatis K, Land ML, Brettin TS, Quest DJ, Copeland A, Clum A, et al. The Fast Changing Landscape of Sequencing Technologies and Their Impact on Microbial Genome Assemblies and Annotation. PLoS ONE. 12 Dec 2012;7(12):e48837.
Lassalle F, Périan S, Bataillon T, Nesme X, Duret L, Daubin V. GC-Content Evolution in Bacterial Genomes: The Biased Gene Conversion Hypothesis Expands. PLOS Genet. 6 Feb 2015;11(2):e1004941.
den Bakker HC, Didelot X, Fortes ED, Nightingale K, Wiedmann M. Lineage specific recombination rates and microevolution in Listeria monocytogenes. BMC Evol Biol. 2008;8(1):277.
Fraser C, Hanage WP, Spratt BG.
Neutral microepidemic evolution of bacterial pathogens. Proc Natl Acad Sci. 8 Feb 2005;102(6):1968–73.
Vishnu US, Sankarasubramanian J, Sridhar J, Gunasekaran P, Rajendhran J. Identification of Recombination and Positively Selected Genes in Brucella. Indian J Microbiol. 29 Dec 2015;55(4):384–91.
Park CJ, Andam CP. Distinct but Intertwined Evolutionary Histories of Multiple Salmonella enterica Subspecies. mSystems. 11 Feb 2020;5(1).
Forth LF, Brinks E, Denay G, Fawzy A, Fiedler S, Fuchs J, et al. Impact of wet-lab protocols on quality of whole-genome short-read sequences from foodborne microbial pathogens. Front Microbiol. 29 Nov 2023;14.
Duval A, Opatowski L, Brisse S. Defining genomic epidemiology thresholds for common-source bacterial outbreaks: a modelling study. Lancet Microbe. May 2023;4(5):e349–57.
Radomski N, Cadel-Six S, Cherchame E, Felten A, Barbet P, Palma F, et al. A Simple and Robust Statistical Method to Define Genetic Relatedness of Samples Related to Outbreaks at the Genomic Scale – Application to Retrospective Salmonella Foodborne Outbreak Investigations. Front Microbiol. 24 Oct 2019;10.
Schürch AC, Arredondo-Alonso S, Willems RJL, Goering RV. Whole genome sequencing options for bacterial strain typing and epidemiologic analysis based on single nucleotide polymorphism versus gene-by-gene–based approaches. Clin Microbiol Infect. Apr 2018;24(4):350–4.
Lagos AC, Sundqvist M, Dyrkell F, Stegger M, Söderquist B, Mölling P. Evaluation of within-host evolution of methicillin-resistant Staphylococcus aureus (MRSA) by comparing cgMLST and SNP analysis approaches. Sci Rep. 22 Jun 2022;12(1):10541.
Martak D, Meunier A, Sauget M, Cholley P, Thouverez M, Bertrand X, et al. Comparison of pulsed-field gel electrophoresis and whole-genome-sequencing-based typing confirms the accuracy of pulsed-field gel electrophoresis for the investigation of local Pseudomonas aeruginosa outbreaks. J Hosp Infect. Aug 2020;105(4):643–7.
Karlsson PA, Tano E, Jernberg C, Hickman RA, Guy L, Järhult JD, et al. Molecular Characterization of Multidrug-Resistant Yersinia enterocolitica From Foodborne Outbreaks in Sweden. Front Microbiol. 2021;12:664665.

Supplementary Files

SupplementalTableS1S2S3S4.xlsx
SupplementaldataS1.docx
SupplementaldataS2.docx
SupplementaldataS3.docx

Status: Published. Journal publication published 08 Nov, 2024 (BMC Genomics). Editorial decision: revision requested 08 Jul, 2024. Editor assigned by journal 05 Jul, 2024. Submission checks completed at journal 05 Jul, 2024. First submitted to journal 05 Jul, 2024.
Authors: Déborah Merda (corresponding author), Meryl Vila-Nova, Mathilde Bonis, Anne-Laure Boutigny, Thomas Brauge, Marina Cavaiuolo, Amandine Cunty, Antoine Regnier, Maroua Sayeb, Noémie Vingadassalon, Claire Yvon, and Virginie Chesnais (all University Paris Est, ANSES).

Figure 1. Representation of our study’s experimental design. a. Experimental design for simulated data. Quality parameters, including read quality and sequencing depth, were assessed. Reproducibility was evaluated through three simulated read datasets. The repeatability of each assembler (SPAdes, Shovill and Unicycler) was tested with three repetitions of an assembly. b. Experimental design for real data. Two independent DNA extractions from each strain were sequenced independently. The three assemblers were compared for each sequencing dataset, and repeatability was assessed through three repetitions of an assembly.

Figure 2. Circos plots of assembled contig alignments to a reference genome used for read simulations. The three simulated sequencing datasets and the three replicates for each assembly tool are represented (27 assemblies per genome) for a depth of 75x. The results from SPAdes are in green, those from Shovill in turquoise, and those from Unicycler in dark turquoise.

Figure 3. Violin plot of allelic distribution rates according to the gene number in the cgMLST scheme. The results obtained with Unicycler (red) and with Shovill (grey) assemblies were obtained using simulated reads with a Phred score greater than Q40 and a depth of 75x. a: species for which the distribution of allelic difference rates is less than 5%. b: species for which the distribution of allelic difference rates is greater than 5%. The p-values were calculated with the non-parametric Mann-Whitney test, and significance is represented by *.

Figure 4. Violin plot of distribution rates of allelic differences to genes number in the cgMLST scheme. Results obtained from simulated data with a Phred score greater than Q40 and a sequencing depth of 75x are in red, and those with a depth lower than 75x are in grey. Shovill was used for assemblies. a: species for which the distribution of allelic difference rates is less than 5%. b: species for which the distribution of allelic difference rates is greater than 5%. The p-values were calculated according to the non-parametric Mann-Whitney test, and significance is denoted by *.

Figure 5. Minimum spanning tree (MST) obtained from cgMLST profiles using real data. From left to right: results from SPAdes, Shovill and Unicycler. Each color represents one strain, for which two biological replicates were performed; the circle size indicates the number of assemblies sharing the same cgMLST profile, and allelic differences are indicated on the branches. The completeness value corresponds to the percentage of the gene scheme used to perform analyses.
19:28:46","extension":"docx","order_by":3,"title":"","display":"","copyAsset":false,"role":"supplement","size":1622649,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaldataS2.docx","url":"https://assets-eu.researchsquare.com/files/rs-4692225/v1/a6a48d973b675fd521421e6e.docx"},{"id":61439833,"identity":"183a866d-0347-4a13-ba95-31624107bc42","added_by":"auto","created_at":"2024-07-30 19:28:46","extension":"docx","order_by":4,"title":"","display":"","copyAsset":false,"role":"supplement","size":5255004,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaldataS3.docx","url":"https://assets-eu.researchsquare.com/files/rs-4692225/v1/3434319735bac53a07790286.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"Unraveling the Impact of Genome Assembly on Bacterial Typing: A One Health Perspective","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003e \u003cdiv class=\"BlockQuote\"\u003e \u003cp\u003eIn a One Health perspective, it is essential to maintain a global system of surveillance to better perceive and understand transmission events between animals, humans, and the environment. These surveillance systems need to be harmonized and to ensure interoperability between all the data generated so that they may be shared among all surveillance players, such as public health authorities, research institutions, and laboratories. These systems also involve several scientific domains, such as plant pathology or veterinary, medical, and food safety. The importance of such sharing of data has recently been proven for real-time monitoring of outbreaks or pandemics, as highlighted during the SARS-CoV-2 pandemic or other recent virus outbreaks (\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e). 
Such systems are already used in bacteria monitoring systems to identify the origins and transmission routes of antimicrobial resistance (\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e, \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e), or to monitor food-associated pathogens. Recommendations have thus been proposed to facilitate collaboration around data (\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e), EFSA (2022). These recommendations suggested in particular (i) defining quality criteria so as to ensure data trustworthiness, and (ii) providing guidelines and reference analytical tools for data processing while limiting the impact of their storage. To implement these recommendations, current systems for bacteria surveillance are primarily based on typing results (\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eThe reference method for bacterial typing is multi-locus sequence typing (MLST), based on seven housekeeping genes. It was developed for the first time in 1998 with \u003cem\u003eNeisseria meningitidis\u003c/em\u003e and since then, the number of schemes available in the pubMLST database has steadily increased to over 130, demonstrating the ongoing growth and diversification of this typing method over time (\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e). In the last few decades, the development of whole genome sequencing (WGS) has opened the path to gene-by-gene approaches to extend the MLST concept to all genes composing the core genome (cg) of bacterial species. 
This method, called cgMLST, is more discriminating than MLST due to its higher genome coverage level.\u003c/p\u003e \u003cp\u003eZoonotic and foodborne pathogen surveillance is increasingly based on these new approaches, and most of the surveillance initiative tools published recently recommend using cgMLST outputs for comparing bacterial strains and identifying clusters of genetically-related strains (PulseNet USA (\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e), GenoSalmSurv (\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e), EFSA (2022)). Recently, an outbreak caused by \u003cem\u003eListeria monocytogenes\u003c/em\u003e ST1247 was investigated in five European countries (Denmark, Estonia, Finland, France, and Sweden), using the cgMLST approach (\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e). In this study, only three allelic differences were found out of the 1744 loci detected from the 1748-loci cgMLST scheme (10). Likewise, this method was used to investigate the global outbreak caused by \u003cem\u003eSalmonella Typhimurium\u003c/em\u003e ST34 in chocolate-based products between 2021 and 2022. Cases were reported in 12 European Union countries, the UK, Switzerland, USA, and Canada (\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eUnlike methods based on read mapping, a variant that requires a reference genome to which reads are aligned, the gene-by-gene approach is reference-free, enabling better consideration of genetic variability among bacterial strains. 
Moreover, cgMLST appears to be less affected by homologous recombination than SNP analysis, and can be used to investigate outbreaks from highly recombinant pathogens like \u003cem\u003ePseudomonas aeruginosa\u003c/em\u003e (\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e), \u003cem\u003eSalmonella enterica\u003c/em\u003e (\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e) or \u003cem\u003eXylella fastidiosa\u003c/em\u003e (\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e). Furthermore, it is straightforward to establish nomenclature systems that can be shared among multiple institutes and/or analyses, facilitating the creation of a global monitoring system. These schemes and sequence variants are publicly available in several databases, e.g., PubMLST (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://pubmlst.org/\u003c/span\u003e\u003cspan address=\"https://pubmlst.org/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e), BIGSdb-Pasteur (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://bigsdb.pasteur.fr\u003c/span\u003e\u003cspan address=\"https://bigsdb.pasteur.fr\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e), EnteroBase (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://enterobase.warwick.ac.uk/\u003c/span\u003e\u003cspan address=\"https://enterobase.warwick.ac.uk/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e), cgmlst.org (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://cgmlst.org/ncs\u003c/span\u003e\u003cspan address=\"https://cgmlst.org/ncs\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) from Ridom SeqSphere and Chewie-NS (\u003cspan class=\"ExternalRef\"\u003e\u003cspan 
class=\"RefSource\"\u003ehttps://chewie-ns.readthedocs.io/en/latest/\u003c/span\u003e\u003cspan address=\"https://chewie-ns.readthedocs.io/en/latest/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) (\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e). There are different approaches to calling alleles and obtaining cgMLST profiles. One of them maps direct reads to a scheme to call genes, as implemented in Mentalist (\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e). A second approach, implemented in ChewBBACA (\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e), is assembly-based, and requires genome assembly before calling cgMLST profiles. Various systems use it, like INNUENDO (\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e). ChewBBACA is also implemented in an interoperable system shared by the European Food Safety Authority (EFSA) and the European Centre for Disease Prevention and Control (ECDC), which was set up in 2019 to analyze foodborne outbreaks caused by \u003cem\u003eSalmonella enterica\u003c/em\u003e, \u003cem\u003eListeria monocytogenes\u003c/em\u003e, and \u003cem\u003eEscherichia coli\u003c/em\u003e (\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003cem\u003eDe novo\u003c/em\u003e assembly is a crucial step after sequencing to reconstruct the genomes of pathogens. Several pipelines designed to harmonize genome assembly have been published based on specific pathogens or institutes. 
These pipelines use \u003cem\u003ede novo\u003c/em\u003e assembly tools like SPAdes (\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e), Shovill (\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e) or Unicycler (\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e), and short reads as the data input. One of the significant challenges in bacterial genome assembly is the use of short reads produced by next generation sequencing (NGS). Indeed, NGS tools can be easily impacted by genome composition, for example the occurrence of repeated sequences such as insertion sequences (IS), variable number tandem repeats (VNTRs), or homopolymers, which are very difficult to assemble. In addition, regions that vary greatly in GC composition have a poor sequencing coverage, leading to genome fragmentation (\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e).\u003c/p\u003e \u003c/div\u003e \u003c/p\u003e \u003cp\u003eThe aim of this study was to evaluate the impact of assembly tools on bacteria to highlight the need for pipeline harmonization and to share cgMLST profiles with the EFSA/ECDC system, where cgMLST analyses are performed with ChewBBACA. Twenty-seven bacterial species corresponding to significant pathogens from a One Health perspective were examined in this study. These species encompass foodborne, plant, and animal pathogens. We compared the three tools most frequently used for assembly purposes: SPAdes (\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e), Unicycler (\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e) and Shovill (\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e). The effect of the quality and depth of sequenced reads was evaluated on cgMLST results. 
The repeatability and reproducibility of analyses were also tested using both \u003cem\u003ein silico\u003c/em\u003e and \u003cem\u003ein vitro\u003c/em\u003e sequencing. We observed a major bioinformatic variability in the cgMLST profiles obtained, and therefore proposed recommendations to enhance interoperability between genomic results and to decrease the risk of excluding strains linked to each other in epidemic clusters.\u003c/p\u003e"},{"header":"2. Material and Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003e2.1 Experimental scheme\u003c/h2\u003e \u003cp\u003eThe genomes of 27 bacterial pathogen species\u0026mdash;\u003cem\u003eBacillus cereus\u003c/em\u003e, \u003cem\u003eBacillus thuringiensis\u003c/em\u003e, \u003cem\u003eBacillus cytotoxicus, Brucella melitensis, Burkholderia mallei, Campylobacter spp., Citrobacter spp., Clostridium botulinum, Clostridium difficile, Clostridium perfringens, Escherichia coli, Klebsiella aerogenes, Leptospira interrogans, Listeria monocytogenes, Mycobacterium bovis, Mycobacterium tuberculosis, Neisseria meningitidis, Pseudomonas aeruginosa, Ralstonia solanacearum, Salmonella enterica, Staphylococcus argenteus, Staphylococcus aureus, Taylorella equigenitalis, Vibrio cholerae, Vibrio parahaemolyticus, Xylella fastidiosa\u003c/em\u003e, and \u003cem\u003eYersinia enterocolitica\u003c/em\u003e\u0026mdash;were used to perform these analyses (Table \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e). The species were chosen according to their public health interest and their food safety risk. A minimum of five circularized genomes per species were randomly chosen from the public NCBI database, resulting in 140 genomes being analyzed. 
All strain accession numbers are available in the supplementary data (Table \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eThe experimental design is presented in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ea. Paired-end short reads of 150 bp were simulated using ART v. 2.3.7 (\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e) to mimic Illumina sequencing. Phred quality scores (Q) for Illumina sequencing are guaranteed to be at least 95% above Q30 for all platforms, such as MiSeq, HiSeq and NextSeq.\u0026nbsp;Two quality scores were then simulated: greater than Q40 to simulate high-quality reads and less than Q40 to estimate the impact of low-quality reads. The depth of sequencing can also differ depending on the multiplexing and sequencing platforms chosen. Because sequencing depth can affect genome assembly results, five different depths were simulated: 25x, 50x, 75x, 100x and 150x. The reproducibility of assembly, tested by comparing assembly following independent read simulations and cgMLST typing, was evaluated for three different simulated datasets of high-quality reads. Thus, a total of 2800 read datasets were simulated, with each of the 140 genomes undergoing 20 simulations. Read simulations were verified using fastp v. 0.20.1 (\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e2.2 Real dataset\u003c/h2\u003e \u003cp\u003e \u003cem\u003eIn vitro\u003c/em\u003e sequencing data were used to validate simulation results for six bacterial species. The experimental design is presented in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eb. 
We used 28 different strains: five for \u003cem\u003eBacillus thuringiensis\u003c/em\u003e, five for \u003cem\u003eListeria monocytogenes\u003c/em\u003e, five for \u003cem\u003eSalmonella enterica\u003c/em\u003e, five for \u003cem\u003eStaphylococcus aureus\u003c/em\u003e, four for \u003cem\u003eVibrio parahaemolyticus\u003c/em\u003e, and four for \u003cem\u003eXylella fastidiosa\u003c/em\u003e (Table \u003cspan refid=\"MOESM2\" class=\"InternalRef\"\u003eS2\u003c/span\u003e). DNA was extracted from all these strains and sequenced independently twice. Quality was assessed and reads were trimmed using fastp v. 0.20.1 (\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e). Finally, a total of 56 sequencing results were analyzed.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e2.3 Assembly\u003c/h2\u003e \u003cp\u003eIn order to evaluate the impact of assembly tools on cgMLST typing, three tools were selected: SPAdes v.3.14.1 (\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e), Shovill v.1.0.9 (\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e), and Unicycler v.0.4.8 (\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e) using default settings. All the simulated and real sequenced reads were assembled with these three tools. To validate the repeatability of genome assembly by comparing assemblies obtained with the same tool and the same dataset simulation, each tool was used independently three times on high-quality simulated reads with a Phred score above 40 and a depth exceeding 75x. Real sequenced reads were also assembled independently three times. 
In all, 12,156 assemblies were generated for simulated data and 1296 assemblies for \u003cem\u003ein vitro\u003c/em\u003e data.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003e2.4 Typing\u003c/h2\u003e \u003cp\u003eAll assemblies listed in Table \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e (n\u0026thinsp;=\u0026thinsp;140) were analyzed to generate the corresponding cgMLST profiles using chewBBACA v. 2.8.5, as recommended by the EFSA/ECDC system. Whenever possible, we used publicly available schemes from cgmlst.org or BIGSdb (Table \u003cspan refid=\"MOESM3\" class=\"InternalRef\"\u003eS3\u003c/span\u003e). For \u003cem\u003eTaylorella equigenitalis\u003c/em\u003e and \u003cem\u003eXylella fastidiosa\u003c/em\u003e, unpublished schemes were used to obtain cgMLST profiles with chewBBACA. The EFSA/ECDC system recommends using chewBBACA v. 2.8.5 or more recent versions (\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e). In our study, cgMLST profiles were computed using chewBBACA v. 2.8.5 tools after assembly annotation using Prodigal (\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003e2.5 Assembly quality parameters and visualization of cgMLST results\u003c/h2\u003e \u003cp\u003eIn order to compare assembly quality, four parameters from Quast results were analyzed (\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e). To evaluate genome fragmentation, we compared contig numbers and largest contig sizes in all the assemblies. 
To assess assembly accuracy, we used the number of misassemblies detected by comparison with the initial genome, and NGA50.\u003c/p\u003e \u003cp\u003eFor each strain of all 27 species, assembly results were aligned with minimap2 (Li, 2018) implemented in Quast to the initial reference genome used for read simulation. Alignment was used to visualize contig fragmentation and evaluate assembly reproducibility and repeatability. The Python libraries seaborn v. 0.11.2 (\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e) and Circos v. 0.1.3 (\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e) were used for all visualizations.\u003c/p\u003e \u003cp\u003eThe cgMLST profiles of simulated datasets were compared by computing the allelic differences between genomes from NCBI and assembly results with GrapeTree v. 2.1 (\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e) after normalization. To obtain a completeness percentage for each scheme, this normalization step focused on the gene number in the scheme for each species analyzed (Table \u003cspan refid=\"MOESM3\" class=\"InternalRef\"\u003eS3\u003c/span\u003e). The completeness was calculated on the basis of genes found by cgMLST analysis compared with the total number of genes in each scheme. The cgMLST results from real data were analyzed using the minimum spanning tree calculated with GrapeTree (\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e) and the MSTreeV2 method. 
These trees were visualized using the GrapeTree web application (achtman-lab.github.io/GrapeTree/MSTree_holder.html).\u003c/p\u003e "},{"header":"Results","content":" \u003cp\u003e \u003cb\u003eEvaluation of assembly reproducibility according to sequencing quality using simulated data\u003c/b\u003e \u003c/p\u003e \u003cp\u003eA key requirement for sharing data between interoperable surveillance systems is to evaluate the repeatability and reproducibility of analysis and to propose quality criteria for data inclusion. The assembly tools chosen (SPAdes, Unicycler and Shovill) were selected because they have been frequently used in recently published workflows dedicated to bacterial WGS. We evaluated the impact of read quality on sequencing simulations for 27 bacterial species, and observed that poor data quality (Q\u0026thinsp;\u0026lt;\u0026thinsp;40) decreases the quality of assembly: assemblies were impossible to draft with Shovill, because the tool did not accept input data, or were shorter and more fragmented with SPAdes and Unicycler (Supplementary data S1). For \u003cem\u003eVibrio parahaemolyticus\u003c/em\u003e, the maximum number of contigs was 80 with high-quality data (Q\u0026thinsp;\u0026gt;\u0026thinsp;40) but increased to 120 with poor-quality data. 
For some species, such as \u003cem\u003eBacillus cereus, Clostridium perfringens\u003c/em\u003e, \u003cem\u003eTaylorella equigenitalis, Mycobacterium tuberculosis\u003c/em\u003e, and \u003cem\u003eRalstonia solanacearum\u003c/em\u003e, some genome parts were even missing from the final assembly obtained with a poor read quality (Supplementary data S2), in position 0 Mb for \u003cem\u003eBacillus cereus\u003c/em\u003e, 0.1 Mb for \u003cem\u003eClostridium perfringens\u003c/em\u003e, 4.0 Mb for \u003cem\u003eMycobacterium tuberculosis\u003c/em\u003e, and 2.8 Mb for \u003cem\u003eRalstonia solanacearum\u003c/em\u003e.\u003c/p\u003e \u003cp\u003eFurthermore, the poor quality of reads also increased genome misassemblies compared with results obtained with a high read quality. Indeed, in \u003cem\u003eKlebsiella aerogenes\u003c/em\u003e, at a depth of 75x, the maximum percentage of misassemblies was 40% with poor-quality reads whereas the figure for high-quality reads could drop as far as 0%. For example, in \u003cem\u003eMycobacterium bovis\u003c/em\u003e, there were 20% misassemblies with poor-quality reads vs. 0% with high-quality reads; in \u003cem\u003eNeisseria meningitidis\u003c/em\u003e these figures were 40% (poor quality) vs. 20% (high quality); in \u003cem\u003eStaphylococcus argenteus\u003c/em\u003e they were 20% (poor quality) vs. 0%; and in \u003cem\u003eBacillus cereus\u003c/em\u003e, 20% (poor quality) vs. 7%. For \u003cem\u003eClostridium perfringens\u003c/em\u003e, the rate of misassemblies obtained with a poor read quality could reach 60% in some assemblies. 
For other species such as \u003cem\u003eCampylobacter spp., Listeria monocytogenes\u003c/em\u003e, \u003cem\u003eEscherichia coli\u003c/em\u003e or \u003cem\u003eVibrio cholerae\u003c/em\u003e, assembly results appeared to be less affected by a poor read quality (Supplementary data S1).\u003c/p\u003e \u003cp\u003eWhen we compared the impact of various sequencing depths, we observed an optimal threshold at 75x. At this value, parameters representing high-quality assembly are optimized, i.e., the number of contigs and misassemblies decrease, and both N50 and total length increase. Mann-Whitney tests used to compare the distributions of the four parameters obtained at different sequencing depths were significant (Table \u003cspan refid=\"MOESM4\" class=\"InternalRef\"\u003eS4\u003c/span\u003e). Results with 150x and 100x were identical. Comparing 25x with 100x, contig number distributions were significantly different for 10/27 species, N50 distributions significantly different for 21/27 species, misassemblies for 25/27 species, and largest contig for 16/27 species. For 50x, no difference was observed in contig number, N50 and largest contig, while misassembly distributions were different for 10/27 species. For 75x, no difference was observed in contig number, N50, and largest contig, while misassembly distributions were different for 6/27 species. Therefore, for the subsequent analyses, we present results derived from high-quality reads at a depth of 75x (Supplementary data S3).\u003c/p\u003e \u003cp\u003e \u003cb\u003eComparison of assembly tools with a high read quality and sufficient depth using simulated data\u003c/b\u003e \u003c/p\u003e \u003cp\u003eTo determine which tool performs better in genome assembly, SPAdes, Shovill and Unicycler were compared using simulated sequencing data with a high quality and mean depth of 75x. Our results indicated that assembly repeatability does not depend on the tools used but instead appears to be genome-dependent. 
An alignment of the generated assemblies to the reference used for the sequencing simulation revealed that both Shovill and Unicycler performed better for \u003cem\u003eListeria monocytogenes\u003c/em\u003e and \u003cem\u003eRalstonia solanacearum\u003c/em\u003e than for most of the 27 bacterial species (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eA). Interestingly, these tools fragmented the genome into similar genomic regions, which seem to correlate with variations in GC content across the genome. However, assembling the genome of \u003cem\u003eMycobacterium bovis\u003c/em\u003e and \u003cem\u003eXylella fastidiosa\u003c/em\u003e with the same assembly tool led to different results (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eB). Specifically, assembly replicates obtained from the same simulated dataset produced identical contigs, as was observed for all studied genomes in our dataset, but for these two species the assembly differed for each sequencing simulation dataset (i.e., read simulations obtained from the same genome).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eImpact of assembly tools on cgMLST profiles using simulated data\u003c/p\u003e \u003cp\u003eOnce the optimum quality criteria for sequencing were determined, the impact on cgMLST analyses was evaluated for 21 species for which a cgMLST scheme was available. The cgMLST profiles obtained from high-quality sequencing (i.e., Q\u0026thinsp;\u0026gt;\u0026thinsp;40) with sufficient depth (i.e., depth\u0026thinsp;=\u0026thinsp;75X) classified bacterial species into two categories based on the allelic difference rates observed between the reference genome and the assemblies obtained (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e). 
Results from SPAdes consistently exhibited higher assembly fragmentation and misassemblies than those obtained with Shovill and Unicycler, and are therefore not presented here. The first category (group 1) comprised 14 out of 21 bacterial species that had less than 5% of errors between the reference and the assembly obtained. For group 1, results suggested that the choice of assembler should vary according to the species studied (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eA). Indeed, for \u003cem\u003eEscherichia coli\u003c/em\u003e, \u003cem\u003eMycobacterium tuberculosis\u003c/em\u003e, \u003cem\u003eVibrio cholerae\u003c/em\u003e, and \u003cem\u003eTaylorella equigenitalis\u003c/em\u003e, a significant difference (\u003cem\u003ep-value\u003c/em\u003e\u0026thinsp;\u0026lt;\u0026thinsp;5% for Mann-Whitney test) was observed between Shovill and Unicycler results, suggesting that Shovill gave cgMLST profiles closest to the reference. However, for \u003cem\u003eNeisseria meningitidis\u003c/em\u003e and \u003cem\u003eLeptospira interrogans\u003c/em\u003e, the allelic profiles were closest to the reference when Unicycler was used, although no significant difference was observed when checked with the Mann-Whitney test.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe second category (group 2) comprised 7 out of 21 bacterial species for which the number of allelic differences between the reference and the assembly obtained was greater than 5% (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eB), with a maximum of 30% for \u003cem\u003eSalmonella enterica\u003c/em\u003e. 
Within group 2, few differences were observed between the results obtained from Shovill and Unicycler assemblies, suggesting that the choice of assembly tool may be negligible compared with the intrinsic genome composition, except for \u003cem\u003eCampylobacter spp.\u003c/em\u003e for which a significant difference was observed between distribution results from the two tools.\u003c/p\u003e \u003cp\u003e \u003cb\u003eComparison of cgMLST profiles obtained with different sequencing depths using simulated data\u003c/b\u003e \u003cdiv class=\"BlockQuote\"\u003e \u003cp\u003eRelated strains were identified by clustering cgMLST profiles obtained with different data quality and depth combinations. In open-source surveillance systems or applications, various data qualities can be shared with the science community with diverse internal sequencing capacities and/or quality thresholds. To evaluate the impact of various sequencing depths on cgMLST results, we compared simulated sequencing data associated with mean depths of 25x, 50x, and 75x. The number of allelic differences between reference cgMLST profiles and cgMLST profiles obtained significantly increased for assemblies with a sequencing depth less than 75x for all species belonging to group 1 (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ea). Only four out of 21 bacterial species, all belonging to group 2 previously described (i.e., greater than 5%), appeared not to be impacted by the quality of sequenced data: \u003cem\u003eBacillus cereus, Bacillus cytotoxicus, Bacillus thuringiensis\u003c/em\u003e, and \u003cem\u003eVibrio parahaemolyticus\u003c/em\u003e (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eb), as no significant difference was observed. 
However, for other species\u0026mdash;regardless of whether they belong to the first or second group previously described\u0026mdash;the number of allelic differences was significantly higher with poor depth (Q\u0026thinsp;\u0026lt;\u0026thinsp;40) using simulated sequencing data. These results underscored the importance of performing genomic typing on harmonized, high-quality data with a sufficient sequencing depth to investigate outbreaks.\u003c/p\u003e \u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eConfirmation of reproducibility and repeatability when sequencing real data\u003c/b\u003e \u003c/p\u003e \u003cp\u003eTo confirm the poor repeatability and reproducibility of cgMLST results obtained using simulated sequencing data and evaluate the impact on real data, we analyzed biological replicates of bacterial strains from six species. The cgMLST profiles were computed for each biological replicate to evaluate reproducibility, and bioinformatics analyses were performed in triplicate to investigate repeatability.\u003c/p\u003e \u003cp\u003eThe cgMLST profiles obtained using real data showed that the results were repeatable between analyses, as also observed with simulated sequencing. Indeed, the cgMLST profiles resulting from SPAdes and Unicycler assemblies were comparable between each replicate, indicating 100% repeatability, as no distance was observed between assemblies obtained from the same raw data (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e). However, poor reproducibility was observed between the biological replicates, with distances observed between the same strains for which raw data were provided from two independent extractions. 
This finding suggests that the wet lab part has a major impact on cgMLST profiles, despite using the same DNA extraction protocol for \u003cem\u003eSalmonella enterica\u003c/em\u003e, \u003cem\u003eStaphylococcus aureus\u003c/em\u003e, and \u003cem\u003eXylella fastidiosa\u003c/em\u003e. Indeed, only four out of 28 strains had identical profile results with Unicycler. With Shovill, repeatability seemed to be dependent on the species. For instance, for \u003cem\u003eListeria monocytogenes\u003c/em\u003e all analyses were 100% identical, whereas for \u003cem\u003eStaphylococcus aureus, Vibrio parahaemolyticus\u003c/em\u003e, and \u003cem\u003eXylella fastidiosa\u003c/em\u003e the strains had different cgMLST profiles resulting from distinct assemblies. For \u003cem\u003eSalmonella enterica\u003c/em\u003e and \u003cem\u003eBacillus thuringiensis\u003c/em\u003e, one and two strains, respectively, gave different cgMLST profiles between analyses, but only one gene was systematically affected.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe cgMLST profiles for biological replicates were found to be identical for eight out of 28 analyzed strains (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e). These eight strains belong to \u003cem\u003eBacillus thuringiensis\u003c/em\u003e (two out of five strains), \u003cem\u003eListeria monocytogenes\u003c/em\u003e (four out of five strains), \u003cem\u003eVibrio parahaemolyticus\u003c/em\u003e (one out of four strains), and \u003cem\u003eSalmonella enterica\u003c/em\u003e (one out of five strains). This level of reproducibility was mainly observed for the results generated by SPAdes and Unicycler, although only the Unicycler results maximized the completeness of the cgMLST scheme, i.e., more genes in the cgMLST scheme were found after Unicycler assembly. 
Conversely, with Shovill, only five strains had the same cgMLST profiles for biological replicates (one \u003cem\u003eBacillus thuringiensis\u003c/em\u003e, and four \u003cem\u003eListeria monocytogenes\u003c/em\u003e), and only four strains gave profiles that were identical to the Unicycler results (one \u003cem\u003eBacillus thuringiensis\u003c/em\u003e and three \u003cem\u003eListeria monocytogenes\u003c/em\u003e).\u003c/p\u003e \u003cp\u003eThe number of allelic differences between biological replicates was found to be elevated (22 allelic differences between two \u003cem\u003eListeria monocytogenes\u003c/em\u003e replicates or 184 between two \u003cem\u003eStaphylococcus aureus\u003c/em\u003e replicates), suggesting potential ambiguity for closely-related strains (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e). Depending on the species and assembly tools used, the number of allelic differences between biological replicates varied significantly, ranging from 10 allelic differences for \u003cem\u003eBacillus thuringiensis\u003c/em\u003e, to 138 for \u003cem\u003eSalmonella enterica\u003c/em\u003e with Unicycler. Results obtained for two closely-related strains of \u003cem\u003eXylella fastidiosa\u003c/em\u003e subsp. \u003cem\u003emultiplex\u003c/em\u003e, both belonging to ST6 based on the MLST of seven housekeeping genes (Amandine Cunty, personal communication), were mixed for cgMLST results, whereas they were found to be distinguishable in SNP analyses (data not shown). These results suggested that for outbreak investigations using this method, it may be challenging to discriminate the strain responsible for the outbreak and consequently determine its source.\u003c/p\u003e "},{"header":"Discussion","content":" \u003cp\u003ecgMLST typing is one of the most widely used genomic methods for surveillance of bacterial pathogens. Our study aimed to investigate how the assembly step influences cgMLST profiles. 
Our results indicated that assembly-based cgMLST analyses, considering the entire scheme, may vary depending on the assembly method used. This represents a significant limitation for the gene-by-gene approach in interoperable systems, which aggregate data from various analytical pipelines. However, the observed differences, often referred to as false negatives, primarily involve genes that are missing rather than allelic differences potentially resulting in different allelic combinations.\u003c/p\u003e \u003cp\u003eThe results obtained in this study highlight an impact of assembly on cgMLST profiles that is greater for particular bacterial species. Indeed, genomic composition may influence assembly quality, leading to possible contig fragmentation within a cgMLST gene. Repeat sequences such as insertion sequences (IS) or VNTRs can influence assembly quality, among other factors. A previous study demonstrated that the number of contigs obtained after assembly was correlated with the number of repeat elements in genomes (\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e). The variability in GC content can also lead to non-reproducible analyses (\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e) due to biases introduced during sequencing, which alter sequencing depth in these regions (\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e). Moreover, increased variability in a genome leads to a higher degree of bias observed during sequencing. This bias affects all assembly methods using short reads, since the corresponding tools are not capable of effectively handling inconsistent sequencing depths. 
Although Unicycler showed better performance in reducing misassemblies than SPAdes (\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e) and Shovill, all three tools produced similar results in terms of genome contig fragmentation.\u003c/p\u003e \u003cp\u003eThe ability of a pathogen to capture external DNA by homologous recombination can directly impact GC content in recombination hotspots (\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e). Thus, the difficulty in assembling genomes could be more pronounced for bacterial species with more frequent homologous recombination. Our results revealed two distinct groups with less than or more than 5% of allelic differences, respectively. Group 1, for which an allelic variation lower than 5% was described, included \u003cem\u003eListeria monocytogenes, Staphylococcus aureus\u003c/em\u003e, and \u003cem\u003eBrucella melitensis\u003c/em\u003e, among others. For these species, mutations were identified as the primary evolutionary force responsible for polymorphism (\u003cspan additionalcitationids=\"CR34\" citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e). In contrast, within the second group\u0026mdash;exemplified by \u003cem\u003eXylella fastidiosa\u003c/em\u003e and \u003cem\u003eSalmonella enterica\u003c/em\u003e\u0026mdash;strains had cgMLST results that were significantly different from those of the reference, indicating that recombination was the main evolutionary force (\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e, \u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eIn addition to intrinsic genomic composition, our results showed that sequencing quality affected cgMLST-typing. 
A recent study conducted with four food pathogens (\u003cem\u003eCampylobacter spp.\u003c/em\u003e, \u003cem\u003eListeria monocytogenes\u003c/em\u003e, \u003cem\u003eSalmonella enterica\u003c/em\u003e, and \u003cem\u003eEscherichia coli\u003c/em\u003e) demonstrated variability induced by the wet lab part of WGS analyses (\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e). In our study, we observed that bioinformatics analyses could also introduce variability in results. In a previous study based on read simulations, the authors proposed a depth threshold at 50x based on analyses carried out on food pathogens \u003cem\u003eEscherichia coli\u003c/em\u003e, \u003cem\u003eListeria monocytogenes\u003c/em\u003e, and \u003cem\u003eSalmonella enterica\u003c/em\u003e [38]. It should be noted that the analyses were conducted on a single strain per species, using a single tool (SPAdes) to compare typing results. However, by increasing the number of strains and the diversity of species investigated, our results showed that the quality of assembly obtained from 50x affected the typing result, and this bias decreased with depths equal to or greater than 75x. In the global monitoring systems, the diversity analyzed is even greater, and it is essential to evaluate these criteria for several distinct genomes per species. For this reason, we extended the study to 27 pathogens and included several genomes per species, allowing us to evaluate both the intra- and interspecies variability. This is why we proposed a minimum depth threshold of 75x for all pathogens.\u003c/p\u003e \u003cp\u003eOur results also showed that wet lab and bioinformatic variabilities can artificially increase the distance between related strains and thus impact outbreak investigations, potentially resulting in false negatives with unrelated strains. Indeed, when analyzing an epidemiological cluster, it is crucial to identify both the strains within the cluster and those excluded. 
This is based on a computation of allelic distance between strains (i.e., the number of differences between two profiles). Below a specific threshold, strains are considered related (\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e, \u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e). Thresholds for cgMLST clustering have been proposed for several bacterial species, including \u003cem\u003eListeria monocytogenes\u003c/em\u003e (\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e), \u003cem\u003eEscherichia coli\u003c/em\u003e (\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e), \u003cem\u003eStaphylococcus aureus\u003c/em\u003e (\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e), and \u003cem\u003ePseudomonas aeruginosa\u003c/em\u003e (\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e), and several methods to estimate them have been developed based on modeling (\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e) or nonparametric statistics (\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e). However, in monitoring systems, such as Chewie-NS or GenoSalmSurv, the thresholds are applied exclusively to allelic differences, with the number of undiscovered loci frequently not taken into consideration. Yet, as we have shown in this study, the genome quality can highly affect the completeness of cgMLST results (i.e., the number of genes that are found during analysis). This parameter increases the weight for allelic differences. For example, the established threshold for \u003cem\u003eStaphylococcus aureus\u003c/em\u003e is 24 different alleles to define a cluster of related strains [42], with a complete cgMLST scheme comprising 1861 genes. However, our results were obtained using only 1005 genes. 
So, scaling the threshold in proportion to the reduction in the scheme\u0026rsquo;s completeness (24 \u0026times; 1005/1861 \u0026asymp; 13), the threshold should be reduced to 13 different alleles for this specific clustering analysis.\u003c/p\u003e \u003cp\u003eConsequently, for outbreak investigations, it may be beneficial to include the value of scheme completeness (as defined by Palma et al. (2022)), and to propose quality criteria that maximize this value in monitoring systems. Other parameters, such as homologous recombination and GC content, could be taken into account by a gene-by-gene approach to scheme definition, as GC bias could lead to major genome fragmentation in assembly analyses. However, these proposals should be balanced against the need to consider some of the evolutionary history of outbreaks, given that GC content and recombination reflect horizontal gene transfers (HGTs). Such transfers are very important for the evolution of virulence among bacteria, as shown for \u003cem\u003eYersinia enterocolitica\u003c/em\u003e (\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e). As recently proposed by Duval et al., these thresholds should not be defined by species but rather either by outbreak, taking into account evolutionary parameters (such as mutation, duration, etc.) specific to outbreaks (\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e), or by specific lineages that could have a specific evolutionary mechanism (such as being highly clonal) compared with other lineages. Furthermore, the development of assembly-free methods, such as SNP approaches at the pangenome level, could facilitate outbreak investigations using the pangenome graph method.\u003c/p\u003e "},{"header":"Conclusion","content":" \u003cp\u003eOur study assessed the bioinformatic variability induced in bacterial typing analyses using the cgMLST method. 
By including foodborne and clinical pathogens, and using simulated and real data, our findings led us to propose new practices when implementing this method in surveillance systems, such as integrating the notion of completeness for outbreak investigation, and establishing minimum quality criteria for sequencing.\u003c/p\u003e \u003c/div\u003e"},{"header":"Declarations","content":" \u003cp\u003e \u003cstrong\u003eCompeting Interests\u003c/strong\u003e \u003cp\u003eThe authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.\u003c/p\u003e \u003c/p\u003e\u003ch2\u003eFunding\u003c/h2\u003e \u003cp\u003eThis work was supported by the SPAAD unit\u0026rsquo;s internal resources.\u003c/p\u003e\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eV.C. conceived and designed the experiments. D. M. designed the analytical strategy and performed analyses. M.V.N. participated in analytical strategy and revised the paper. M.B., A.L.B., T. B., M.C., A.C., A.R., M.S., N.V., and C.Y. collected the samples and extracted the DNA for whole genome sequencing. D.M. and V.C. wrote and revised the paper. All the authors read and approved the final manuscript.\u003c/p\u003e\u003ch2\u003eAcknowledgement\u003c/h2\u003e\u003cp\u003eWe thank Laurent Guillier for discussion about analyses, and we thank Delphine Libby-Claybrough for English editing.\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eSequence data that support the findings of this study have been deposited in the NCBI with the primary accession code PRJNA1129992.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eOude Munnink BB, Sikkema RS, Nieuwenhuijse DF, Molenaar RJ, Munger E, Molenkamp R, et al. Transmission of SARS-CoV-2 on mink farms between humans and mink and back to humans. Sci 8 janv. 
2021;371(6525):172\u0026ndash;7.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChakraborty T, Barbuddhe S. Enabling One Health solutions through genomics. Indian J Med Res. 2021;153(3):273.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWheeler NE, Price V, Cunningham-Oakes E, Tsang KK, Nunn JG, Midega JT, et al. Innovations in genomic antimicrobial resistance surveillance. Lancet Microbe 1 d\u0026eacute;c. 2023;4(12):e1063\u0026ndash;70.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTimme RE, Wolfgang WJ, Balkey M, Venkata SLG, Randolph R, Allard M, et al. Optimizing open data to support one health: best practices to ensure interoperability of genomic data from bacterial pathogens. One Health Outlook. 2020;2(1):20.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGerner-Smidt P, Hise K, Kincaid J, Hunter S, Rolando S, Hyyti\u0026auml;-Trees E, et al. PulseNet USA: A Five-Year Update. Foodborne Pathog Dis mars. 2006;3(1):9\u0026ndash;19.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMaiden MCJ, Bygraves JA, Fell E, Morelli G, Russel JE, Urwin R, et al. Multilocus Sequence typing: a portable approach to the identification of clones within populations of pathogenic microorganisms. Proc Natl Acad Sci U S A. 1998;95:3140\u0026ndash;5.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eScharff RL, Besser J, Sharp DJ, Jones TF, Peter GS, Hedberg CW. An Economic Evaluation of PulseNet: A Network for Foodborne Disease Surveillance. Am J Prev Med mai. 2016;50(5 Suppl 1):S66\u0026ndash;73.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eUelze L, Becker N, Borowiak M, Busch U, Dangel A, Deneke C, et al. Toward an Integrated Genome-Based Surveillance of Salmonella enterica in Germany. Front Microbiol [Internet]. 2021. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.3389/fmicb.2021.626941\u003c/span\u003e\u003cspan address=\"10.3389/fmicb.2021.626941\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.frontiersin.org/journals/microbiology/articles/\u003c/span\u003e\u003cspan address=\"https://www.frontiersin.org/journals/microbiology/articles/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. 12. Disponible sur.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eM\u0026auml;esaar M, Mamede R, Elias T, Roasto M. Retrospective Use of Whole-Genome Sequencing Expands the Multicountry Outbreak Cluster of Listeria monocytogenes ST1247. Int J Genomics 1 avr. 2021;2021:1\u0026ndash;5.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMoura A, Tourdjman M, Leclercq A, Hamelin E, Laurent E, Fredriksen N, et al. Real-Time Whole-Genome Sequencing for Surveillance of Listeria monocytogenes, France. Emerg Infect Dis sept. 2017;23(9):1462\u0026ndash;70.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eEFSA. Multi-country outbreak of monophasic Salmonella Typhimurium sequence type 34 linked to chocolate products \u0026ndash; first update \u0026ndash; 18 May 2022. EFSA Support Publ juin 2022;19(6).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBlanc DS, Magalh\u0026atilde;es B, Koenig I, Senn L, Grandbastien B. Comparison of Whole Genome (wg-) and Core Genome (cg-) MLST (BioNumericsTM) Versus SNP Variant Calling for Epidemiological Investigation of Pseudomonas aeruginosa. Front Microbiol 22 juill 2020;11.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDidelot X, Bowden R, Street T, Golubchik T, Spencer C, McVean G, et al. Recombination and population structure in Salmonella enterica. PLoS Genet juill. 
2011;7(7):e1002191.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVanhove M, Retchless AC, Sicard A, Rieux A, Coletta-Filho HD, De La Fuente L et al. Genomic Diversity and Recombination among Xylella fastidiosa Subspecies. Appl Environ Microbiol juill 2019;85(13).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMamede R, Vila-Cerqueira P, Silva M, Carri\u0026ccedil;o JA, Ramirez M. Chewie Nomenclature Server (chewie-NS): a deployable nomenclature server for easy sharing of core and whole genome MLST schemas. Nucleic Acids Res 8 janv. 2021;49(D1):D660\u0026ndash;6.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFeijao P, Yao HT, Fornika D, Gardy J, Hsiao W, Chauve C et al. MentaLiST \u0026ndash; A fast MLST caller for large MLST schemes. Microb Genomics 1 f\u0026eacute;vr 2018;4(2).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSilva M, Machado MP, Silva DN, Rossi M, Moran-Gilad J, Santos S et al. chewBBACA: A complete suite for gene-by-gene schema creation and strain identification. Microb Genomics 1 mars 2018;4(3).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLlarena AK, Ribeiro-Gon\u0026ccedil;alves BF, Nuno Silva D, Halkilahti J, Machado MP, Da Silva MS, et al. INNUENDO: A cross-sectoral platform for the integration of genomics in the surveillance of food-borne pathogens. EFSA Support Publ. 2018;15(11):1498E.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCosta G, Di Piazza G, Koevoets P, Iacono G, Liebana E, Pasinato L et al. Guidelines for reporting Whole Genome Sequencing-based typing data through the EFSA One Health WGS System. EFSA Support Publ juin 2022;19(6).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. J Comput Biol mai. 
2012;19(5):455\u0026ndash;77.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSeemann T. Shovill: faster SPAdes assembly of Illumina reads. 2017.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWick RR, Judd LM, Gorrie CL, Holt KE, Unicycler. Resolving bacterial genome assemblies from short and long sequencing reads. PLOS Comput Biol 8 juin. 2017;13(6):e1005595.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen YC, Liu T, Yu CH, Chiang TY, Hwang CC. Effects of GC Bias in Next-Generation-Sequencing Data on De Novo Genome Assembly. PLoS ONE. 29 avr. 2013;8(4):e62856.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHuang W, Li L, Myers JR, Marth GT. ART: a next-generation sequencing read simulator. Bioinf 15 f\u0026eacute;vr. 2012;28(4):593\u0026ndash;4.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinf 1 sept. 2018;34(17):i884\u0026ndash;90.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinf 15 avr. 2013;29(8):1072\u0026ndash;5.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWaskom M. seaborn: statistical data visualization. J Open Source Softw 6 avr. 2021;6(60):3021.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKrzywinski M, Schein J, Birol I, Connors J, Gascoyne R, Horsman D, et al. Circos: an information esthetic for comparative genomics. Genome Res. 2009;19(604):1639\u0026ndash;45.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhou Z, Alikhan NF, Sergeant MJ, Luhmann N, Vaz C, Francisco AP, et al. GrapeTree: visualization of core genomic relationships among 100,000 bacterial pathogens. Genome Res sept. 
2018;28(9):1395\u0026ndash;404.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAcu\u0026ntilde;a-Amador L, Primot A, Cadieu E, Roulet A, Barloy-Hubler F. Genomic repeats, misassembly and reannotation: a case study with long-read resequencing of Porphyromonas gingivalis reference strains. BMC Genomics. 16 d\u0026eacute;c. 2018;19(1):54.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMavromatis K, Land ML, Brettin TS, Quest DJ, Copeland A, Clum A, et al. The Fast Changing Landscape of Sequencing Technologies and Their Impact on Microbial Genome Assemblies and Annotation. PLoS ONE 12 d\u0026eacute;c. 2012;7(12):e48837.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLassalle F, P\u0026eacute;rian S, Bataillon T, Nesme X, Duret L, Daubin V. GC-Content Evolution in Bacterial Genomes: The Biased Gene Conversion Hypothesis Expands. PLOS Genet 6 f\u0026eacute;vr. 2015;11(2):e1004941.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eden Bakker HC, Didelot X, Fortes ED, Nightingale K, Wiedmann M. Lineage specific recombination rates and microevolution in Listeria monocytogenes. BMC Evol Biol. 2008;8(1):277.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFraser C, Hanage WP, Spratt BG. Neutral microepidemic evolution of bacterial pathogens. Proc Natl Acad Sci 8 f\u0026eacute;vr. 2005;102(6):1968\u0026ndash;73.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVishnu US, Sankarasubramanian J, Sridhar J, Gunasekaran P, Rajendhran J. Identification of Recombination and Positively Selected Genes in Brucella. Indian J Microbiol 29 d\u0026eacute;c. 2015;55(4):384\u0026ndash;91.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePark CJ, Andam CP. Distinct but Intertwined Evolutionary Histories of Multiple Salmonella enterica Subspecies. mSystems. 11 f\u0026eacute;vr. 
2020;5(1).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eForth LF, Brinks E, Denay G, Fawzy A, Fiedler S, Fuchs J et al. Impact of wet-lab protocols on quality of whole-genome short-read sequences from foodborne microbial pathogens. Front Microbiol 29 nov 2023;14.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDuval A, Opatowski L, Brisse S. Defining genomic epidemiology thresholds for common-source bacterial outbreaks: a modelling study. Lancet Microbe mai. 2023;4(5):e349\u0026ndash;57.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRadomski N, Cadel-Six S, Cherchame E, Felten A, Barbet P, Palma F et al. A Simple and Robust Statistical Method to Define Genetic Relatedness of Samples Related to Outbreaks at the Genomic Scale \u0026ndash; Application to Retrospective Salmonella Foodborne Outbreak Investigations. Front Microbiol. 24 oct 2019;10.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSch\u0026uuml;rch AC, Arredondo-Alonso S, Willems RJL, Goering RV. Whole genome sequencing options for bacterial strain typing and epidemiologic analysis based on single nucleotide polymorphism versus gene-by-gene\u0026ndash;based approaches. Clin Microbiol Infect avr. 2018;24(4):350\u0026ndash;4.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLagos AC, Sundqvist M, Dyrkell F, Stegger M, S\u0026ouml;derquist B, M\u0026ouml;lling P. Evaluation of within-host evolution of methicillin-resistant Staphylococcus aureus (MRSA) by comparing cgMLST and SNP analysis approaches. Sci Rep 22 juin. 2022;12(1):10541.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMartak D, Meunier A, Sauget M, Cholley P, Thouverez M, Bertrand X, et al. Comparison of pulsed-field gel electrophoresis and whole-genome-sequencing-based typing confirms the accuracy of pulsed-field gel electrophoresis for the investigation of local Pseudomonas aeruginosa outbreaks. J Hosp Infect ao\u0026ucirc;t. 
2020;105(4):643\u0026ndash;7.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKarlsson PA, Tano E, Jernberg C, Hickman RA, Guy L, J\u0026auml;rhult JD, et al. Molecular Characterization of Multidrug-Resistant Yersinia enterocolitica From Foodborne Outbreaks in Sweden. Front Microbiol. 2021;12:664665.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"bmc-genomics","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"gics","sideBox":"Learn more about [BMC Genomics](http://bmcgenomics.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/gics","title":"BMC Genomics","twitterHandle":"#BMCGenomics","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-4692225/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4692225/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e \u003cp\u003eIn the context of pathogen surveillance, it is crucial to ensure interoperability and harmonized data. Several surveillance systems are designed to compare bacteria and identify outbreak clusters based on core genome MultiLocus Sequence Typing (cgMLST). 
Among the different approaches available to generate bacterial cgMLST, our research used an assembly-based approach (chewBBACA tool).\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e \u003cp\u003eSimulations of short-read sequencing were conducted for 5 genomes of 27 pathogens of interest in animal, plant, and human health to evaluate the repeatability and reproducibility of cgMLST. Various quality parameters, such as read quality and depth of sequencing, were applied, and several read simulations and genome assemblies were repeated using three tools: SPAdes, Unicycler, and Shovill. In vitro sequencing was also used to evaluate the impact of assembly on cgMLST results, for 6 bacterial species: \u003cem\u003eBacillus thuringiensis, Listeria monocytogenes\u003c/em\u003e, \u003cem\u003eSalmonella enterica\u003c/em\u003e, \u003cem\u003eStaphylococcus aureus\u003c/em\u003e, and \u003cem\u003eVibrio parahaemolyticus\u003c/em\u003e.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eThe results highlighted variability in cgMLST, which appears to be unrelated to the assembly tools and instead induced by the intrinsic composition of the genomes themselves. This variability observed in simulated sequencing was further validated with real data for five of the bacterial pathogens studied.\u003c/p\u003e\u003ch2\u003eConclusion\u003c/h2\u003e \u003cp\u003eThis highlights that the intrinsic genome composition affects assembly and the resulting cgMLST profiles, and that variability in bioinformatics tools can induce a bias in cgMLST profiles. 
In conclusion, we propose that the completeness of cgMLST schemes should be considered when clustering strains.\u003c/p\u003e","manuscriptTitle":"Unraveling the Impact of Genome Assembly on Bacterial Typing: A One Health Perspective","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-07-30 19:28:41","doi":"10.21203/rs.3.rs-4692225/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2024-07-08T19:40:17+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2024-07-06T03:33:53+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2024-07-06T03:33:36+00:00","index":"","fulltext":""},{"type":"submitted","content":"BMC Genomics","date":"2024-07-05T12:21:20+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"bmc-genomics","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"gics","sideBox":"Learn more about [BMC Genomics](http://bmcgenomics.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/gics","title":"BMC Genomics","twitterHandle":"#BMCGenomics","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"1ef69107-3082-497f-9edd-5cf418aca22d","owner":[],"postedDate":"July 30th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[],"tags":[],"updatedAt":"2024-11-11T16:05:01+00:00","versionOfRecord":{"articleIdentity":"rs-4692225","link":"https://doi.org/10.1186/s12864-024-10982-z","journal":{"identity":"bmc-genomics","isVorOnly":false,"title":"BMC Genomics"},"publishedOn":"2024-11-08 15:57:28","publishedOnDateReadable":"November 8th, 2024"},"versionCreatedAt":"2024-07-30 
19:28:41","video":"","vorDoi":"10.1186/s12864-024-10982-z","vorDoiUrl":"https://doi.org/10.1186/s12864-024-10982-z","workflowStages":[]},"version":"v1","identity":"rs-4692225","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4692225","identity":"rs-4692225","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

