{"paper_id":"8cdd0cb3-e770-4636-9443-bfbdd34af678","body_text":"Authors: Dorottya Nagy1,2, Valentina Pennetta1, Gillian Rodger3, Katie Hopkins1, Christopher R. Jones2, The NEKSUS Consortium†, Susan Hopkins2,4, Derrick Crook1, A. Sarah Walker1,4,5, Julie Robotham2,4, Katie L. Hopkins2,4,5, Alice Ledda2,4, David Williams2, Russell Hope2,4, Colin S. Brown2,4, Nicole Stoesser1,4,6,7*, Samuel Lipworth1,4,7*\n† Group authorship and affiliations listed in acknowledgements\n*Joint senior authors\nAffiliations:\n1. Modernising Medical Microbiology Unit, Nuffield Department of Medicine, University of Oxford, Oxford, UK\n2. Antimicrobial Resistance (AMR) and Healthcare Associated Infections (HCAI) Division, Chief Medical Advisor’s Group, UKHSA, UK\n3. Diagnostic Accelerator, Diagnostics and Pathogen Characterisation, UKHSA Porton Down, UK\n4. NIHR Health Protection Research Unit in Healthcare-associated Infection and Antimicrobial Resistance, Nuffield Department of Medicine, University of Oxford, Oxford, UK\n5. Antimicrobial Resistance and Healthcare Associated Infections (AMRHAI) Reference Unit, Public Health Microbiology - Reference Microbiology Division, Chief Scientific Officer’s Group, UKHSA, UK\n6. NIHR Oxford Biomedical Research Centre, Oxford, UK\n7. Oxford University Hospitals NHS Foundation Trust, Oxford, UK\n\nCorresponding author: Dorottya Nagy (dorottya.nagy@ndm.ox.ac.uk)\nKeywords: Bacterial genomics, Escherichia coli, Klebsiella spp., long-read sequencing, genome assembly\nRepositories: Long and short-read sequencing data has been deposited in ENA (BioProject accession: PRJEB93885). Code used for bioinformatic and statistical analyses has been uploaded to GitHub (https://github.com/oxfordmmm/NEKSUS_ont_hybrid_assembly_comparison). 
Summary data files have been uploaded to FigShare (https://figshare.com/account/home#/projects/253775).\n\nNanopore long-read only genome assembly of clinical Enterobacterales isolates is complete and accurate\n\nbioRxiv preprint doi: https://doi.org/10.1101/2025.09.15.676237; this version posted September 17, 2025. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.\n\nAbstract\nWhole bacterial genome sequence reconstruction using Oxford Nanopore Technologies (“Nanopore”) long-read only sequencing may, given recent improvements in Nanopore sequencing accuracy, offer a lower-cost, higher-throughput alternative to ‘hybrid’ assembly for pathogen surveillance. We evaluated the accuracy, including plasmid reconstruction, of Nanopore long-read only genome assemblies of Enterobacterales.\nWe sequenced 92 genomes from clinical Enterobacterales isolates, collected in England under a national surveillance program, with long-read Nanopore (R10.4.1, Dorado v5.0.0 super-high-accuracy basecalled) and short-read Illumina (NovaSeq) sequencing approaches. Genomes were assembled using three long-read only assemblers (Flye; Hybracter long; Autocycler) and three hybrid assemblers (Hybracter hybrid; Unicycler normal; Unicycler bold). Three polishing modalities (Medaka v2 with subsampled or un-subsampled long-reads; Polypolish + Pypolca with short-reads) were investigated.\nAutocycler circularised the most chromosomes (87/92 [95%]). Plasmid sequence reconstruction was comparable between all assemblers except Flye, all recovering 90-96% of plasmids, although the ‘ground truth’ was uncertain. Flye performed worse than other assemblers on almost all metrics. 
Autocycler + Medaka (un-subsampled long-reads) was the most accurate long-read only assembler/polisher combination, comparable to hybrid assemblies (median 0 [IQR: 0-0] SNPs and 0 [IQR: 0-1] indels per genome; quality value/Q score, 100 [IQR: 64-100]), with only 4/92 genome sequences having >10 SNPs/indels. Medaka polishing with un-subsampled long-reads resulted in small improvements in indels but not SNPs for both Flye and Autocycler assemblies. Seven-locus MLST, antimicrobial resistance, virulence, and stress gene annotation was equivalent across assembler/polisher combinations.\nNanopore long-read only bacterial genome assembly with Autocycler combined with Medaka polishing (using un-subsampled reads) is similarly accurate to, and possibly more complete than, hybrid assembly, representing a viable alternative for incorporating high-quality genomic data, including plasmids, into Enterobacterales surveillance.\n\nData Summary\nNanopore long-reads and Illumina short-reads from the 92 Enterobacterales isolates in this study have been uploaded to ENA (BioProject accession: PRJEB93885). Code for the Nextflow assembly pipeline, downstream analysis scripts, and R statistical analysis scripts is available on GitHub (https://github.com/oxfordmmm/NEKSUS_ont_hybrid_assembly_comparison). 
The following supplementary data tables are available on FigShare (https://figshare.com/account/home#/projects/253775):\n• ENA Sample accessions and sample metadata (accessions_and_metadata.csv)\n• Seqkit stats summaries of the Illumina and Nanopore reads (raw_qc_sup.cav)\n• Summary of assembly contig features (contigs_summary_sup_cleaned.csv)\n• Pairwise mash distances between contigs (mash_cleaned.csv)\n• Plasmids matching across different assemblers compared to the Hybracter (hybrid) and manually-curated reference sets (plasmids_match_hybracter_mash.csv; plasmids_match_manual_mash.csv, respectively)\n• Seven-locus multi-locus sequence type annotation (mlst_cleaned.csv)\n• CheckM2 summaries of assemblies (checkm2_cleaned.csv)\n• Nucleotide-level accuracy of assemblies (SNPs, indels, and quality value compared to short-read mapping; assembly_nucleotide_accuracy_cleaned.csv)\n• Bakta annotation (bakta_by_contig_cleaned.csv)\n• AMRFinderPlus annotations of contigs (amrfinder_plus_cleaned.csv)\n• MOB-suite annotation summaries of contigs (mobsuite_cleaned.csv)\n\nImpact Statement\nNanopore long-reads have historically been too error-prone to use alone for accurate bacterial genome assembly, necessitating additional Illumina short-reads to achieve structurally complete and accurate ‘hybrid’ genome assemblies for public health surveillance. This increases cost and complexity. Previous studies have shown that recent improvements in Nanopore chemistry (R10.4.1 flowcell) and basecalling (super-high accuracy) allow high-quality long-read only assemblies on a small number of laboratory reference strains. This is the first evaluation, to our knowledge, of Nanopore long-read only genome assembly compared with hybrid assembly on a large number of clinical isolates. 
In addition, this is the first large-scale evaluation of the recently released automated consensus long-read assembly tool, Autocycler.\nWe show that Autocycler long-read only assemblies are more structurally complete for chromosomal sequences, while reconstructing a similar number of plasmids to other long-read and hybrid assemblers. Most long-read polished, Autocycler-assembled genome sequences have 0 errors (median: 0 SNPs/indels) relative to short-read polished (hybrid) Autocycler assemblies, enabling accurate annotation of key genes.\n\nIntroduction\nHybrid assembly combining short- and long-read genomic sequencing is widely used in research to assemble complete and accurate bacterial genome sequences. Incremental improvements in Nanopore flowcells/chemistry (R10.4.1 flowcell/kit 14) and basecalling accuracy (Dorado v5.0.0 super-high accuracy DNA model)(1-5) have been shown in small-scale evaluations to facilitate long-read only assemblies that may now be comparable in accuracy to hybrid assembly(6, 7). Nanopore-only sequencing may also offer advantages over hybrid sequencing, including cost effectiveness, real-time data generation and decentralised implementation(8, 9).\nHighly accurate bacterial genome reconstruction, with minimal noise from sequencing artefact, is key for identifying closely-related clusters of isolates and plasmids for outbreak detection(10). 
Accurate reconstruction of mobile genetic elements (MGEs) such as plasmids, in particular, is clinically and epidemiologically important as plasmids are common transmission vectors for antimicrobial resistance (AMR) genes in clinically-relevant Enterobacterales(11, 12). Long-read or hybrid assembly approaches can facilitate plasmid sequence reconstruction and therefore analysis of AMR gene epidemiology compared to short-reads, which may not be able to resolve highly repetitive sequences often associated with MGEs(13, 14). Nevertheless, Nanopore-only genome assembly accuracy has only been validated for a small number of reference bacterial isolates(15, 16), and has not yet been assessed on a large collection of clinical isolates, including for plasmids as well as chromosomes. This may be important because of the reliance of long-read basecalling models on training datasets of unknown size and diversity, whose performance may therefore generalise poorly to clades not included in these training datasets. Similarly, although best-practice assembly guidelines have been proposed(6, 17, 18), multiple long-read assembly pipelines implement these guidelines with slight variations(16, 19-22), and no robust consensus exists, particularly regarding the optimal strategy for plasmid assembly.\nIn this study, we comprehensively evaluated the completeness and accuracy of 92 Nanopore long-read only assemblies (with and without polishing) compared to hybrid assembly in reconstructing both chromosomes and plasmids using isolates collected in The National Escherichia coli and KlebSiella spp. bloodstream infection (BSI) and Carbapenemase-producing Enterobacterales (CPE) UK Surveillance (NEKSUS) study.  
\n\nMethods\nIsolate collection\nNine English NHS Trusts (groups of hospitals under the same administration), representing the largest in terms of number of emergency admissions across all seven NHS England regions, were recruited to the NEKSUS consortium. Consecutive, unselected BSI and CPE-positive rectal screening isolates were collected between October 2023 and March 2024 as part of routine clinical practice. A convenience sample of the first 96 Enterobacterales isolates collected and sequenced, mostly E. coli and Klebsiella spp. (Table S1) from three regions, was included in this analysis, as isolates were sequenced in batches of 96. Isolates were stored in brain-heart infusion (BHI) broth with 10% glycerol at -70°C, then grown on blood agar for 24h at 37°C, following which a colony sweep of the pure bacterial culture was suspended in 1 ml phosphate-buffered saline, pelleted, and cold-packed. Where there was insufficient growth after 24h, bacteria were subcultured for a further 24h at 37°C.\nDNA extraction and sequencing\nDNA extraction, library preparation and sequencing were conducted at GENEWIZ Germany GmbH (Leipzig, Germany). DNA was extracted using the MagMAX Microbiome Ultra Nucleic Acid Isolation Kit with bead plate (Life Technologies, Carlsbad, CA, USA). Genomic DNA was quantified using the Qubit 4.0 Fluorometer and qualified using the Agilent 5600 Fragment Analyzer. The same DNA extract was sequenced by both methods.  
\nFor Nanopore sequencing, the Rapid Barcoding Kit 96 V14 (Oxford Nanopore Technologies, Oxford, UK) was used according to the manufacturer's recommendations. Briefly, sequencing libraries were generated using a transposase, which simultaneously cleaves template molecules and attaches barcoded tags to the cleaved ends. The barcoded samples were then pooled (96-plexed) before solid phase reversible immobilisation (SPRI)-cleaning and addition of Rapid Adapters to the tagged ends. The library pools were loaded onto ONT PromethION flow cells (R10 [M Version]), with one 96-plex pool per flow cell, and sequenced on a PromethION P2 Solo for 72 hours according to the manufacturer's instructions.\nFor Illumina sequencing, the NEBNext Ultra II DNA Library Prep Kit for Illumina (New England Biolabs, Ipswich, MA, USA), including clustering and sequencing reagents, was used according to the manufacturer's recommendations. Briefly, the genomic DNA was fragmented by acoustic shearing with a Covaris LE220 instrument. Fragmented DNA was cleaned up and end repaired. Adapters were ligated after adenylation of the 3’ ends, followed by enrichment by limited cycle PCR. DNA libraries were validated using the Agilent TapeStation (Agilent Technologies, Palo Alto, CA, USA), and were quantified using a Qubit 4.0 Fluorometer. The libraries were multiplexed on a flowcell and loaded on the Illumina NovaSeq X Plus instrument according to the manufacturer's instructions. The samples were sequenced using a 2x150bp paired-end (PE) configuration. Raw sequencing data (.bcl files) generated from Illumina NovaSeq were converted into fastq files and de-multiplexed using Illumina's bcl2fastq(23) v2.20 software.\nBioinformatic analysis\nComputational analysis was performed on a virtual machine in the Oracle Cloud Infrastructure. 
POD5 files were basecalled and demultiplexed using Dorado(24) v5.0.0 (super high accuracy 5mCG, 5hmCG and 6mA methylation-aware simplex DNA model). All bioinformatic tools were run using default settings unless otherwise specified. Raw-read quality was evaluated with SeqKit(25) v2.9.0. Long-reads were randomly subsampled to 60x using the built-in subsampling and genome size estimation scripts from Autocycler(20) v0.2.1, and short-reads were randomly subsampled to 100x (50x for each paired-end read) with Rasusa(26) v2.1.0. Genome sequences were assembled using three long-read only assemblers (Flye(27) v2.9.5, Hybracter(19) (long) v0.11.2, the consensus assembler Autocycler(20) v0.2.1), and three hybrid assemblers (Hybracter(19) (hybrid), Unicycler(7) v0.5.1 (normal and bold modes)). The input long-read assemblies used for Autocycler were four assemblies each of Canu(28) v2.2, Flye(27), Raven(29) v1.8.3, Miniasm(30) v0.3, and Hybracter(19) (long) (which incorporates the plasmid assembly tool Plassembler(31)), where each of the four assemblies was derived from a randomly subsampled set of reads. The Flye and Hybracter (long) assemblies from the first subsampled read set were used in downstream analyses. Three polishing modalities were investigated: 1) long-read polishing with one round of Medaka(32) v2.0.1 using subsampled long-reads, 2) the same using un-subsampled long-reads, or 3) short-read polishing with Polypolish(33) v0.6.0 and Pypolca(17, 34) v0.3.1 (‘--careful’ flag; Fig. 1). 
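
As a rough illustration of the depth-based subsampling described above, the following sketch computes how many bases need to be retained to reach a target mean depth. It is a minimal sketch under stated assumptions, not the Autocycler or Rasusa implementation: the function names and the 5 Mb genome size are hypothetical and chosen only for illustration.

```python
def target_bases(genome_size_bp, depth):
    # Bases needed to cover the genome at the requested mean depth.
    return genome_size_bp * depth


def keep_fraction(total_bases, genome_size_bp, depth):
    # Share of sequenced bases to retain; capped at 1 when the read set
    # is already at or below the target depth (nothing to discard).
    if total_bases <= 0:
        raise ValueError('total_bases must be positive')
    return min(1.0, target_bases(genome_size_bp, depth) / total_bases)


# Hypothetical example: a 5 Mb genome sequenced to 200x, subsampled to 60x.
genome_size = 5_000_000
print(target_bases(genome_size, 60))
print(keep_fraction(200 * genome_size, genome_size, 60))
```

In practice the genome size would itself be an estimate (here produced by the Autocycler scripts), so the achieved depth is approximate.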
\nAssembly quality control\nQuality control of assemblies was done using SeqKit(25) stats and CheckM2(35-37) v1.0.2, excluding isolates where any assembly for that isolate had <99% completeness and/or >5% contamination. 4/96 (4%) isolates had >5% ‘contamination’ based on the CheckM2 output, likely corresponding to mixed isolate sequences (i.e. not pure cultures), and so were excluded from subsequent analyses. The remaining 92/96 (96%) pure-culture isolates passed the completeness threshold.\nAssembly annotation\nAssemblies from all 12 assembler/polisher combinations were annotated using Bakta(38) v1.10.4, 7-locus MLST (mlst(39) v2.23.0), AMRFinderPlus v4.0.3 (species flag inferred from Kraken2(40) v2.1.3) and MOB-suite(41, 42) v3.1.9 (mob_recon and mob_typer).\nChromosome evaluation\nAssemblies from the six different assemblers (without polishing) were evaluated for structural completeness of chromosomes and plasmids, as polishing is not expected to alter structure. Chromosomes were considered 'fully reconstructed' if the chromosomal contig was >4Mb and circularised.\nPlasmid evaluation\nContigs ≥1,000bp and ≤400,000bp in length were considered potential plasmids. Mash distances between all potential pairwise plasmid combinations were calculated using Mash(43, 44) v2.3 (k-mer size = 21, sketch size = 10,000,000).\nPlasmid reconstruction was assessed by comparing with two alternative ‘reference’ plasmid sets generated from the assembly data in this study, due to the absence of a ‘ground truth’ for these isolates. The first ‘reference’ plasmid set included all circular potential plasmids recovered by Hybracter (hybrid), which incorporates the plasmid assembly tool Plassembler(31), recommended in best-practice assembly guidance(6).  
The second ‘reference’ plasmid set was created using a manually-curated consensus approach considering all six assemblies for each isolate. This latter manually-curated reference set was constructed by matching each potential plasmid contig from the six assembly methods to its most similar contig from each other assembler based on mash distance, forming a network with all pairwise assembler combinations. The R package igraph(45, 46) v2.1.4 was used to extract connected components (sub-networks within each sample with at least one mash-distance connection between nodes). Each connected component was assigned a ‘match-set’ ID. Three (out of 303) match-sets (connected components) contained more than one contig per assembler, and were corrected manually (two were likely partial plasmids and one was likely a chimeric Unicycler (bold) plasmid that joined two separate plasmid match-sets together; data not shown). ‘True’ match-sets were retained in the manually-curated reference set where at least two assemblers’ contigs were present, circular, of similar length (±10%) and had a low mash distance (<0.025). The 0.025 mash distance threshold reflects the highest possible mash distance between draft and complete plasmid assemblies of the same plasmid from the original MOB-suite publication(42).\nPlasmid reconstruction for each assembler was then evaluated, for the Hybracter (hybrid) reference set, by matching potential plasmid contigs to each reference plasmid based on circularity (i.e. 
circular or linear), length (±10%), and mash distance (<0.025). Plasmids were ‘present’ if all three match criteria were met, or ‘misassembled’ if at least one of the criteria was not met. Plasmids were ‘absent’ if none of the criteria were met, if only the circularity matched (but not length or mash distance), or if no contig from an assembler could be matched to that set. For the manually-curated reference set, where no single reference plasmid was available, mash distance and length similarity criteria were fulfilled if an assembler’s plasmid matched more than half of the other plasmids in a match-set (see supplementary data file plasmids_match_manual_mash.csv).\nNucleotide-level accuracy\nNucleotide-level accuracy was assessed in a reference-free manner by aligning Illumina short-reads to the 12 assembler-polisher combinations using the Pypolca(17) in-built read aligner and variant caller (BWA(47) 0.7.18 and Freebayes(48) v1.3.6). Single nucleotide substitutions (SNPs), short insertions/deletions (indels) and quality value (QV) were extracted from the .vcf output file. QV, like Phred score, is a measure of accuracy where higher QV signifies a more accurate consensus (QV = -10 * log10(probability of error), where a 0-error probability takes the value of Q100). Mean gene length was extracted from CheckM2(37) as a further measure of accuracy. Errors may introduce premature stop codons and are thus expected to reduce the length of coding sequences(38).\nStatistical analyses and visualisations\nStatistical analysis and visualisation were done in R(49) v4.4.1 using ggplot2(50) v3.5.1 and other tidyverse(51) v1.3.1 functions, gridExtra(52) v2.3, cowplot(53) v1.1.3, psych(54) v2.5.6 and irr(55) v0.84.1 packages. 
Global tests for uneven proportions in categorical variables were done using the multiple-group Fleiss’ Kappa test and, for continuous variables, using a Friedman test to account for non-independence between different assemblers’ ‘observations’ on the same isolate. Pairwise tests between assemblers were done using McNemar’s χ² test with continuity correction for differences in proportions, and Wilcoxon signed-rank tests for differences in counts. A Bonferroni correction was applied to all pairwise testing to account for multiple testing. An exact binomial test was used to test whether the proportion of plasmids reconstructed, compared to the Hybracter (hybrid) reference set, differed significantly from 1. Clinker(56) v0.0.31 was used to visualise plasmid alignments using the Bakta(38) v1.10.4 annotated .gbff files.\n\nFigure 1: Schematic diagram of assembly, polishing and downstream analysis pipeline.  
\n[Figure 1 schematic; panel labels recovered from the image. Raw QC: seqkit stats on Nanopore long-reads and Illumina short-reads. QC and subsampling: long-reads subsampled x4 (Autocycler scripts) into subsampled long-read sets 01-04; short-reads subsampled x1 (Rasusa). Assembly: Canu, Flye, Raven, Miniasm and Hybracter (long), each run on read sets 01-04, feeding Autocycler; Hybracter (hybrid), Unicycler (normal) and Unicycler (bold). Polishing: Flye and Autocycler assemblies, each unpolished, + Medaka (subsampled), + Medaka (un-subsampled), or + Polypolish + Pypolca. Structure evaluation: chromosome; plasmids (MOB-suite, Mash). Accuracy evaluation: SNPs, indels, CheckM2, MLST, AMRFinder, Bakta.]\n\nResults\nRaw sequences\nHigh sequencing depth and quality was achieved for both Illumina short- and Nanopore long-reads\nOver 200x sequencing depth was achieved for both Illumina and Nanopore reads (Table S2). Median long-read length was 5814bp (IQR: 5366-6338), and median estimated Phred quality score was 16.6 (IQR: 16.4-16.8). 
Subsampling did not affect median read length or read quality (Table S2; Fig. S1).\nStructural completeness\nChromosome reconstruction was optimal using the consensus long-read only assembler, Autocycler\nAutocycler circularised the most chromosomal sequences, 95% (87/92), significantly more than Unicycler (80% [74/92], pairwise McNemar’s p=0.006), Unicycler (bold) (85% [78/92], p=0.039), Flye (85% [78/92], p=0.027) and Hybracter (hybrid) (86% [79/92], p=0.043), while there was no statistical evidence of a difference to Hybracter (long) (87% [80/92], p=0.070; Table 1; Fig. 2a). Notably, for two isolates that were correctly assembled by all other assemblers, Autocycler failed to generate a circular consensus chromosome (Fig. 2a), producing highly fragmented draft assemblies instead.\nPlasmid reconstruction was improved by Autocycler or Hybracter compared with Flye\nGiven the absence of a ‘ground truth’ for plasmids in the sequenced isolates, we considered two ‘reference’ plasmid sets generated from the assembly data. The first was the Hybracter (hybrid) reference set, and the second, a manually-curated reference set considering potential plasmids across all assemblers. All plasmids from the Hybracter (hybrid) reference set (n=278) were present in the manually-curated set. However, the manually-curated set included an additional 25 plasmids (total 303 vs 278 plasmids), which were missing from the Hybracter (hybrid) reference set, mostly due to being non-circular (17/25, 68%), or non-circular and of different length (3/25, 12%), while 5/25 (20%) plasmid sets could not be matched to any Hybracter (hybrid) contigs not already in another match-set (all pairwise mash distances >0.2; Table S3). 
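
The three match criteria used against the Hybracter (hybrid) reference set (circularity, length within ±10%, mash distance <0.025) can be sketched as a small classifier. This is an illustrative sketch only, not the study's analysis code: the function name, the dictionary keys, and the choice to measure the 10% length tolerance relative to the reference length are assumptions.

```python
def classify_plasmid(contig, reference, max_mash=0.025, length_tol=0.10):
    # contig: best-matching contig for one assembler, or None if no contig
    #         could be matched (hypothetical keys: circular, length, mash).
    # reference: the reference plasmid (hypothetical keys: circular, length).
    if contig is None:
        return 'absent'
    circ_ok = contig['circular'] == reference['circular']
    # Assumption: the tolerance is relative to the reference length.
    len_ok = abs(contig['length'] - reference['length']) <= length_tol * reference['length']
    mash_ok = contig['mash'] < max_mash
    if circ_ok and len_ok and mash_ok:
        return 'present'
    # No criterion met, or only circularity met: treated as absent.
    if not (len_ok or mash_ok):
        return 'absent'
    return 'misassembled'


reference = {'circular': True, 'length': 10_000}
print(classify_plasmid({'circular': True, 'length': 10_500, 'mash': 0.001}, reference))
print(classify_plasmid({'circular': False, 'length': 10_500, 'mash': 0.001}, reference))
print(classify_plasmid({'circular': True, 'length': 20_000, 'mash': 0.2}, reference))
```

Under these rules a contig that matches only on mash distance (non-circular and of a different length) is still counted as misassembled rather than absent, consistent with the categories reported in Table 1.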
\nCompared with the Hybracter (hybrid) reference set, Flye reconstructed significantly fewer plasmids than all the other assemblers (56% [156/278]; exact binomial test p<0.0001 vs 100% reconstructed by Hybracter (hybrid) and McNemar’s p<0.0001 vs Autocycler, Hybracter (long), Unicycler, and Unicycler (bold)). Among the remaining assemblers, 93-96% of plasmids were reconstructed, which was significantly fewer than 100% of the Hybracter (hybrid) reference set (all exact binomial test p<0.0001; Table 1; Fig. 2b). There was no evidence of a difference between the 96% (267/278) of plasmids reconstructed by Autocycler compared to the other assemblers besides Flye (Hybracter (long) 96% [268/278], McNemar’s p=1 vs Autocycler; Unicycler 96% [266/278], p=1; and Unicycler (bold) 93% [258/278], p=0.095).\nSimilarly, compared with the manually-curated reference set, Flye reconstructed significantly fewer plasmids than all other assemblers (55% [166/303]; pairwise McNemar’s p<0.0001 vs each of the five other assemblers). Flye more frequently missed or misassembled small (<10,000bp) plasmids (Fig. 2c; S2b), and incorrect length was the most common reason for Flye plasmid misassembly (Table 1; S2). Among the remaining assemblers, 90-94% of plasmids were reconstructed compared to the manually-curated reference set. Autocycler reconstructed 94% (285/303) of plasmids, significantly more than Hybracter (long) (90% [272/303]; McNemar’s p=0.014). 
However, there was no evidence of a difference between the number of plasmids reconstructed by Autocycler compared to the other assemblers: Hybracter (hybrid) (91% [276/303]; McNemar’s p=0.066 vs Autocycler), Unicycler (93% [282/303]; p=1), or Unicycler (bold) (90% [274/303]; p=0.296; Table S3; Fig. S2a).\nOf the 10 Autocycler plasmids with a mash distance of 0 to the corresponding Hybracter (hybrid) plasmid, 2/10 had a missing MOB-suite IncFIC replicon annotation despite identical sequence (Fig. S3). In both cases, the Autocycler plasmid was reversed (i.e. the reverse complement strand was represented in the fasta file) compared with the other plasmids. The Flye plasmid sequence was also missing an IncFIC annotation in one of these two plasmids; however, this difference was not observed in the other 232 plasmids across other assemblers with a mash distance of 0 to the Hybracter (hybrid) reference.\n\nTable 1: Chromosomal sequence circularisation and accuracy of plasmid sequence reconstruction for different assemblers using Dorado v5.0.0 super-high accuracy basecalled Nanopore long-reads. Plasmid sequence reconstruction was compared with the Hybracter (hybrid) plasmid reference dataset, defined as circular contigs ≤400,000bp and ≥1,000bp assembled by Hybracter (hybrid) (n=278) across 92 Enterobacterales isolates analysed; the denominator for plasmids was therefore 278 throughout. 
\nMeasure | Autocycler, n (%) | Flye, n (%) | Hybracter (long), n (%) | Hybracter (hybrid), n (%) | Unicycler, n (%) | Unicycler (bold), n (%) | p-value†\nChromosomes circularised (N=92) | 87 (94.6%) | 78 (84.8%) | 80 (87.0%) | 79 (85.9%) | 74 (80.4%) | 78 (84.8%) | <0.0001\nPresent* plasmids (N=278) | 267 (96%) | 156 (56.1%) | 268 (96.4%) | 278 (100%) | 266 (95.7%) | 258 (92.8%) | 0.002\nMisassembled** plasmids (N=278):\n  Non-circular | 0 (0%) | 16 (5.8%) | 2 (0.7%) | 0 (0%) | 4 (1.4%) | 2 (0.7%) |\n  Length mismatch | 1 (0.4%) | 41 (14.7%) | 1 (0.4%) | 0 (0%) | 1 (0.4%) | 1 (0.4%) |\n  Mash distance >0.025 | 0 (0%) | 1 (0.4%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |\n  Non-circular & length mismatch | 0 (0%) | 17 (6.1%) | 2 (0.7%) | 0 (0%) | 3 (1.1%) | 6 (2.2%) |\nAbsent plasmids (N=278) | 10 (3.6%) | 47 (16.9%) | 5 (1.8%) | 0 (0%) | 4 (1.4%) | 11 (4%) |\n*‘Present’ plasmids are defined as contigs meeting all three match criteria: circular, length within 10% and mash distance <0.025 of a Hybracter (hybrid) reference plasmid.\n**Misassembled plasmids are defined as contigs that failed to meet at least one of the matching criteria, or were non-circular and a different length (>10% difference).\n†p-value for Fleiss’ Kappa test for uneven proportions of circularised chromosomes or ‘present’ plasmids across all assemblers.\n\nFigure 2: Structural completeness of 92 pure culture Enterobacterales genome sequences assembled by different long-read only and hybrid assemblers. 
Genome sequences were assembled using Dorado v5.0.0 super-high accuracy basecalled Nanopore long-reads, plus Illumina short-reads for hybrid assembly. a) Number and percentage of isolates with a fully circularised chromosome (dark-coloured tiles) or an incompletely circularised chromosome (light cream tiles) by assembler. b) Upset plot of plasmid assembly status combinations across assemblers. Plasmid sequence reconstruction (assembly status) is compared to a Hybracter (hybrid) plasmid reference dataset, defined as circular contigs ≤400,000bp and ≥1,000bp assembled by Hybracter (hybrid) (n=278) across the 92 Enterobacterales isolates analysed. Dark circles represent ‘present’ plasmids where length (±10%), mash distance (<0.025) and circularity all matched the Hybracter (hybrid) ‘reference’ plasmid; lighter colours indicate misassembled plasmids, where the length difference was >10%, mash distance >0.025, or the contig was non-circular; and the palest shades indicate absent plasmids, where no contig was found matching other plasmids in the reference plasmid set. c) Frequency polygon of length distribution of ‘present’ plasmids by assembler. \n\nAssembly accuracy \nUnpolished Autocycler assemblies are more accurate than non-consensus long-read assemblers, while differences compared with hybrid assemblers are small \nAutocycler was the most accurate long-read only assembler, with 37% of unpolished assemblies (34/92) having 0 SNPs or indels, compared with 11% (10/92) for unpolished Flye and 7% (6/92) for Hybracter (long). 
For unpolished Autocycler, this equated to a median of 0 SNPs/Mb (IQR: 0-0.17) and 0.18 indels/Mb (IQR: 0-0.39), and a median quality value (QV) of Q67 (IQR: 63-100; Fig. 3a-c; Table S4). The differences in accuracy between unpolished Autocycler and unpolished Flye or Hybracter (long) were significant (pairwise Wilcoxon signed rank p<0.0001 for SNPs, indels and QV), while there was no evidence of a difference in accuracy between unpolished Autocycler and Unicycler (normal or bold mode; p=1 for all metrics). There was no evidence of a difference between Flye and Hybracter (long) assemblies (Fig. 3a-c; Table S4). \nMedaka long-read polishing offers small improvements in accuracy for long-read assemblies, although short-read polishing is still marginally more accurate \nMedaka long-read polishing (with un-subsampled reads) improved accuracy for Autocycler and Flye by improving QV and reducing indels: median QV rose from Q67 to Q100 for Autocycler (Wilcoxon signed rank p=0.007) and from Q61 to Q67 for Flye (p<0.0001), and median indels fell from 0.18 to 0 indels/Mb for Autocycler (p=0.006) and from 0.57 to 0.17 indels/Mb for Flye (p<0.001). There was no evidence of a reduction in SNPs (p=1 for both Autocycler and Flye). There was some statistical evidence that Medaka long-read polishing using un-subsampled long-reads was marginally better at reducing indels for Autocycler assemblies than using subsampled reads (change vs Autocycler of median 0 indels/Mb [IQR: -0.19-0; range: -1.64-3.61] for un-subsampled reads, compared to a change of 0 [IQR: -0.18-0; range: -1.09-7.60] indels/Mb for subsampled reads, Wilcoxon signed rank p=0.019; Fig. 3; Table S4). However, this very small difference is not reflected in the medians/IQR of indels/Mb as most isolates had 0 indels (57% [52/92] for Autocycler + Medaka [subsampled] and 65% [60/92] for Autocycler + Medaka [un-subsampled]). 
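The QV figures reported here are easier to interpret once converted back to error counts. The following is a minimal sketch, assuming QV is the standard Phred-scaled per-base error rate and that an error-free assembly is reported at a cap of Q100 (as suggested by the pairing of '0 errors' with QV100 in these results). The function name `phred_qv`, the cap argument and the 5 Mb genome length are illustrative assumptions, not taken from the study code.

```python
import math


def phred_qv(errors: int, genome_length_bp: int, cap: float = 100.0) -> float:
    """Phred-scaled quality value: QV = -10 * log10(errors / length).

    Assumption: assemblies with zero detected errors are reported at `cap`
    (Q100 here), because log10(0) is undefined.
    """
    if errors == 0:
        return cap
    return min(cap, -10.0 * math.log10(errors / genome_length_bp))


# Illustrative genome length (hypothetical, roughly the size of an E. coli genome).
GENOME_BP = 5_000_000

for n_errors in (0, 1, 4, 10):
    print(n_errors, round(phred_qv(n_errors, GENOME_BP), 1))
# 0 errors  -> 100.0
# 1 error   -> 67.0
# 4 errors  -> 61.0
# 10 errors -> 57.0
```

Under these assumptions, a median of Q67 corresponds to roughly one residual error in a 5 Mb genome and Q61 to roughly four, which is why a shift from Q67 to Q100 can reflect the correction of a single SNP or indel.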
\nShort-read polished Autocycler assemblies were more accurate than the best long-read polished Autocycler assemblies (Autocycler + Medaka [un-subsampled]) (change vs unpolished Autocycler of median 0 [IQR: -0.16-0] SNPs/Mb, -0.18 [-0.39-0] indels/Mb, and Q32.6 (Q0-Q35.9) for short-read polishing vs median change 0 [0-0] SNPs/Mb, 0 [-0.19-0] indels/Mb, and Q0 (Q0-Q6.15) for Medaka (un-subsampled) polishing, pairwise Wilcoxon signed rank p=0.0002, p<0.0001 and p<0.0001, respectively; Fig. 3; Table S4). However, the absolute difference was small, and affected only the worst-performing quartile of isolates. The majority, 55% (51/92), of Autocycler + Medaka (un-subsampled reads) polished assemblies had 0 errors (QV100), and only 4% (4/92) of genome sequences had >10 SNPs or indels in the entire assembly, compared with 95% (87/92) of short-read polished Autocycler assemblies having 0 errors and two genome sequences with >10 SNPs or indels (Figs. 3a-c; Table S4). \n\nMean gene length is slightly shorter for Flye assemblies, and is not corrected by long- or short-read polishing \nMean gene length was assessed as a further measure of accuracy, as small errors can result in coding sequence truncation, and shorter average gene length. 
While there was some statistical evidence of a difference in mean gene length between assembler/polisher combinations, with unpolished and long-read polished Flye assemblies having a slightly shorter mean gene length than other assemblers (Friedman’s p<0.0001; pairwise Wilcoxon signed rank p<0.0001 to p=0.01 compared with all other assemblers), the difference was small in magnitude (median of the mean gene length across all isolates of 312bp [IQR: 308-315bp] for Flye + Medaka (subsampled) polishing, vs 312bp [309-316bp] for all other non-Flye assemblers; Fig. 3d). \nGene annotation for MLST loci, resistance, virulence and stress genes is equivalent for long-read and hybrid assemblies \nThere was no evidence of a difference in the numbers of key resistance, virulence and stress genes identified by AMRFinder Plus in assemblies generated by any assembler/polisher combination (Friedman’s p=0.209 for resistance, p=0.736 for virulence, and p=0.687 for stress genes; all pairwise Wilcoxon signed-rank p=1; Table S4). There was high concordance between assemblers on the presence/absence of specific gene variants (all pairwise McNemar’s p>0.209). There was also no evidence of a difference in the proportion of isolates with correctly assigned multi-locus sequence type (MLST; all pairwise McNemar’s p=1, Table S4). Hybracter (long; hybrid), Unicycler (normal; bold), and polished Flye assemblies were annotated with identical MLST-types for all 91 isolates belonging to a species with available MLST-typing schemes (i.e. all isolates except one Serratia marcescens). A single locus in one isolate was ‘uncertain’ for the unpolished Flye assembly (gapA(~2)), and another locus (gyrB(10)) was duplicated in a different isolate amongst Autocycler assemblies. 
Polishing did not correct this duplicated annotation, although the allele was correctly identified. \n\nFigure 3: Assembly accuracy for different assembler/polisher combinations. a) Single nucleotide substitution errors (SNPs) and b) insertion/deletions (indels) identified by re-aligning Illumina short-reads, c) quality value as annotated by Freebayes(48) from Pypolca(17) and d) mean gene length from CheckM2(37) of 12 different assembler/polisher combinations. The y-axes in a), b) and c) are transformed using a pseudo-log scale to facilitate plotting zero values given log(0) is undefined. \n\nDiscussion \nWe evaluated three long-read only bacterial genome assemblers, three hybrid assemblers, and three polishers on 92 clinical Enterobacterales isolates. The consensus long-read assembler, Autocycler, produced the most structurally complete assemblies, circularising 95% of chromosomes. Plasmid reconstruction was comparable between all assemblers except Flye, which underperformed compared with other assemblers for most metrics. 
Autocycler with Medaka polishing was the most accurate long-read only assembler/polisher combination, with a median of 0 SNPs/indels compared to what we consider the ‘gold-standard’ hybrid assembly (i.e. short-read polished Autocycler assemblies). Long-read polishing of Autocycler and Flye assemblies offered small improvements in accuracy compared to unpolished assemblies, although short-read polishing still corrected marginally more errors. There was strong agreement in the annotation of seven-locus MLSTs, resistance, virulence and stress genes, and mean gene length across all assemblers. \nIt is not surprising that long-read assemblers circularise more chromosomes, as long-reads can resolve repetitive regions that short-reads may not. This explains why the long-read first hybrid assembler, Hybracter (hybrid), performed more like the long-read only assemblers than Unicycler did, as Unicycler uses short-reads first to reconstruct overall structure. The ability of Autocycler to circularise eight chromosomes where non-consensus assemblers failed supports the utility of this software(57). Combining 20 input assemblies in Autocycler may reduce the effects of stochastic variation in individual assemblers. The 2/92 isolates where Autocycler produced fragmented assemblies, while some of its input assemblies were complete, are noteworthy. This result is perhaps attributable to regions of input assemblies that are too divergent to resolve, and highlights the need for an iterative approach, where a ‘fallback’ option is available in case of a highly fragmented Autocycler consensus assembly. 
This also emphasises the importance of quality controls (e.g. CheckM2) to flag highly fragmented assemblies, so that manual curation of input assemblies, optimisation of parameters in the consensus process, or reversion to complete input assemblies can be applied in these cases to improve assembly. \nEvaluation of chromosomal and plasmid sequence reconstruction is challenging due to the absence of a ‘ground truth’. For plasmids specifically, there is a risk of mislabelling plasmids by methods reliant on reference databases, which may be incomplete or contain misassembled plasmids. We therefore considered two reference plasmid sets generated from the study data. Compared with both reference sets, none of the six assemblers had ‘perfect’ concordance. Flye performed poorly compared to all other assemblers, missing or misassembling ~45% of plasmids compared with 4-10% for other assemblers. Flye struggled particularly with small <10,000bp plasmids, as reported previously(16, 58). This emphasises the necessity of consensus methods like Autocycler(57), and separate plasmid recovery tools like Plassembler(31) to optimise plasmid reconstruction. The fact that Autocycler (including four Hybracter (long) input assemblies) reconstructed a slightly different set of plasmids to a single Hybracter (long/hybrid) assembly suggests complementarity between these methods, where Autocycler can overcome potential issues related to stochastic variation in individual assemblies. The replicon annotation differences between identical plasmids highlight the risks of relying on plasmid-annotation tools like MOB-suite for plasmid identification(59), and support the use of network-based tools like PLING(60). 
\nThe small differences in nucleotide-level accuracy between long- and short-read polished Autocycler assemblies are likely not in coding regions that are key for downstream analyses. This is evidenced by the strong agreement in MLST profile, resistance, virulence and stress gene annotations, and mean gene length between assemblers. \nA strength of our study is that we consider a relatively large sample of real-world, clinically-relevant isolates. Specifically, our sample included predominantly E. coli and K. pneumoniae, which are the two most important Gram-negative species in England in terms of number of bloodstream infections and burden of AMR(61), and therefore our findings are relevant to public health surveillance in this setting. However, a trade-off with this is the absence of ‘ground truth’ sequences against which to evaluate our assemblies. Other limitations include the empirical assessment of nucleotide-level accuracy, through aligning short-reads to assemblies. Both SNPs and indels were still present in a small number of short-read polished assemblies, potentially representing a baseline level of errors in either Illumina reads or read mapping, and leading to possible overestimation of the error rate of long-read only assemblies. A further limitation is that the performance of Autocycler as a consensus method depends on its input assemblies. Twenty input assemblies were used here, requiring substantial computational time (13,428 CPUh), mostly due to generating assemblies, and resulting in a high carbon footprint, equivalent to driving 164 km (see Environmental Impact Statement). 
Furthermore, a closed consensus chromosome was not achieved for 5% of isolates using default settings. Optimisation of Autocycler input assemblies and parameters, such as weighting contigs from certain ‘more reliable’ assemblers, as done in more recent automated Autocycler v5 pipelines(20), could thus reduce computational load and improve performance. Incorporating a ‘fallback’ option in Autocycler pipelines, for example to revert to one of the complete input assemblies in cases of a highly fragmented Autocycler consensus, may also be of benefit. Finally, generalisability to other bacterial species is limited. Other species may be less well represented than E. coli and Klebsiella spp. in the machine-learning training datasets for basecalling (Dorado) and polishing (Medaka) software, potentially producing different error rates. \nConclusions \nThis assembly comparison is the first benchmarking study to demonstrate structural completeness and accuracy of Nanopore super-high accuracy long-read only bacterial genome assemblies on 92 clinical Enterobacterales isolates, compared with hybrid assembly. The automated consensus long-read assembler, Autocycler, accurately reconstructed assemblies, including plasmids, for these isolates, and is a promising tool for integrating Nanopore long-read only assemblies into an automatable computational pipeline for public health genomics. Ongoing innovation in Nanopore sequencing technology and bioinformatic software may enable further improvements and should continue to be evaluated by the bioinformatics community. 
\nEnvironmental Impact Statement \nThe Nextflow assembly pipeline used for this work ran in 72h on two AMD EPYC 9J14 96-Core Processors (188 total CPUs; 13,428 CPUh), and drew 124.46 kWh. Using Cloud infrastructure based in the United Kingdom, this had a carbon footprint of 28.76 kgCO2e, equivalent to 2.61 tree-years, or 164 km in a car (calculated using green-algorithms.org v3.0(62)). This is a lower bound estimate of the carbon footprint of this work, as it does not account for compute used in pipeline development, downstream statistical analyses, or the energy required to power display screens. The carbon footprint and wider environmental impact of sample processing and shipping have also not been accounted for. \nConflict of interest \nThe authors have no conflicts of interest to declare. \nFunding information \nThis study/research is supported/funded by the National Institute for Health Research (NIHR) Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance (NIHR207397), a partnership between the UK Health Security Agency (UKHSA) and the University of Oxford. This work was also supported by the UKHSA and the NIHR Oxford Biomedical Research Centre (BRC) and the UKHSA PhD Funding Competition. The cloud compute infrastructure for this work was donated by Oracle Corporation Infrastructure. The views expressed are those of the authors and not necessarily those of the NIHR, UKHSA or the Department of Health and Social Care. \nEthical approval and consent to participate \nThis work has been reviewed and approved by the UKHSA Research Ethics & Governance Group (reference NR0429). \nConsent for publication \nAll authors give consent for publication of this work. 
No further consent for publication was required as this work does not include patient identifiable information. \nAuthor contributions \nNS, SL, SH, DC, ASW, JR, KLH, AL, DW, RH and CSB were involved in conceptualisation, funding acquisition, project administration, provision of resources and supervision. VP, GR, KH, CRJ and NEKSUS consortium members were involved in isolate collection and processing. Methodological development and validation of bioinformatic methods and software was done by DN under the supervision of SL and NS. DN, SL and NS were involved with data curation, analysis, investigation, visualisation and writing/editing. All authors approved the final draft. \nAcknowledgments \nThe authors would also like to acknowledge all participating laboratories in the NEKSUS consortium who were responsible for isolate collection, Zeynab Yusuf from UKHSA for her role in sample transportation from UKHSA to Oxford, laboratory and bioinformatician colleagues at the Modernising Medical Microbiology Unit at the University of Oxford for support in methodological development and execution, as well as GENEWIZ Germany GmbH (Leipzig, Germany) for performing long- and short-read sequencing. 
\nIndividuals within the NEKSUS consortium group authorship are (listed alphabetically): \n- Alan McNally (University Hospitals Birmingham NHS Foundation Trust) \n- Caroline Cullerton (The Newcastle-upon-Tyne Hospitals NHS Foundation Trust) \n- Gabriella Shanks (Barts Health NHS Trust) \n- James Price (University Hospital Sussex NHS Foundation Trust) \n- Jasvir Nahl (Leeds Teaching Hospitals NHS Trust) \n- Jenny Bradbury (UKHSA) \n- Jonathan Lambourne (Barts Health NHS Trust) \n- Julie Samuel (The Newcastle-upon-Tyne Hospitals NHS Foundation Trust) \n- Jumoke Sule (UKHSA/Cambridge University Hospitals NHS Foundation Trust) \n- Ian Butler (Barts Health NHS Trust) \n- Kavita Sethi (Leeds Teaching Hospitals NHS Trust) \n- Mark Garvey (University Hospitals Birmingham NHS Foundation Trust) \n- Martin Williams (University Hospitals Bristol and Weston NHS Foundation Trust) \n- Nicholas Brown (Cambridge University Hospitals NHS Foundation Trust) \n- Nicola Childs (North Bristol NHS Trust) \n- Paul Randell (University Hospital Sussex NHS Foundation Trust) \n- Poorvi Patel (Cambridge University Hospitals NHS Foundation Trust) \n- Samuel Stafford (North Bristol NHS Trust) \n- Samuel Tetley (University Hospital Sussex NHS Foundation Trust) \n- Simon Eccles (Manchester University Hospitals NHS Foundation Trust) \n\nSupplementary Figures \nSupplementary Figure S1: Quality control metrics of raw and subsampled Illumina short-reads and Dorado v5.0.0 super accurate basecalled Nanopore long-reads. 
Long-read subsampled set 1 (of 4) is shown for the 92 pure culture isolates. N50 and N50_num (or L50) are both measures of sequence contiguity(63). N50 is the sequence length of the shortest contig at 50% of the total assembly length. N50_num is the smallest number of contigs whose summed length makes up at least half of the genome size. \n\nSupplementary Figure S2: Plasmid sequence reconstruction for 92 Enterobacterales isolates by different long-read only and hybrid assemblers, using the manually-curated consensus ‘reference’ plasmid set (n=303 plasmids). Reference plasmids in the manually curated set are circular contigs between 1,000-400,000bp in length that are present in at least 2 assemblers with a matching length (±10%) and mash distance (<0.025). a) Upset plot showing assembly status combinations of plasmids across assemblers. Dark circles/bars indicate ‘present’ plasmids where length (±10%), mash distance (<0.025) and circularity all matched the ‘reference’ plasmid; lighter colours indicate misassembled plasmids, where the length difference was >10%, mash distance >0.025, or the contig was non-circular; and the palest shades indicate absent plasmids, where no contig was found matching other plasmids in the reference plasmid set. b) Frequency polygon of length distribution of ‘present’ plasmids by assembler. 
\n\nSupplementary Figure S3: Clinker plots of highly similar plasmids with different MOB-suite annotations. Replicon annotations are shown in bright red and labelled. Other mobility- and replication-associated plasmid machinery are shown in pale red and labelled. a) An 85,796bp IncFIA, IncFIB, IncFIC, rep_cluster_2131 plasmid sequence (isolate AF14) with a missing IncFIC annotation in the Autocycler and Flye assemblies (top 2), despite a mash distance of 0 between Autocycler and Hybracter (hybrid) assemblies. b) A 133,309bp IncFIA, IncFIB, IncFIC plasmid sequence (isolate AHB7) with the IncFIC replicon annotation missing from the Autocycler plasmid sequence, despite a mash distance of 0 between the Autocycler and Hybracter (hybrid) plasmid sequences. Note the Autocycler plasmid sequence is reversed and the Flye plasmid has a different starting point for both plasmids. The Flye plasmid is also reversed in a) compared to the bottom 4 assemblers’ plasmids. \n\nSupplementary Table S1: Species of the 92 pure culture Enterobacterales isolates, as assigned by Kraken2(40). 
\nSpecies | Count (percentage) \nEscherichia coli | 58 (63%) \nKlebsiella pneumoniae | 21 (23%) \nKlebsiella oxytoca | 6 (7%) \nKlebsiella aerogenes | 2 (2%) \nEnterobacter hormaechei | 2 (2%) \nCitrobacter freundii | 1 (1%) \nCitrobacter portucalensis | 1 (1%) \nSerratia marcescens | 1 (1%) \n\nSupplementary Table S2: Raw and subsampled sequencing read metrics for Illumina short-read and Nanopore long-read sequences for 92 pure culture Enterobacterales isolates. \nMetric, median (IQR) | Raw reads | Subsampled reads \nRead depth (x genome), short-read | 290 (232-340) | 104 (100-108) \nRead depth (x genome), long-read | 217 (158-313) | 64 (59-70) \nRead length, short-read | 150 (150-150) | 150 (150-150) \nRead length, long-read | 5858 (5366-6338) | 5849 (5398-6370) \nRead quality (Q score), short-read | 23.6 (23.3-23.8) | 23.6 (23.3-23.8) \nRead quality (Q score), long-read | 16.6 (16.4-16.8) | 16.6 (16.4-16.8) 
\n\nSupplementary Table S3: Plasmid reconstruction accuracy of different long-read only and hybrid assemblers for Dorado v5.0.0 super accurate basecalled Nanopore long-reads. Plasmid reconstruction is compared to a manually-curated reference set of ‘consensus’ plasmids (n=303), where ‘consensus’ plasmids were circular contigs 1,000-400,000bp in length present across at least 2 assemblers with a similar length (±10%) and close mash distance (<0.025). \nAssembler | Autocycler n (%) | Flye n (%) | Hybracter (long) n (%) | Hybracter (hybrid) n (%) | Unicycler n (%) | Unicycler (bold) n (%) | p-value† \nPresent* plasmids | 285 (94.1%) | 166 (54.8%) | 272 (89.8%) | 276 (91.1%) | 282 (93.1%) | 274 (90.4%) | <0.0001 \nMisassembled** plasmids: \n  Non-circular | 0 (0%) | 18 (5.9%) | 12 (4.0%) | 13 (4.3%) | 5 (1.7%) | 3 (1%) | \n  Length mismatch | 7 (2.3%) | 50 (16.5%) | 1 (0.3%) | 2 (0.7%) | 6 (2.0%) | 7 (2.3%) | \n  Non-circular and length mismatch | 0 (0%) | 30 (9.9%) | 7 (2.3%) | 7 (2.3%) | 4 (1.3%) | 8 (2.6%) | \nAbsent*** plasmids | 11 (3.6%) | 39 (12.9%) | 11 (3.6%) | 5 (1.7%) | 6 (2.0%) | 11 (3.6%) | \n*‘Present’ plasmids are defined as contigs 1,000-400,000bp in length meeting all three match criteria: circular, length (±10%) and mash distance (<0.025) of the manually curated reference set of plasmids. \n**Misassembled plasmids are defined as contigs that failed to meet at least 1 of the matching criteria, but could still be matched to the reference set based on a more distant mash distance. \n***Absent plasmids were cases where only the circularity matched, or where, for an assembler, no contig could be matched to the rest of the reference plasmid set based on mash distance. \n†p-value for Fleiss’ Kappa test for uneven proportions of ‘present’ plasmids across all assemblers. \n\nSupplementary Table S4: Nucleotide-level accuracy of 12 assembler-polisher combinations (7 long-read only, 5 hybrid). Read-alignment metrics were derived by aligning Illumina short-reads to the assembly from each assembler-polisher combination and calling variants with Freebayes(48) from Pypolca. Mean gene length is derived from CheckM2(37) output files. 7-locus MLST is annotated by mlst(39), and key resistance, virulence and stress genes by AMRFinder Plus(64). 
\nMetric | Autocycler (none) | Autocycler + Medaka (subsampled) | Autocycler + Medaka (un-subsampled) | Autocycler + Polypolish+Pypolca | Flye (none) | Flye + Medaka (subsampled) | Flye + Medaka (un-subsampled) | Flye + Polypolish+Pypolca | Hybracter (long) | Hybracter (hybrid) | Unicycler (normal) | Unicycler (bold) | p-value* \nMLST: \nMLST (N=91)† | 90 (99%)†† | 90 (99%)†† | 90 (99%)†† | 90 (99%)†† | 90 (99%)†† | 91 (100%) | 91 (100%) | 91 (100%) | 91 (100%) | 91 (100%) | 91 (100%) | 91 (100%) | <0.0001 \nRead-alignment metrics: \nSNP/Mb, median (IQR) | 0 (0-0.17) | 0 (0-0) | 0 (0-0) | 0 (0-0) | 0.18 (0-1.17) | 0 (0-0.52) | 0 (0-0.7) | 0 (0-0) | 0.2 (0-1.26) | 0 (0-0) | 0 (0-0.37) | 0 (0-0.37) | <0.0001 \nSNP/Mb, range | 0-6.54 | 0-7.45 | 0-5.27 | 0-3.09 | 0-10.81 | 0-35.38 | 0-41.79 | 0-4.08 | 0-35.37 | 0-10.41 | 0-4 | 0-4 | \nIndels/Mb, median (IQR) | 0.18 (0-0.39) | 0 (0-0.2) | 0 (0-0.19) | 0 (0-0) | 0.57 (0.19-1.13) | 0.18 (0-0.51) | 0.17 (0-0.36) | 0 (0-0) | 0.39 (0.19-0.75) | 0 (0-0) | 0 (0-0.2) | 0 (0-0.34) | <0.0001 \nIndels/Mb, range | 0-9.5 | 0-17.11 | 0-13.12 | 0-5.45 | 0-34.11 | 0-18.66 | 0-22.35 | 0-4.47 | 0-16.71 | 0-12.42 | 0-16.21 | 0-16.21 | \nQV, median (IQR) | 67 (63-100) | 100 (64-100) | 100 (64-100) | 100 (100-100) | 61 (57-67) | 67 (60-100) | 67 (61-100) | 100 (100-100) | 60 (58-64) | 100 (100-100) | 67 (62-100) | 67 (62-100) | <0.0001 \nQV, range | 48.8-100 | 47.3-100 | 48.4-100 | 50.7-100 | 43.48-100 | 42.7-100 | 41.9-100 | 51.7-100 | 42.8-100 | 46.4-100 | 46.9-100 | 46.9-100 | \nCheckM2: \nMean gene length, median (IQR) | 312 (309-316) | 312 (309-316) | 312 (309-316) | 312 (309-316) | 312 (309-316) | 312 (308-315) | 312 (308-316) | 312 (309-315) | 312 (309-316) | 312 (309-316) | 312 (309-316) | 312 (309-316) | <0.0001 \nMean gene length, range | 300-323 | 300-323 | 300-323 | 300-323 | 299-323 | 299-323 | 299-323 | 299-323 | 300-323 | 300-323 | 298-324 | 300-324 | \nAMR Finder Plus: \nAMR, median (IQR) | 4 (1-7) | 4 (1-7) | 4 (1-7) | 4 (1-7) | 4 (1-7) | 4 (1-7) | 4 (1-7) | 4 (1-7) | 4 (1-7) | 4 (1-7) | 3 (1-7) | 4 (1-7) | 0.209 \nAMR, range | 0-18 | 0-18 | 0-18 | 0-18 | 0-18 | 0-18 | 0-18 | 0-18 | 0-17 | 0-18 | 0-17 | 0-17 | \nStress, median (IQR) | 1 (0-3) | 1 (0-3) | 1 (0-3) | 1 (0-3) | 1 (0-3) | 1 (0-3) | 1 (0-3) | 1 (0-3) | 1 (0-3) | 1 (0-3) | 1 (0-2) | 1 (0-2) | 0.687 \nStress, range | 0-26 | 0-26 | 0-26 | 0-26 | 0-26 | 0-26 | 0-26 | 0-26 | 0-26 | 0-26 | 0-26 | 0-26 | \nVirulence, median (IQR) | 1 (0-7) | 1 (0-7) | 1 (0-7) | 1 (0-7) | 1 (0-6) | 1 (0-6) | 1 (0-6) | 1 (0-6) | 1 (0-6) | 1 (0-6) | 1 (0-6) | 1 (0-6) | 0.736 \nVirulence, range | 0-35 | 0-35 | 0-35 | 0-35 | 0-35 | 0-35 | 0-35 | 0-35 | 0-35 | 0-35 | 0-35 | 0-35 | 
\n*p-value for Fleiss’ Kappa test for uneven proportions of isolates with correct MLST profiles annotated across all assemblers, or Friedman’s test for global differences in continuous variables across all assemblers. \n†MLST typing schemes were only available for 91/92 pure culture isolates. The excluded sample was identified as Serratia marcescens. \n††The incorrectly assigned MLST in one isolate by Autocycler consensus assemblies, with or without polishing, was due to duplication of one of the seven housekeeping genes (gyrB(10,10)). \n\nReferences \n1. Sanderson ND, Kapel N, Rodger G, Webster H, Lipworth S, Street TL, et al. Comparison of R9.4.1/Kit10 and R10/Kit12 Oxford Nanopore flowcells and chemistries in bacterial genome reconstruction. Microb Genom. 2023;9(1).10.1099/mgen.0.000910 \n2. Hall MB, Wick RR, Judd LM, Nguyen AN, Steinig EJ, Xie O, et al. Benchmarking reveals superiority of deep learning variant callers on bacterial nanopore sequence data. Elife. 
2024;13. doi:10.7554/eLife.98300\n3. Sereika M, Kirkegaard RH, Karst SM, Michaelsen TY, Sørensen EA, Wollenberg RD, et al. Oxford Nanopore R10.4 long-read sequencing enables the generation of near-finished bacterial genomes from pure cultures and metagenomes without short-read or reference polishing. Nat Methods. 2022;19(7):823-6. doi:10.1038/s41592-022-01539-7\n4. Ni Y, Liu X, Simeneh ZM, Yang M, Li R. Benchmarking of Nanopore R10.4 and R9.4.1 flow cells in single-cell whole-genome amplification and whole-genome shotgun sequencing. Comput Struct Biotechnol J. 2023;21:2352-64. doi:10.1016/j.csbj.2023.03.038\n5. Foster-Nyarko E, Cottingham H, Wick RR, Judd LM, Lam MMC, Wyres KL, et al. Nanopore-only assemblies for genomic surveillance of the global priority drug-resistant pathogen, Klebsiella pneumoniae. Microb Genom. 2023;9(2). doi:10.1099/mgen.0.000936\n6. Wick RR, Judd LM, Holt KE. Assembling the perfect bacterial genome using Oxford Nanopore and Illumina sequencing. PLoS Comput Biol. 2023;19(3):e1010905. doi:10.1371/journal.pcbi.1010905\n7. Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLOS Computational Biology. 2017;13(6):e1005595. doi:10.1371/journal.pcbi.1005595\n8. Heather JM, Chain B. The sequence of sequencers: The history of sequencing DNA. Genomics. 2016;107(1):1-8. doi:10.1016/j.ygeno.2015.11.003\n9. Wang Y, Yang Q, Wang Z. The evolution of nanopore sequencing. Front Genet. 2014;5:449. doi:10.3389/fgene.2014.00449\n10. Simar SR, Hanson BM, Arias CA. Techniques in bacterial strain typing: past, present, and future. Curr Opin Infect Dis. 2021;34(4):339-45. doi:10.1097/qco.0000000000000743\n11. Castaneda-Barba S, Top EM, Stalder T. Plasmids, a molecular cornerstone of antimicrobial resistance in the One Health era. Nat Rev Microbiol. 2024;22(1):18-32. doi:10.1038/s41579-023-00926-x\n12. 
Dimitriu T. Evolution of horizontal transmission in antimicrobial resistance plasmids. Microbiology (Reading). 2022;168(7). doi:10.1099/mic.0.001214\n13. Khezri A, Avershina E, Ahmad R. Hybrid Assembly Provides Improved Resolution of Plasmids, Antimicrobial Resistance Genes, and Virulence Factors in Escherichia coli and Klebsiella pneumoniae Clinical Isolates. Microorganisms. 2021;9(12). doi:10.3390/microorganisms9122560\n14. Arredondo-Alonso S, Willems RJ, van Schaik W, Schurch AC. On the (im)possibility of reconstructing plasmids from whole-genome short-read sequencing data. Microb Genom. 2017;3(10):e000128. doi:10.1099/mgen.0.000128\n15. Sanderson ND, Hopkins KMV, Colpus M, Parker M, Lipworth S, Crook D, et al. Evaluation of the accuracy of bacterial genome reconstruction with Oxford Nanopore R10.4.1 long-read-only sequencing. Microb Genom. 2024;10(5). doi:10.1099/mgen.0.001246\n16. Abdel-Glil MY, Brandt C, Pletz MW, Neubauer H, Sprague LD. High intra-laboratory reproducibility of nanopore sequencing in bacterial species underscores advances in its accuracy. Microbial Genomics. 2025;11(3). doi:10.1099/mgen.0.001372\n17. Bouras G, Judd LM, Edwards RA, Vreugde S, Stinear TP, Wick RR. How low can you go? Short-read polishing of Oxford Nanopore bacterial genome assemblies. Microb Genom. 2024;10(6). doi:10.1099/mgen.0.001254\n18. De Maio N, Shaw LP, Hubbard A, George S, Sanderson ND, Swann J, et al. Comparison of long-read sequencing technologies in the hybrid assembly of complex bacterial genomes. Microb Genom. 2019;5(9). doi:10.1099/mgen.0.000294\n19. Bouras G, Houtak G, Wick RR, Mallawaarachchi V, Roach MJ, Papudeshi B, et al. Hybracter: enabling scalable, automated, complete and accurate bacterial genome assemblies. Microb Genom. 2024;10(5). doi:10.1099/mgen.0.001244\n20. Wick RR. Autocycler. 2025.\n21. Wick RR, Judd LM, Cerdeira LT, Hawkey J, Méric G, Vezina B, et al. Trycycler: consensus long-read assemblies for bacterial genomes. Genome Biology. 2021;22(1):266. doi:10.1186/s13059-021-02483-z\n22. Zhou A, Lin T, Xing J. Evaluating nanopore sequencing data processing pipelines for structural variation identification. Genome Biology. 2019;20(1):237. doi:10.1186/s13059-019-1858-1\n23. Illumina. bcl2fastq2 Conversion Software v2.20. 2017.\n24. Oxford Nanopore Technologies. Dorado v0.9. 2024 [Available from: https://github.com/nanoporetech/dorado?tab=readme-ov-file#alignment].\n25. Shen W, Le S, Li Y, Hu F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLOS ONE. 2016;11(10):e0163962. doi:10.1371/journal.pone.0163962\n26. Hall MB. Rasusa: Randomly subsample sequencing reads to a specified coverage. Journal of Open Source Software. 2022;7(69):3941. doi:10.21105/joss.03941\n27. Kolmogorov M, Yuan J, Lin Y, Pevzner P. Assembly of Long Error-Prone Reads Using Repeat Graphs. Nature Biotechnology. 2019. doi:10.1038/s41587-019-0072-8\n28. Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 2017;27(5):722-36. doi:10.1101/gr.215087.116\n29. Vaser R, Šikić M. Time- and memory-efficient genome assembly with Raven. Nature Computational Science. 2021;1(5):332-6. doi:10.1038/s43588-021-00073-4\n30. Li H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. 
Bioinformatics. 2016;32(14):2103-10. doi:10.1093/bioinformatics/btw152\n31. Bouras G, Sheppard AE, Mallawaarachchi V, Vreugde S. Plassembler: an automated bacterial plasmid assembly tool. Bioinformatics. 2023;39(7). doi:10.1093/bioinformatics/btad409\n32. Lee JY, Kong M, Oh J, Lim J, Chung SH, Kim JM, et al. Comparative evaluation of Nanopore polishing tools for microbial genome assembly and polishing strategies for downstream analysis. Sci Rep. 2021;11(1):20740. doi:10.1038/s41598-021-00178-w\n33. Wick RR, Holt KE. Polypolish: Short-read polishing of long-read bacterial genome assemblies. PLOS Computational Biology. 2022;18(1):e1009802. doi:10.1371/journal.pcbi.1009802\n34. Zimin AV, Salzberg SL. The genome polishing tool POLCA makes fast and accurate corrections in genome assemblies. PLOS Computational Biology. 2020;16(6):e1007981. doi:10.1371/journal.pcbi.1007981\n35. Chklovski A. CheckM2. 1.1.0 ed. 2025.\n36. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015;25(7):1043-55. doi:10.1101/gr.186072.114\n37. Chklovski A, Parks DH, Woodcroft BJ, Tyson GW. CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nature Methods. 2023;20(8):1203-12. doi:10.1038/s41592-023-01940-w\n38. Schwengers O, Jelonek L, Dieckmann MA, Beyvers S, Blom J, Goesmann A. Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics. 2021;7(11). doi:10.1099/mgen.0.000685\n39. Seemann T. mlst. 2.23.0 ed: GitHub.\n40. Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biology. 2019;20(1):257. doi:10.1186/s13059-019-1891-0\n41. Robertson J, Nash JHE. 
MOB-suite: software tools for clustering, reconstruction and typing of plasmids from draft assemblies. Microb Genom. 2018;4(8). doi:10.1099/mgen.0.000206\n42. Robertson J, Bessonov K, Schonfeld J, Nash JHE. Universal whole-sequence-based plasmid typing and its utility to prediction of host range and epidemiological surveillance. Microb Genom. 2020;6(10). doi:10.1099/mgen.0.000435\n43. Ondov BD, Starrett GJ, Sappington A, Kostic A, Koren S, Buck CB, et al. Mash Screen: high-throughput sequence containment estimation for genome discovery. Genome Biology. 2019;20(1):232. doi:10.1186/s13059-019-1841-x\n44. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology. 2016;17(1):132. doi:10.1186/s13059-016-0997-x\n45. Csárdi G, Nepusz T, Traag V, Horvát S, Zanini F, Noom D, et al. igraph: Network Analysis and Visualization in R. R package version 2.1.4 ed. 2025.\n46. Csardi G, Nepusz T. The igraph software package for complex network research. InterJournal, Complex Systems. 2006;1695.\n47. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv. 2013:1303.3997.\n48. Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv. 2012:e1207.3907.\n49. R Core Team. R: A Language and Environment for Statistical Computing. 4.4.1 ed. 2021.\n50. Wickham H. ggplot2: Elegant Graphics for Data Analysis. Verlag New York: Springer; 2016.\n51. 
Wickham H, Averick M, Bryan J, Chang W, McGowan LDA, François R, et al. Welcome to the tidyverse. Journal of Open Source Software. 2019;4(43):1686. doi:10.21105/joss.01686\n52. Auguie B, Antonov A. gridExtra: Miscellaneous Functions for \"Grid\" Graphics. 2.3 ed. 2017.\n53. Wilke CO. cowplot: Streamlined Plot Theme and Plot Annotations for 'ggplot2'. 2024.\n54. Revelle W. psych: Procedures for Psychological, Psychometric, and Personality Research. R package version 2.5.6 ed. Evanston, Illinois: Northwestern University; 2025.\n55. Gamer M, Lemon J, Fellows I, Singh P. irr: Various Coefficients of Interrater Reliability and Agreement. 0.84.1 ed. 2019.\n56. Gilchrist CLM, Chooi Y-H. clinker & clustermap.js: automatic generation of gene cluster comparison figures. Bioinformatics. 2021;37(16):2473-5. doi:10.1093/bioinformatics/btab007\n57. Wick RR, Howden BP, Stinear TP. Autocycler: long-read consensus assembly for bacterial genomes. bioRxiv. 2025. doi:10.1101/2025.05.12.653612\n58. Wick RR, Judd LM, Wyres KL, Holt KE. Recovery of small plasmid sequences via Oxford Nanopore sequencing. Microb Genom. 2021;7(8). doi:10.1099/mgen.0.000631\n59. Douarre PE, Mallet L, Radomski N, Felten A, Mistou MY. Analysis of COMPASS, a New Comprehensive Plasmid Database Revealed Prevalence of Multireplicon and Extensive Diversity of IncF Plasmids. Front Microbiol. 2020;11:483. doi:10.3389/fmicb.2020.00483\n60. Frolova D, Lima L, Roberts L, Bohnenkämper L, Wittler R, Stoye J, et al. Applying rearrangement distances to enable plasmid epidemiology with pling. bioRxiv. 2024:2024.06.12.598623. doi:10.1101/2024.06.12.598623\n61. UK Health Security Agency. English surveillance programme for antimicrobial utilisation and resistance (ESPAUR) Report 2023 to 2024. 2024.\n62. Lannelongue L, Grealey J, Inouye M. Green Algorithms: Quantifying the Carbon Footprint of Computation. Adv Sci (Weinh). 
2021;8(12):2100707. doi:10.1002/advs.202100707\n63. Wikipedia. N50, L50, and related statistics. 2024 [Available from: https://en.wikipedia.org/wiki/N50,_L50,_and_related_statistics].\n64. Feldgarden M, Brover V, Gonzalez-Escalona N, Frye JG, Haendiges J, Haft DH, et al. AMRFinderPlus and the Reference Gene Catalog facilitate examination of the genomic links among antimicrobial resistance, stress response, and virulence. Sci Rep. 2021;11(1):12728. doi:10.1038/s41598-021-91456-0","source_license":"CC-BY-4.0","license_restricted":false}