Sub-consensus haploid variant calling in Long-read sequencing technology

Research Article
Xavier Zair, Andreas Wilm, Miles C Benton, Cheng Yong Tham, Lin Yang, and 4 more
This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-6226988/v1
This work is licensed under a CC BY 4.0 License
Status: Posted (Version 1)

Abstract
Background
Next-generation sequencing (NGS) has become crucial in epidemiology, particularly for tracking viral evolution during outbreaks. While Oxford Nanopore Technologies (ONT) sequencing has gained popularity due to its long-read capabilities and cost-effectiveness, accurately identifying low-frequency variants in long-read data remains challenging. LoFreq, a commonly used variant caller for identifying rare variants in haploid datasets, was developed for short reads.
This study aims to validate the use of LoFreq on long-read data and propose a calibration method to enhance accuracy.
Methods
We constructed truth sets using three plasmids containing SARS-CoV-2 spike genes (7179 bases) with 100 SNVs between them, as well as full-length Escherichia coli genomes. Libraries were sequenced on R9.4.1 and R10.4.1 flow cells. Recall was benchmarked with LoFreq and compared between flow cell chemistries and library size. We also developed a method to adjust base quality (Phred) scores to improve accuracy in long-read datasets.
Results
LoFreq demonstrated high sensitivity for detecting variants at allelic frequencies as low as 0.1, particularly with R10.4.1 chemistry. However, false discovery rates (FDR) were significant, varying by sequencing depth and chemistry. R10.4.1 showed superior performance in both sensitivity and FDR compared to R9.4.1. We propose a Phred score calibration method that significantly reduced false positives while maintaining recall rates in specific cases. However, it was found to be unsuitable for recalling variants at less than 10% and for structural variant discovery as they suffered significant recall loss.
Conclusion
While LoFreq remains useful for low-frequency variant calling in long-read data, high false discovery rates on either flow cell chemistry make its direct use on long-read data inadvisable. Our proposed quality score adjustment allows for improved detection of sub-consensus variants while reducing false discoveries.
Though more fine-tuning is required for broader applicability, these findings address the lack of sub-consensus variant calling tools for long-read datasets and provide an adequate workaround for applying LoFreq to nanopore reads, which is crucial for future outbreak surveillance and pathogen evolution studies.

Introduction
With numerous use cases, ranging from cancer screening to evolutionary studies, next-generation sequencing (NGS) has become instrumental in epidemiology in recent years [1, 2]. During the COVID-19 pandemic, whole-genome sequencing (WGS) emerged as a popular method in epidemiological surveillance [2–4]. At the time of writing, approximately 17 million genome sequences have been uploaded to the Global Initiative on Sharing All Influenza Data (GISAID) alone. This unprecedented level of data generation and sharing is expected to increase further with the growing accessibility of NGS, particularly in future outbreaks. Generated genomes have been invaluable in tracking mutations and viral evolution, influencing policy and interventions such as vaccine development [3, 5–8]. Data analysis pipelines and workflows for NGS have advanced significantly, producing accurate viral consensus genomes [9–11]. However, beyond consensus-level analysis, there has been renewed interest in investigating minor variants as well. This area of research has been emerging as a means to elucidate viral micro-evolution, drug resistance emergence, and immune evasion mechanisms relevant to vaccine efficacy [7, 12–14].
Oxford Nanopore Technologies (ONT) sequencing has captured a significant share of the next-generation sequencing market in recent years. Factors such as long-read capabilities, cost-effectiveness, and ease of use have contributed to this growth [15, 16].
It has been estimated that about a fifth of SARS-CoV-2 NGS reads uploaded to online data repositories were generated on ONT platforms [2]. However, accurately characterising low-frequency variants is challenging, particularly for long-read technologies, due to increased noise levels [17, 18]. Polishing methods have proven viable for increasing overall accuracy substantially. However, they often demand laborious analysis and may compromise depth, overlooking rare variants [19], and in some cases require additional short-read data [20]. Emerging error correction tools, such as HERRO, cannot be applied to shorter reads (< 4,096 bp), which currently limits their application to datasets with longer reads [21]. Some pipelines resort to blacklisting complex genomic regions to reduce noise, but this approach can compromise resolution, and existing variant analysis tools often exhibit poor single nucleotide variant (SNV) discovery at low frequencies [7, 22]. Widely used pipelines like ARTIC-ONT and COG-UK focus on accurate consensus genome generation, limiting variant population analysis to experimental modes [23, 24]. Instead of correcting errors at the read level, some variant calling toolkits filter variants post-calling, targeting error patterns inherent to long-read sequencing [17, 25, 26]. However, these solutions are often genus-specific, and generalizing them to other domains can be challenging.
Presently, tools for precise identification of rare and sub-consensus variants in haploid genomes from fourth-generation NGS data are lacking. Rare variant calling tools, such as LoFreq, were developed for short-read data of higher quality and have not undergone extensive validation on long-read platforms [27]. Polishing to increase read quality is not feasible since it may remove the low-frequency variants of interest.
Traditionally, variant caller validation studies insert variants in silico, artificially modifying bases in NGS datasets for benchmarking purposes [17, 28]. While this method offers flexibility, it may fail to accurately represent real biological events and cannot account for the sequencing biases intrinsic to each technology [29]. The alternative approach to benchmark variant calling performance by comparing shared variants may be inaccurate due to inherent errors in Illumina reads [30]. Relying on cross-concordance, this method frequently misses errors in Illumina reads, as the ground truth is uncertain.
This study aims to validate the use of the rare variant caller, LoFreq, on long-read data utilizing a curated truth set with known variant compositions. Specifically, we intend to benchmark the recall of variants in shorter plasmid reads, which emulate viral genomes, and in full-length Escherichia coli genomes to enable indel recall analysis. Additionally, we propose a calibration method aimed at reducing base quality (or Phred) scores to enhance accuracy in long-read datasets. Our findings indicate that while LoFreq remains useful for low-frequency variant calling in long-read chemistry, the high rate of false discoveries is a significant drawback. However, with quality score adjustment, rare variants can still be detected confidently, while significantly reducing false discoveries across our datasets.
As nanopore sequencing becomes more cost-effective and portable, the proportion of pathogen data generated on these platforms is anticipated to rise in future outbreaks. This study aims to address the lack of sub-consensus variant calling tools for long-read datasets by proposing workarounds to effectively apply LoFreq to nanopore reads.

Methods
[Truth set construction]
Construction of truth set libraries followed two distinct approaches.
The first approach utilised plasmids on a pUNO backbone that contained the Delta variant SARS-CoV-2 spike gene (InvivoGen, p1-spike-v8, 7179 base pairs). Two distinct chimeric plasmids were generated by replacing a 241-base-pair segment (nucleotides 1,750–1,990) with the original Wuhan wet market (wild-type, Clade 19) and Omicron variant spike genes. Chimeric inserts from wild-type and Omicron variants were PCR-amplified from their donor plasmids. Inserts from recipient plasmids were excised via restriction digestion with KpnI and NotI (New England Biolabs), and donor amplicons were incorporated using Gibson assembly (New England Biolabs, NEBuilder® HiFi DNA Assembly Master Mix). Comparative alignment of the plasmids indicates differences at 85 positions. At 70 of these positions, only two alleles occur across the three spike plasmids. The remaining 15 positions carry three different alleles, for instance A, T, and G in Delta, Omicron, and wild-type respectively. When the three plasmids are combined, the cumulative count of detectable variants adds up to 100 (70 + 15×2). This design evaluates both the variant caller's recognition of distinct variants and its ability to differentiate among three nucleotide options at specific positions. Sanger sequencing verified the accurate integration of chimeric sequences. Plasmids were expanded natively in K-12 DH5α (New England Biolabs) strains and purified by Miniprep (QIAprep Spin Miniprep Kit, Qiagen).
The alternate strategy targeted the sequencing of entire genomes from two Escherichia coli K-12 strains. We selected two strains for this purpose, HB101 (Life Technologies, Max Efficiency HB101 Competent Cell) and DH5α (NEB® 5-alpha Competent E. coli High Efficiency), which showed a high pairwise similarity of 95.2%.
Between the two selected E. coli K-12 strains, a total of 1496 substitutional variants, 13770 insertions, and 47990 deletions were identified through pairwise alignment, with an expected frequency of 10 percent in the mixed sample. Each strain was grown overnight in Lennox LB broth base (ThermoFisher Scientific) at 37°C and genomes were subsequently extracted and purified using the PureLink™ Genomic DNA Mini Kit (Invitrogen).
[Sequencing and flow cell chemistry]
To evaluate the performance of the updated pore chemistry, libraries were sequenced on R9.4.1 and R10.4.1 flow cells. Spike plasmids were mixed stoichiometrically to target allele frequencies (AF) of 0.01, 0.05, 0.1, 0.15 and 0.19. Bacterial genomes were mixed exclusively at a frequency of 0.1. R9.4.1 libraries were prepared using ONT's ligation sequencing kit SQK-LSK109 and barcoded with native barcode expansions 1–12 (EXP-NBD104). Libraries for R10.4.1 flow cells were generated using the Native Barcoding Kit 24 (SQK-NBD114.24). Each of the three plasmids was barcoded individually before sequencing on a single flow cell. Each bacterial strain was sequenced on a dedicated flow cell, while a separate flow cell was used for the mixed sample preparation. The flow cells were run for 72 hours on ONT's GridION Mk1 system to achieve maximal depth.
[Bioinformatic analysis]
Raw sequence files (fast5 and pod5) were base-called with Dorado (v0.5.3) using the super accurate model (v4.2.0). 'Passed' reads were filtered using the NanoFilt (v2.8.0) package, with a minimum read quality cut-off of 10. Adaptors and barcodes were trimmed using Porechop (v0.2.3_seqan2.1.1), using the -auto setting. A consensus for the bacterial genomes was generated by de novo assembly with the Flye (v2.9) assembler, serving as a baseline for variant identification between K-12 strains. Consensus genomes were aligned pairwise using MAFFT (v7.310) with the --auto setting.
The resulting de novo reference alignments (genomes) and in-silico references (plasmids) were annotated using the Geneious Prime suite (Java Version 11.0.20.1 + 1) to generate a list of expected variants used for benchmarking. Reference-based alignment was carried out using Minimap2 (v2.17-r941), with the '-ax map-ont' parameters. Samtools (v1.6) was used for the conversion of SAM to BAM, as well as for indexing and sorting of alignment files.
Because depth was suspected to be a critical variable in variant calling, datasets were normalised to a consistent depth, set by the dataset with the minimum depth. To mitigate sampling error, all filtered and trimmed reads underwent twelve rounds of subsampling across all sequencing chemistries using Rasusa (v0.6.1), with the reference length as the input. Bacterial genome datasets necessitated a more targeted, reference-based method of depth normalisation due to their significantly longer genomes. This was achieved using the Jvarkit toolkit (v4b65b20b2) with the following parameters, substituting SAM alignments as input:
java -jar ./jvarkit.jar biostar154220 -d -o
This operation removes all reads exceeding the target depth, retaining only those reads with the highest mapping quality.
For indel analysis on bacterial genomes, a uniform indel quality of 50 was added to BAM files using LoFreq indelqual with parameters -u 50. Variant calling with LoFreq (v2.1.3.1) was run with the -B option, which disables BAQ computation, a step optimized for shorter Illumina reads [31]. Alignment quality, coverage and depth information were verified with the Qualimap (v.2.2.2-dev) toolkit, using the bamqc parameter. Longshot (v0.4.1) and Clair3 (v0.1-r10) were also used to call variants, as briefly elaborated in the results.
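The benchmarking described above compares each set of calls against the list of expected variants. A minimal sketch of that comparison follows, assuming variants are keyed by (position, reference allele, alternate allele) and using the standard definitions recall = TP / (TP + FN) and FDR = FP / (TP + FP). The function name, the key layout, and the toy variants are hypothetical; the manuscript does not state the exact formulas or scripts it used.

```python
from typing import NamedTuple, Set, Tuple

# Assumed variant key: (1-based position, reference allele, alternate allele).
Variant = Tuple[int, str, str]


class Benchmark(NamedTuple):
    true_pos: int
    false_pos: int
    false_neg: int
    recall: float
    fdr: float


def benchmark_calls(expected: Set[Variant], called: Set[Variant]) -> Benchmark:
    """Compare called variants against a truth set (hypothetical helper)."""
    tp = len(expected & called)   # expected variants that were called
    fp = len(called - expected)   # calls absent from the truth set
    fn = len(expected - called)   # expected variants that were missed
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    return Benchmark(tp, fp, fn, recall, fdr)


# Toy data, not taken from the study.
expected = {(1750, "A", "T"), (1762, "G", "C"), (1801, "C", "T"), (1990, "T", "G")}
called = {(1750, "A", "T"), (1762, "G", "C"), (1801, "C", "T"), (2500, "A", "G")}
result = benchmark_calls(expected, called)
print(result)
```

For a clonal library the truth set is empty, so every call counts as a false positive and the FDR is 1 whenever anything is called.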
[Quality score analysis and adjustment]
Reads from clonal plasmids were trimmed and filtered according to the previously mentioned settings. Random extraction of 1 million bases from the processed FASTQ files was performed over 12 iterations using Rasusa's base extraction parameter (-b). Phred scores were extracted from the subsampled FASTQ files using the reformat.sh function with the qchist parameter in the BBMap suite of toolkits (v38.18). Phred distribution graphs were created from the resultant CSV files (Fig. 3a). Minimap2 was used once again to align subsampled FASTQ files to reference genomes, applying the above-mentioned settings. SAM files were reanalyzed using the reformat.sh function with the qahist parameter, and CSV files containing encoded and empirical Phred values were used to generate QvsQ plots (Fig. 3b, 3c).
Base quality (Phred) score reduction was carried out using a custom Nim binary (QUAD v0.2) [32]. Input CSV files formatted with "before" and "after" values were used to modify base quality scores by substituting specific integer Phred values with their corresponding adjustments. From the clonal plasmid QvsQ plots, we derived best-fit curves (quadratic, third-degree polynomial, and Gaussian), which were used to calibrate the corresponding values. Uniform scalar reductions were also applied to all Phred scores, where each score was decreased by a specified value (e.g., a reduction of 1 lowered all scores by 1, while a reduction of 2 lowered all scores by 2), with ranges of 1 to 8 for genomic datasets and 1 to 10 for plasmid datasets. Calibration CSV files associated with each curve can be found in the supplementary materials. A control set with unchanged Phred values was also processed to rule out file manipulation or corruption as causes of reduced false positives, so that any reduction could be attributed to the score changes themselves. The adjusted BAM files were indexed using samtools, followed by variant calling with LoFreq under the previously described parameters.
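The "before"/"after" substitution described above can be illustrated with a short sketch. This is not QUAD's implementation (QUAD is a Nim binary); it is a Python illustration that assumes Phred+33 ASCII encoding, operates on a single quality string, and floors adjusted scores at 0. The function names and the floor are assumptions made for this example.

```python
from typing import Dict

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger) ASCII encoding


def scalar_reduction_map(reduction: int, max_q: int = 93) -> Dict[int, int]:
    """Build a before -> after table lowering every score by `reduction`.

    Scores are floored at 0 (an assumption; the manuscript does not say
    how scores below the reduction value are handled).
    """
    return {q: max(q - reduction, 0) for q in range(max_q + 1)}


def apply_phred_map(quality: str, table: Dict[int, int]) -> str:
    """Substitute each encoded Phred value using the before -> after table."""
    adjusted = []
    for ch in quality:
        q = ord(ch) - PHRED_OFFSET
        # Scores missing from the table are left unchanged.
        adjusted.append(chr(table.get(q, q) + PHRED_OFFSET))
    return "".join(adjusted)


table = scalar_reduction_map(7)
# "I5+!" encodes Q40, Q20, Q10, Q0 under Phred+33.
adjusted = apply_phred_map("I5+!", table)
print(adjusted)
```

The same `apply_phred_map` function would accept a fitted-curve table in place of the scalar one, since both are expressed as integer-to-integer substitutions.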
All graphs and figures were generated with GraphPad Prism (v8.0.1).

Results
[Sensitivity and false discovery]
The use of LoFreq on nanopore data resulted in false calls across all datasets. To quantify the extent of false discovery between nanopore chemistries, reads were subsampled to fixed depths. Within clonal plasmid libraries, which are characterised by a single population of plasmids devoid of variation, LoFreq reported a significant number of variants. Across 12 replicates at a uniform depth of 2,000, we noted averages of 8.25 (SD 1.48) and 2.00 (SD 1.04) false calls on R9.4.1 and R10.4.1 flow cells, respectively (Fig. 1). For mixed plasmid datasets at the same depth, the average false positive calls were 8.67 (SD 1.48) for R9.4.1 and 4.88 (SD 1.08) for R10.4.1. This corresponded to an FDR of 12.9% and 6.3% on R9.4.1 and R10.4.1 flow cells respectively (Table 1).
Within the genomic datasets containing variants with AF of 0.1, the average FDR varied drastically by depth (Fig. 2a). This effect was also observed in plasmid datasets, although to a much smaller extent, presumably due to smaller genome size. On R9.4.1 flow cells, we observed a gradual increase in FDR as depth increased. For R10.4.1, however, we observed a steady decrease in FDR as depth increased, eventually plateauing at a false discovery rate of about 25%.
[Recall per chemistry]
Within plasmid datasets, LoFreq detected all variants at frequencies of 0.1, 0.15, and 0.19 when using R10.4.1 chemistry, although its recall for variants at an allele frequency of 0.05 was notably low. For R9.4.1 flow cells, the recall rates for 0.1 and 0.15 frequency variants were reduced to 90.7% and 95.1%, respectively, while the recall for 0.05 variants improved to 41.1% (Table 1). In detecting variants with AF of 0.1 in bacterial genomes, R9.4.1 flow cells showed a low recall rate of less than 5%, whereas R10.4.1 chemistry achieved a high recall rate of 88.7% at identical depths of 300X.
Indel discovery was comparable between flow cells, although R10.4.1 demonstrated superior recall for insertional variants, specifically at higher sequencing depths (Fig. 2).
[Phred quality score encoding and reduction]
Analysis of per-base Phred quality score distribution revealed differences between chemistries. Both chemistries had a high modal score, with R9.4.1 at 31 and R10.4.1 at 38. Notably, R10.4.1 reads exhibited a significant number of bases with a score of 50, which would be the true modal score. However, we focus on the scores within the range of 31–38, as these offer a more robust estimate of central tendency (Fig. 3a). Our QvsQ plots (Fig. 3b, 3c) showed a slight increase in quality scores starting at Phred scores of 12 for R9.4.1 chemistry, whereas R10.4.1 chemistry began deviating around Phred scores of 20. While R10.4.1 chemistry maintained relatively accurate Phred scores between 40–50, R9.4.1 showed substantial deviation in this range (Fig. 3b). Overall, both chemistries exhibited Phred score deviation from expected values, with R10.4.1 showing greater accuracy than R9.4.1, indicating a need for adjustment.
We explored the impact of fitted calibration as well as scalar reductions in Phred scores on the occurrence of false positives. For this, we focused on the R9.4.1 mixed plasmid dataset to assess potential approaches. Scalar reduction suppressed false positives more effectively than any of the fitted calibration methods (Table 2). Consequently, we aimed to determine the magnitude of Phred score reduction necessary to eliminate false positives in the clonal plasmid datasets. Given that these datasets are clonal, any identified variants could be immediately classified as false positives.
A quality offset of one led to an immediate reduction in false calls. A Phred reduction of seven completely prevented false calls on R9.4.1 pores, whereas R10.4.1 required a smaller Phred reduction of six to eliminate all false positives (Fig. 1). In the non-subsampled clonal R10.4.1 datasets with average depths of 50,000 (not shown), a Phred reduction of six failed to eliminate all variants, necessitating a Phred reduction of seven. As the retention of true positives is of higher importance, we applied further scalar reduction to datasets with mixed populations, as described in the following sections.
[Effect on true calls in mixed plasmid datasets]
The impact of scalar Phred reduction on false call elimination and true call retention exhibited a non-linear pattern and varied across different chemistries and variant populations. Despite significant Phred reductions of up to ten, variants with an AF of 0.19 in R9.4.1 chemistry remained unaffected, while discovery rates for variants at 0.15 frequency and lower were impacted (Table 1). The cumulative recall loss for 0.15 frequency variants was about 9.77% with a Phred reduction of ten, demonstrating a non-linear trend, as reductions of one to seven only caused a 0.57% loss. Unfortunately, we observed a larger recall loss as variant frequency decreased, with variants at AF of 0.1 and 0.05 losing about 19.1% and 36.7% recall respectively. In contrast, variant discovery for frequencies of 0.1 and above remained unaffected on R10.4.1 flow cells. A reduction of about 1.44% in 0.05 variant recall was observed; however, because the baseline variant discovery was low, this led to only a 0.2-fold reduction in total discovery (Table 1). Positions with multiple possible variants (described in our methods) were most susceptible to recall loss.
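To put the scalar reductions above on a probability scale, recall the standard Phred definition Q = -10·log10(P), so that P = 10^(-Q/10). Lowering every score by k therefore multiplies each base's stated error probability by 10^(k/10). The sketch below works through this for the modal scores and the reductions reported above; the function names are illustrative, and how LoFreq weighs these probabilities internally is not modelled here.

```python
def phred_to_error_prob(q: float) -> float:
    """Standard Phred definition: P(error) = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_inflation(reduction: float) -> float:
    """Factor by which the stated error probability grows when Q drops by `reduction`."""
    return 10 ** (reduction / 10)


# Modal scores and clonal-plasmid reductions as reported in the text.
for chemistry, modal_q, k in (("R9.4.1", 31, 7), ("R10.4.1", 38, 6)):
    before = phred_to_error_prob(modal_q)
    after = phred_to_error_prob(modal_q - k)
    print(
        f"{chemistry}: Q{modal_q} -> Q{modal_q - k}, "
        f"stated error {before:.2e} -> {after:.2e} "
        f"(x{error_inflation(k):.2f})"
    )
```

Because the factor depends only on k, a reduction of ten always corresponds to a tenfold higher stated error probability, whatever the starting score.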
[Effect on true calls in mixed bacterial genomic datasets]
Reducing Phred scores by up to eight had a minor impact on the recall of substitutional variants, particularly in R10.4.1 chemistry, since the baseline recall for R9.4.1 was less than 5% (Fig. 4). Conversely, Phred score reductions markedly decreased indel detection in both chemistries. As observed in the plasmid datasets, the reduction in false positives was non-linear, particularly in the R10.4.1 dataset (Fig. 5). Regardless, a reduction of up to eight in Phred scores led to a 70–75% decrease in false positives across both chemistries. However, some false positives persisted on both chemistries even at the maximum Phred score reduction of eight.

Discussion
The analysis of rare and sub-consensus variants in haploid pathogens such as bacteria, fungi, and viruses has a profound impact on our understanding of these organisms [33–35]. Although LoFreq is available for Illumina data, there are currently no adequate solutions for accurate sub-consensus variant calling on ONT reads. Furthermore, the efficacy of LoFreq on long-read data has not been benchmarked, although it continues to be utilized in practice [28, 29, 36]. ONT sequencing is particularly appealing for pathogen whole-genome sequencing due to its portability and low cost. This accessibility and cost-effectiveness have facilitated field-based genome generation during novel pathogen outbreaks [37, 38]. Furthermore, off the field, nanopore technology retains an advantage in haplotype variant studies because of its longer read lengths [8, 39].
In this study, we developed curated truth sets that allowed us to assess the sensitivity and precision of the short-read variant caller, LoFreq, when applied to long-read technology. Our findings indicate that the application of LoFreq to long-read technology is feasible, demonstrating strong recall rates for rare variants.
However, high false discovery rates render it unsuitable for immediate use, necessitating further calibration. Our study showed that the FDR of LoFreq variant calling can be reduced by adjusting the Phred scores of long reads.
Commonly used nanopore variant callers employ techniques such as haplotype phasing and windowed read analysis to mitigate errors. However, since none of these methods are specifically designed as rare variant callers like LoFreq, direct comparisons would be inappropriate. By default, LoFreq employs base alignment quality (BAQ) to enhance the specificity of variant calling. This process is critical, particularly in datasets with higher error rates. Unfortunately, BAQ calculations are inherently slow and optimized for Illumina reads, rendering them unsuitable for longer reads. Therefore, we chose to investigate adjustments to Phred quality scores to address the reduced specificity.
Adjustments to quality scores in sequencing reads have been a topic of discussion for several decades, with proposals targeting Phred scores within Sanger sequencing reads [40, 41]. However, our attempts to calibrate quality scores using conventional methods resulted in varied outcomes, ranging from a minor reduction in false discovery to a detrimental increase in some cases. We believe that the observed increase in false discoveries is due to underfitting, which can artificially inflate Phred scores in certain ranges. Notably, raising quality scores of bases within the critical Phred range of 10–25 led to significant increases in false discoveries, as noted in the quadratic fit. In contrast, curves with more accurate alignments that only reduced Phred scores resulted in a net decrease in false positive rates (Table 2). Unfortunately, we observed that while the curves that fit better were more accurate, they did not lead to significant reductions in false positive occurrences.
Overall, the effect of curve fitting methods on reducing Phred scores was too subdued. This was also evident when using the auto-calibration method provided by BBMap (Bushnell B., sourceforge.net/projects/bbmap/), which offers automatic correction of reported Phred values based on empirical alignment-derived calculations; we noted only minor improvements in false positive reduction. Our observations are in line with the findings of Rivara-Espasandín et al., who concluded that quantizing quality scores ranging from 0–93 into four values of 5, 12, 18, and 24 had very little overall effect on variant identification in human and microbial genomes [42].
On the other hand, a single scalar reduction of Phred scores proved more effective at reducing FDR than any of the tested fits. Further reduction of Phred scores led to additional decreases in FDR, resulting in the complete removal of false positives in some of our datasets. Most importantly, our findings demonstrate that while Phred score reduction effectively reduces false positives, true positives are largely retained. The recall of variants at allelic frequencies of 0.1 and above in both R10.4.1 and R9.4.1 flow cell chemistries was maintained even at aggressive Phred reductions of up to 8. Unfortunately, variants at 5% or less are more adversely affected by Phred score reduction, making this method unsuitable for rare variant identification. Additionally, indel identification is diminished with quality score reduction, making this approach less suitable for applications that rely on accurate indel detection. However, for genomes with little structural variation, such as those of viruses or mutator bacteria, Phred score reduction can be effective.
In conclusion, our method offers an alternative approach for utilizing the rare variant caller LoFreq with nanopore data, particularly given that most tools in the nanopore space prioritize base calling accuracy or alignment refinement. We suggest that further exploration of quality score adjustments may enhance the adaptability of existing tools for nanopore sequencing applications. However, until a dedicated low-frequency variant caller for nanopore sequencing becomes available, our method remains an effective means of accurately identifying low-frequency variants in nanopore long reads.

Abbreviations
AF: Allele frequency
BAQ: Base alignment quality
FDR: False discovery rate
NGS: Next Generation Sequencing
ONT: Oxford Nanopore Technologies
PCR: Polymerase Chain Reaction
QvsQ: Empirical Phred vs. encoded Phred scores
SD: Standard deviation
SNV: Single nucleotide variant
SUP: Super accurate
VCF: Variant call file
WGS: Whole Genome Sequencing

Declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Funding
None
Authors' contributions
X. Zair wrote the main manuscript text and performed the analysis, interpretation, and preparation of figures. AW provided sustained support in the use of LoFreq on the data specific to this manuscript, developed QUAD, and supported its use. MCB, CYT, LY and PFdS, authors affiliated with Oxford Nanopore Technologies, assisted with data generation, provided technical support for data generation on nanopore technology, and contributed to formatting and editing of the final manuscript. OMS, CEH, and SM are the corresponding author's thesis advisors, contributing to project direction, refinement and progress assessment. SM is the corresponding author's main thesis advisor and Principal Investigator. All authors have reviewed the manuscript.
Data Availability
Raw FASTQ truth set reads of both plasmid and genome sequences were uploaded to NCBI Sequence Read Archive (SRA) under bioproject PRJNA1245633 at: https://www.ncbi.nlm.nih.gov/bioproject/1245633
Acknowledgements
None
Authors' information
None

References
1. Chen Z, Azman AS, Chen X, Zou J, Tian Y, Sun R, et al. Global landscape of SARS-CoV-2 genomic surveillance and data sharing. Nat Genet. 2022;54:499–507.
2. Gwinn M, MacCannell D, Armstrong GL. Next Generation Sequencing of Infectious Pathogens. JAMA. 2019;321:893–4.
3. Flores-Vega VR, Monroy-Molina JV, Jiménez-Hernández LE, Torres AG, Santos-Preciado JI, Rosales-Reyes R. SARS-CoV-2: Evolution and Emergence of New Viral Variants. Viruses. 2022;14.
4. Hu T, Liu X, Juan Li, Li J, Zhou H, Li C-X, et al. Bioinformatics resources for SARS-CoV-2 discovery and surveillance. Brief Bioinform. 2021;22:631–41.
5. Kamil JP. Virus variants: GISAID policies incentivize surveillance in global south. Nature. 2021;593:341.
6. Zhou Y, Zhi H, Teng Y. The outbreak of SARS-CoV-2 Omicron lineages, immune escape and vaccine effectivity. J Med Virol. 2022. https://doi.org/10.1002/jmv.28138.
7. Messali S, Rondina A, Giovanetti M, Bonfanti C, Ciccozzi M, Caruso A, et al. Traceability of SARS-CoV-2 transmission through quasispecies analysis. J Med Virol. 2023;95:e28848.
8. Cacciabue M, Currá A, Carrillo E, König G, Gismondi MI. A beginner's guide for FMDV quasispecies analysis: sub-consensus variant detection and haplotype reconstruction using next-generation sequencing. Brief Bioinform. 2019:bbz086.
9. Ferguson JM, Gamaarachchi H, Nguyen T, Gollon A, Tong S, Aquilina-Reid C, et al. InterARTIC: an interactive web application for whole-genome nanopore sequencing analysis of SARS-CoV-2 and other viruses. Bioinformatics. 2021. https://doi.org/10.1093/bioinformatics/btab846.
10. Zhou Y, Zhang L, Xie Y-H, Wu J. Advancements in detection of SARS-CoV-2 infection for confronting COVID-19 pandemics. Lab Invest. 2021:1–10.
11. Sović I, Šikić M, Wilm A, Fenlon SN, Chen S, Nagarajan N. Fast and sensitive mapping of nanopore sequencing reads with GraphMap. Nat Commun. 2016;7:11307.
12. Van Poelvoorde LAE, Delcourt T, Vuylsteke M, De Keersmaecker SCJ, Thomas I, Van Gucht S, et al. A general approach to identify low-frequency variants within influenza samples collected during routine surveillance. Microb Genomics. 2022;8.
13. Van Poelvoorde LAE, Delcourt T, Coucke W, Herman P, De Keersmaecker SCJ, Saelens X, et al. Strategy and Performance Evaluation of Low-Frequency Variant Calling for SARS-CoV-2 Using Targeted Deep Illumina Sequencing. Front Microbiol. 2021;12:747458.
14. Tonkin-Hill G, Martincorena I, Amato R, Lawson AR, Gerstung M, Johnston I, et al. Patterns of within-host genetic diversity in SARS-CoV-2. eLife. 2021;10:e66857.
15. Lin B, Hui J, Mao H. Nanopore Technology and Its Applications in Gene Sequencing. Biosens (Basel). 2021;11:214.
16. Rang FJ, Kloosterman WP, de Ridder J. From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy. Genome Biol. 2018;19:90.
17. O'Donnell CR, Wang H, Dunbar WB. Error analysis of idealized nanopore sequencing: Nanoanalysis. Electrophoresis. 2013;34:2137–44.
18. Brejová B, Boršová K, Hodorová V, Čabanová V, Gafurov A, Fričová D, et al. Nanopore sequencing of SARS-CoV-2: Comparison of short and long PCR-tiling amplicon protocols. PLoS ONE. 2021;16:e0259277.
19. Capraru ID, Romanescu M, Anghel FM, Oancea C, Marian C, Sirbu IO, et al. Identification of Genomic Variants of SARS-CoV-2 Using Nanopore Sequencing. Medicina. 2022;58:1841.
20. Chen Z, Erickson DL, Meng J. Polishing the Oxford Nanopore long-read assemblies of bacterial pathogens with Illumina short reads to improve genomic analyses. Genomics. 2021;113:1366–77.
21. Stanojević D, Lin D, de Sessions PF, Šikić M. Telomere-to-telomere phased genome assembly using error-corrected Simplex nanopore reads. 2024:2024.05.18.594796.
22. Lee JY, Kong M, Oh J, Lim J, Chung SH, Kim J-M, et al. Comparative evaluation of Nanopore polishing tools for microbial genome assembly and polishing strategies for downstream analysis. Sci Rep. 2021;11:20740.
23. Brandt D, Simunovic M, Busche T, Haak M, Belmann P, Jünemann S, et al. Multiple Occurrences of a 168-Nucleotide Deletion in SARS-CoV-2 ORF8, Unnoticed by Standard Amplicon Sequencing and Variant Calling Pipelines. Viruses. 2021;13:1870.
24. Bull RA, Adikari TN, Ferguson JM, Hammond JM, Stevanovski I, Beukers AG, et al. Analytical validity of nanopore sequencing for rapid SARS-CoV-2 genome analysis. Nat Commun. 2020;11:6272.
25. Tan K-T, Slevin MK, Meyerson M, Li H. Identifying and correcting repeat-calling errors in nanopore sequencing of telomeres. Genome Biol. 2022;23:180.
26. Liu Y, Rosikiewicz W, Pan Z, Jillette N, Wang P, Taghbalout A, et al. DNA methylation-calling tools for Oxford Nanopore sequencing: a survey and human epigenome-wide evaluation. Genome Biol. 2021;22:295.
27. Wilm A, Aw PPK, Bertrand D, Yeo GHT, Ong SH, Wong CH, et al. LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. 2012;40:11189–201.
28. González-Recio O, Gutiérrez-Rivas M, Peiró-Pastor R, Aguilera-Sepúlveda P, et al. Sequencing of SARS-CoV-2 genome using different nanopore chemistries. Appl Microbiol Biotechnol. 2021;105:3225–34.
29. Liu Y, Kearney J, Mahmoud M, Kille B, Sedlazeck FJ, Treangen TJ. Rescuing low frequency variants within intra-host viral populations directly from Oxford Nanopore sequencing data. Nat Commun. 2022;13:1321.
30. Andreu-Sánchez S, Chen L, Wang D, Augustijn HE, Zhernakova A, Fu J. A Benchmark of Genetic Variant Calling Pipelines Using Metagenomic Short-Read Sequencing. Front Genet. 2021;12.
31. Li H. Improving SNP discovery by base alignment quality. Bioinformatics. 2011;27:1157–8.
32. Wilm A. QUAD. https://github.com/andreas-wilm/quad. 2024.
33. Fisher MC, Alastruey-Izquierdo A, Berman J, Bicanic T, Bignell EM, Bowyer P, et al. Tackling the emerging threat of antifungal resistance to human health. Nat Rev Microbiol. 2022;20:557–71.
34. Reisenauer A, Kahng LS, McCollum S, Shapiro L. Bacterial DNA Methylation: Cell Cycle Regulator? J Bacteriol. 1999;181:5135–9.
35. Butler MM, Skow DJ, Stephenson RO, Lyden PT, LaMarr WA, Foster KA. Low Frequencies of Resistance among Staphylococcus and Enterococcus Species to the Bactericidal DNA Polymerase Inhibitor N3-Hydroxybutyl 6-(3′-Ethyl-4′-Methylanilino) Uracil. Antimicrob Agents Chemother. 2002;46.
36. Barbé L, Schaeffer J, Besnard A, Jousse S, Wurtzer S, Moulin L, et al. SARS-CoV-2 Whole-Genome Sequencing Using Oxford Nanopore Technology for Variant Monitoring in Wastewaters. Front Microbiol. 2022;13:889811.
37. de Vries EM, Cogan NOI, Gubala AJ, Mee PT, O'Riley KJ, Rodoni BC, et al. Rapid, in-field deployable, avian influenza virus haemagglutinin characterisation tool using MinION technology. Sci Rep. 2022;12:11886.
38. Hoenen T, Groseth A, Rosenke K, Fischer RJ, Hoenen A, Judson SD, et al. Nanopore Sequencing as a Rapidly Deployable Ebola Outbreak Tool. Emerg Infect Dis. 2016;22:331–4.
39. Majidian S, Kahaei MH, de Ridder D. Minimum error correction-based haplotype assembly: Considerations for long read data. PLoS ONE. 2020. https://doi.org/10.1371/journal.pone.0234470.
40. Delahaye C, Nicolas J. Sequencing DNA with nanopores: Troubles and biases. PLoS ONE. 2021;16:e0257521.
41. Li M. Adjust quality scores from alignment and improve sequencing accuracy. Nucleic Acids Res. 2004;32:5183–91.
42. Rivara-Espasandín M, Balestrazzi L, Dufort y Álvarez G, Ochoa I, Seroussi G, Smircich P, et al. Nanopore quality score resolution can be reduced with little effect on downstream analysis. Bioinf Adv. 2022;2:vbac054.
Tables
Tables 1 and 2 are available in the Supplementary Files section.
Additional Declarations
No competing interests reported.
Supplementary Files
callibrationfileminus1.csv; image2.jpeg (Table 1); image4.jpeg (Table 2)
© Research Square 2026 | ISSN 2693-5015 (online)
{"props":{"pageProps":{"initialData":{"identity":"rs-6226988","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":464252519,"identity":"6a35e5c2-54f1-4ff7-a9bf-e6f0d0ca11f8","order_by":0,"name":"Xavier Zair","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAx0lEQVRIiWNgGAWjYFACHiCugDAlSNByhmQtjG2kaOGX7j0m+XPencT5DcwHb/Mw1Nk1ENIiOedcsoHktmeJGw6wJVvzMBxOJqjF4EaO4QPDbYcTNzDwmEnzMBxIJugw+xs5BgcS5xwGOoz/G1BLHWEtBhJAWw42HE5sOMDDBtTCbEdQi8SdM8aGDccOG284zGZsOcfgcAJBLfyze8wkf9Qclp3f3vzwxpuKOnuCWhBxwQx2J0NiA/FaoIAIW0bBKBgFo2CkAQAcHTodZoUx+QAAAABJRU5ErkJggg==","orcid":"","institution":"National University of
Singapore","correspondingAuthor":true,"prefix":"","firstName":"Xavier","middleName":"","lastName":"Zair","suffix":""},{"id":464252520,"identity":"aaf27ec9-e644-4e46-add3-22364a4d1adc","order_by":1,"name":"Andreas Wilm","email":"","orcid":"","institution":"ImmunoScape, Pte Ltd","correspondingAuthor":false,"prefix":"","firstName":"Andreas","middleName":"","lastName":"Wilm","suffix":""},{"id":464252521,"identity":"b7ba9028-9031-4f47-a7e7-8f167903fd11","order_by":2,"name":"Miles C Benton","email":"","orcid":"","institution":"Oxford Nanopore Technologies (United Kingdom)","correspondingAuthor":false,"prefix":"","firstName":"Miles","middleName":"C","lastName":"Benton","suffix":""},{"id":464252522,"identity":"febbc5ea-b91a-459e-9850-05cc95d0f926","order_by":3,"name":"Cheng Yong Tham","email":"","orcid":"","institution":"Oxford Nanopore Technologies (United Kingdom)","correspondingAuthor":false,"prefix":"","firstName":"Cheng","middleName":"Yong","lastName":"Tham","suffix":""},{"id":464252523,"identity":"ccf8c111-1b73-41eb-95ce-c93e16980d09","order_by":4,"name":"Lin Yang","email":"","orcid":"","institution":"Oxford Nanopore Technologies (United Kingdom)","correspondingAuthor":false,"prefix":"","firstName":"Lin","middleName":"","lastName":"Yang","suffix":""},{"id":464252524,"identity":"7b6bc7b1-6389-4de0-8c16-bbb1beb38a00","order_by":5,"name":"Paola Florez De Sessions","email":"","orcid":"","institution":"Oxford Nanopore Technologies (United Kingdom)","correspondingAuthor":false,"prefix":"","firstName":"Paola","middleName":"Florez","lastName":"De Sessions","suffix":""},{"id":464252525,"identity":"c6334bf4-43c3-4df0-9b6c-e90407ccc6a9","order_by":6,"name":"October Michael Sessions","email":"","orcid":"","institution":"National University of Singapore","correspondingAuthor":false,"prefix":"","firstName":"October","middleName":"Michael","lastName":"Sessions","suffix":""},{"id":464252526,"identity":"44a32042-99a8-4149-81b7-6a6fa80f46dd","order_by":7,"name":"Eng Hui 
Chew","email":"","orcid":"","institution":"National University of Singapore","correspondingAuthor":false,"prefix":"","firstName":"Eng","middleName":"Hui","lastName":"Chew","suffix":""},{"id":464252527,"identity":"d9f8aabe-f676-4327-ab31-980e446d4896","order_by":8,"name":"Swapnil Mishra","email":"","orcid":"","institution":"National University of Singapore","correspondingAuthor":false,"prefix":"","firstName":"Swapnil","middleName":"","lastName":"Mishra","suffix":""}],"badges":[],"createdAt":"2025-03-14 13:53:28","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-6226988/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-6226988/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":83816232,"identity":"1de06329-9988-473e-a924-56f4e5f71375","added_by":"auto","created_at":"2025-06-03 07:49:10","extension":"jpeg","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":66702,"visible":true,"origin":"","legend":"\u003cp\u003eSee image above for figure legend.\u003c/p\u003e","description":"","filename":"image1.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-6226988/v1/27f747aa959fbd33fdabad4f.jpeg"},{"id":83817524,"identity":"2aed9748-af4a-48fa-b158-7cacef8f12f1","added_by":"auto","created_at":"2025-06-03 07:57:10","extension":"jpeg","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":65115,"visible":true,"origin":"","legend":"\u003cp\u003eSee image above for figure legend.\u003c/p\u003e","description":"","filename":"image3.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-6226988/v1/75a5c258dffea0efb2237e4d.jpeg"},{"id":83817525,"identity":"63e17c04-7db7-4c5a-8ea0-ebec51fe07ec","added_by":"auto","created_at":"2025-06-03 07:57:10","extension":"jpeg","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":111398,"visible":true,"origin":"","legend":"\u003cp\u003eSee image above for figure 
legend.\u003c/p\u003e","description":"","filename":"image5.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-6226988/v1/681194d29c6a8dcd669cf297.jpeg"},{"id":83816243,"identity":"0c8aa485-3519-4afd-a43e-204e11994948","added_by":"auto","created_at":"2025-06-03 07:49:10","extension":"jpeg","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":79667,"visible":true,"origin":"","legend":"\u003cp\u003eSee image above for figure legend.\u003c/p\u003e","description":"","filename":"image6.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-6226988/v1/97cc26b97300a55ca5753507.jpeg"},{"id":83817526,"identity":"241142c4-7f45-4aab-88c9-22db36c118ab","added_by":"auto","created_at":"2025-06-03 07:57:10","extension":"jpeg","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":118471,"visible":true,"origin":"","legend":"\u003cp\u003eSee image above for figure legend.\u003c/p\u003e","description":"","filename":"image7.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-6226988/v1/44ba768fee91b2afa41655d5.jpeg"},{"id":86880513,"identity":"840078eb-f0ac-4719-8b0b-083baa4259bd","added_by":"auto","created_at":"2025-07-16 16:16:59","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1008518,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6226988/v1/5f3aaf06-6243-4129-b120-ad3946282dd1.pdf"},{"id":83816237,"identity":"68132b88-7ca2-4ed1-b1a6-e8628b3581a3","added_by":"auto","created_at":"2025-06-03 
07:49:10","extension":"csv","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":340,"visible":true,"origin":"","legend":"","description":"","filename":"callibrationfileminus1.csv","url":"https://assets-eu.researchsquare.com/files/rs-6226988/v1/442e67e3da0ca81e13785a21.csv"},{"id":83816246,"identity":"02d026d1-c5a5-4bfc-8732-d98575fe9dba","added_by":"auto","created_at":"2025-06-03 07:49:10","extension":"jpeg","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":177728,"visible":true,"origin":"","legend":"\u003cp\u003eTable 1\u003c/p\u003e","description":"","filename":"image2.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-6226988/v1/313dae65a753ed82fda7de0a.jpeg"},{"id":83816240,"identity":"392158d3-1b54-47c7-8723-9a9c75ea705b","added_by":"auto","created_at":"2025-06-03 07:49:10","extension":"jpeg","order_by":3,"title":"","display":"","copyAsset":false,"role":"supplement","size":49298,"visible":true,"origin":"","legend":"\u003cp\u003eTable 2\u003c/p\u003e","description":"","filename":"image4.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-6226988/v1/8202f344798de4d60c1e62e5.jpeg"}],"financialInterests":"No competing interests reported.","formattedTitle":"Sub-consensus haploid variant calling in Long-read sequencing technology","fulltext":[{"header":"Introduction","content":"\u003cp\u003eWith numerous use cases, ranging from cancer screening to evolutionary studies, Next generation sequencing (NGS) has become instrumental in epidemiology in recent years [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e, \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. 
During the COVID-19 pandemic, whole-genome sequencing (WGS) emerged as a popular method in epidemiological surveillance [\u003cspan additionalcitationids=\"CR3\" citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]. As of the time of writing, approximately 17\u0026nbsp;million genome sequences have been uploaded to the Global Initiative on Sharing All Influenza Data (GISAID) alone. This unprecedented level of data generation and sharing is expected to increase further with the growing accessibility of NGS, particularly in future outbreaks. The generated genomes have been invaluable in tracking mutations and viral evolution, influencing policy and interventions such as vaccine development [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e, \u003cspan additionalcitationids=\"CR6 CR7\" citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e]. Data analysis pipelines and workflows for NGS have advanced significantly, producing accurate viral consensus genomes [\u003cspan additionalcitationids=\"CR10\" citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. However, beyond consensus-level analysis, there has been renewed interest in investigating minor variants as well.
This area of research has been emerging as a means to elucidate viral micro-evolution, drug resistance emergence, and immune evasion mechanisms relevant to vaccine efficacy [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e, \u003cspan additionalcitationids=\"CR13\" citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eOxford Nanopore Technologies (ONT) sequencing technology has captured a significant share of the next-generation sequencing market over the last years. Factors such as long-read capabilities, cost-effectiveness, and ease of use have contributed to this growth [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e, \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]. It has been estimated that about a fifth of SARS-CoV2 NGS reads uploaded to online data repositories were generated on ONT platforms [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. However, accurately characterising low-frequency variants is challenging, particularly for longer read technologies due to increased noise levels [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e, \u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]. Polishing methods have proven viable for increasing overall accuracy substantially. However, they often demand laborious analysis and may compromise depth, overlooking rare variants [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e], and in some cases, require additional short read data [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]. 
Emerging error correction tools, such as HERRO, cannot yet be applied to shorter reads (\u0026lt;\u0026thinsp;4,096\u0026nbsp;bp), which currently limits their use to datasets with longer reads [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]. Some pipelines resort to blacklisting complex genomic regions to reduce noise, but this approach can compromise resolution, and existing variant analysis tools often exhibit poor single nucleotide variant (SNV) discovery at low frequencies [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e, \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]. Widely used pipelines like ARTIC-ONT and COG-UK focus on accurate consensus genome generation, limiting variant population analysis to experimental modes [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e, \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]. Instead of correcting errors at the read level, some variant calling toolkits filter variants post-calling, targeting error patterns inherent to long-read sequencing [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e, \u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e, \u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]. However, these solutions are often genus-specific, and generalizing them to other domains can be challenging.\u003c/p\u003e \u003cp\u003ePresently, tools for precise identification of rare and sub-consensus variants in haploid genomes from fourth-generation NGS data are lacking. Rare variant calling tools, such as LoFreq, were developed for short-read data of higher quality and have not undergone extensive validation on long-read platforms [\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e].
Polishing to increase read quality is not feasible since it may remove the low-frequency variants of interest. Traditionally, variant caller validation studies insert variants in silico, artificially modifying bases in NGS datasets for benchmarking purposes [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e, \u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e]. While this method offers flexibility, it may fail to accurately represent real biological events and cannot account for the sequencing biases intrinsic to each technology [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e]. The alternative approach, benchmarking variant calling performance by comparing variants shared with Illumina calls, may be inaccurate due to inherent errors in Illumina reads [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e]. Because it relies on cross-concordance, this method frequently misses errors in the Illumina reads themselves, as the ground truth is uncertain.\u003c/p\u003e \u003cp\u003eThis study aims to validate the use of the rare variant caller, LoFreq, on long-read data utilizing a curated truth set with known variant compositions. Specifically, we intend to benchmark the recall of variants in shorter plasmid reads, which emulate viral genomes, and in full-length \u003cem\u003eEscherichia coli\u003c/em\u003e genomes to enable indel recall analysis. Additionally, we propose a calibration method aimed at reducing base quality (or Phred) scores to enhance accuracy in long-read datasets. Our findings indicate that while LoFreq remains useful for low-frequency variant calling in long-read chemistry, the high rate of false discoveries is a significant drawback. However, with quality score adjustment, rare variants can still be detected confidently while false discoveries are significantly reduced across our datasets.
As nanopore sequencing becomes more cost-effective and portable, the proportion of pathogen data generated on these platforms is anticipated to rise in future outbreaks. This study aims to address the lack of sub-consensus variant calling tools for long-read datasets by proposing workarounds to effectively apply LoFreq to nanopore reads.\u003c/p\u003e "},{"header":"Methods","content":" \u003cp\u003e \u003cb\u003e[Truth set construction]\u003c/b\u003e Construction of truth set libraries followed two distinct approaches. The first approach utilised plasmids on a pUNO backbone that contained the Delta variant SARS-CoV-2 spike gene (InVivogen, p1-spike-v8, 7179 base-pairs). Two distinct chimeric plasmids were generated by replacing a 241-base-pair segment (between nucleotides 1,750-1,990) with the original Wuhan-wet market (wild-type, Clade 19) and Omicron variant spike genes. Chimeric inserts from wild-type and Omicron variants were PCR-amplified from their donor plasmids. Inserts from recipient plasmids were excised via restriction digestion with \u003cem\u003eKpnI\u003c/em\u003e and \u003cem\u003eNotI\u003c/em\u003e (New England Biolabs), and donor amplicons were incorporated using Gibson assembly (New England Biolabs, NEBuilder\u0026reg; HiFi DNA Assembly Master Mix). Comparative alignment of the plasmids indicates an 85-base-pair difference. Of the 85 variants, 70 display a single nucleotide difference across all three spike plasmids. Conversely, the remaining 15 positions offer more than two possibilities, for instance, A, T, and G from Delta, Omicron, and wild-type respectively. When these three plasmids are combined, the cumulative count of detectable variants adds up to 100 (70\u0026thinsp;+\u0026thinsp;15\u0026times;2). This design evaluates both the variant caller's recognition of distinct variants and its differentiation ability among three nucleotide options at specific positions. Sanger sequencing verified the accurate integration of chimeric sequences. 
Plasmids were expanded natively on K-12 DH5α (New England Biolabs) strains and purified by Miniprep (QIAprep Spin Miniprep Kit, Qiagen).\u003c/p\u003e \u003cp\u003eThe alternate strategy targeted the sequencing of entire genomes from two \u003cem\u003eEscherichia coli\u003c/em\u003e K-12 strains. We selected HB101 (Life Technologies, Max Efficiency HB101 Competent Cell) and DH5α (NEB\u0026reg; 5-alpha Competent \u003cem\u003eE. coli\u003c/em\u003e High Efficiency), which showed a high pairwise similarity of 95.2%. Between the two selected \u003cem\u003eE. coli\u003c/em\u003e K-12 strains, a total of 1,496 substitutional variants, 13,770 insertions, and 47,990 deletions were identified through pairwise alignment, each with an expected allele frequency of 10 percent in the mixed sample. Each strain was grown overnight in Lennox LB broth base (ThermoFisher Scientific) at 37\u0026deg;C and genomes were subsequently extracted and purified using the PureLink\u0026trade; Genomic DNA Mini Kit (Invitrogen).\u003c/p\u003e \u003cp\u003e \u003cb\u003e[Sequencing and flow cell chemistry]\u003c/b\u003e To evaluate the performance of the updated pore chemistry, libraries were sequenced on R9.4.1 and R10.4.1 flow cells. Spike plasmids were mixed at precise stoichiometric ratios, targeting allele frequencies (AF) of 0.01, 0.05, 0.1, 0.15 and 0.19. Bacterial genomes were mixed exclusively at a frequency of 0.1. R9.4.1 libraries were prepared using ONT\u0026rsquo;s Ligation Sequencing Kit (SQK-LSK109) and barcoded with native barcode expansions 1\u0026ndash;12 (EXP-NBD104). Libraries for R10.4.1 flow cells were generated using the Native Barcoding Kit 24 (SQK-NBD114.24). Each of the three plasmids was barcoded individually before sequencing on a single flow cell. Each bacterial strain was sequenced on a dedicated flow cell, while a separate flow cell was used for the mixed sample preparation.
The flow cells were run for 72 hours to achieve maximal depth on ONT\u0026rsquo;s GridION Mk1 system.\u003c/p\u003e \u003cp\u003e \u003cb\u003e[Bioinformatic analysis]\u003c/b\u003e Raw sequence files (fast5 and pod5) were base-called using Dorado (v0.5.3) using the super accurate model (v4.2.0). \u0026lsquo;Passed\u0026rsquo; reads were filtered using the NanoFilt (v2.8.0) package, with a minimum read quality cut-off of 10. Adaptors and barcodes were trimmed using Porechop (v0.2.3_seqan2.1.1), using the -auto setting. A consensus for the bacterial genomes was generated using de novo assembly with the Flye (v2.9) assembler, serving as a baseline for variant identification between K-12 strains. Consensus genomes were aligned pairwise using MAFFT (v7.310) with the --auto setting. The resulting de novo reference alignments (genomes) and in-silico references (plasmids) were annotated using the Geneious Prime suite (Java Version 11.0.20.1\u0026thinsp;+\u0026thinsp;1) to generate a list of expected variants used for benchmarking. Reference-based alignment was carried out using Minimap2 (v2.17-r941), with the \u0026lsquo;-ax map-ont\u0026rsquo; parameters. Samtools (v1.6) was used for the conversion of SAM to BAM, as well as for indexing and sorting of alignment files.\u003c/p\u003e \u003cp\u003eTo ensure consistency, reads were normalised to standard depths, as depth was suspected to be a critical variable in variant calling. Datasets were normalised to a consistent depth, set by the dataset with the minimum depth. To mitigate sampling error, all filtered and trimmed reads underwent twelve rounds of subsampling across all sequencing chemistries, facilitated by the Rasusa (v0.6.1) script, with the reference length as the input. Bacterial genome datasets necessitated a more targeted, reference-based method of depth normalisation due to their significantly longer genomes. 
This was achieved using the Jvarkit toolkit (v4b65b20b2) with the following parameters, substituting SAM alignments as input:\u003c/p\u003e \u003cp\u003ejava -jar ./jvarkit.jar biostar154220 -d \u0026lt;target_depth(int)\u0026gt; -o \u0026lt;output.sam\u0026gt; \u0026lt;input.sam\u0026gt;\u003c/p\u003e \u003cp\u003eThis operation removes all reads exceeding the target depth, retaining only those reads with the highest mapping quality. For indel analysis on bacterial genomes, a uniform indel quality of 50 was added to BAM files using LoFreq indelqual with parameters -u 50. Variant calling with LoFreq (v2.1.3.1) was run with the -B option, which disables BAQ computation, a step optimized for shorter Illumina reads [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e]. Alignment quality, coverage and depth information were verified with the Qualimap (v.2.2.2-dev) toolkit, using the bamqc parameter. Longshot (v0.4.1) and Clair3 (v0.1-r10) were also used to call variants; these are briefly elaborated on in the Results.\u003c/p\u003e \u003cp\u003e \u003cb\u003e[Quality score analysis and adjustment]\u003c/b\u003e Reads from clonal plasmids were trimmed and filtered according to the previously mentioned settings. Random extraction of 1\u0026nbsp;million bases from the processed FASTQ files was performed over 12 iterations using Rasusa\u0026rsquo;s base extraction parameter (-b). Phred scores were extracted from the subsampled FASTQ files using the reformat.sh script with the qchist parameter in the BBMap suite (v38.18). Phred distribution graphs were created from the resultant CSV files (Fig.\u0026nbsp;3a). Minimap2 was used once again to align subsampled FASTQ files to reference genomes, applying the above-mentioned settings.
SAM files were reanalyzed using the reformat.sh function under the qahist parameters and CSV files containing encoded and empirical Phred values were used to generate QvsQ plots (Fig.\u0026nbsp;3b, 3c).\u003c/p\u003e \u003cp\u003eBase quality (Phred) score reduction was carried out using a custom Nim binary (QUAD v0.2) [\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e]. Input CSV files formatted with \"before\" and \"after\" values were used to modify base quality scores by substituting specific integer Phred values with their corresponding adjustments. From the clonal plasmid QvsQ plots, we derived best fit curves\u0026mdash;quadratic, third-degree polynomial, and Gaussian\u0026mdash; which were used to calibrate the corresponding values. Uniform scalar reductions were applied to all Phred scores, where each score was decreased by a specified value (e.g., a reduction of 1 lowered all scores by 1, while a reduction of 2 lowered all scores by 2), with ranges of 1 to 8 for genomic datasets and 1 to 10 for plasmid datasets. Calibration CSV files associated with each curve can be found in the supplementary materials. A control set was also established, maintaining unchanged Phred values to rule out file manipulation or corruption as causes of reduced false positives, attributing the effect to actual change. The adjusted BAM files were indexed using samtools, followed by variant calling with LoFreq under the previously described parameters. All graphs and figures were generated with GraphPad Prism (v8.0.1).\u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003e[Sensitivity and false discovery]\u003c/b\u003e The use of LoFreq on Nanopore data resulted in false calls across all datasets. To quantify the extent of false discovery between nanopore chemistries, reads were sub-sampled to fixed depths. 
Within clonal plasmid libraries, which are characterised by a single population of plasmids devoid of variation, LoFreq reported a significant number of variants. Across 12 replicates at a uniform depth of 2,000, we noted averages of 8.25 (SD 1.48) and 2.00 (SD 1.04) false calls on R9.4.1 and R10.4.1 flow cells, respectively (Fig.\u0026nbsp;1). For mixed plasmid datasets at the same depth, the average false positive calls were 8.67 (SD 1.48) for R9.4.1 and 4.88 (SD 1.08) for R10.4.1. This contributed an FDR of 12.9% and 6.3% on R9.4.1 and R10.4.1 flow cells respectively (Table\u0026nbsp;1). Within the genomic datasets containing variants with AF of 0.1, the average FDR varied drastically by depth (Fig.\u0026nbsp;2a). This effect was also observed in plasmid datasets, although to a much smaller extent, presumably due to smaller genome size. On R9.4.1 flow cells, we observed a gradual increase in FDR as depth increased. For R10.4.1, however, we observed a constant decrease in the FDR with every unit of increase in depth, eventually plateauing at a false discovery rate of about 25%.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003e[Recall per chemistry]\u003c/b\u003e Within plasmid datasets, LoFreq detected all variants at frequencies of 0.1, 0.15, and 0.19 when using R10.4.1 chemistry, although its recall for variants at allele frequency of 0.05 was notably low. For R9.4.1 flow cells, the recall rates for 0.1 and 0.15 frequency variants were reduced to 90.7% and 95.1%, respectively, while the recall for 0.05 variants improved to 41.1% (Table\u0026nbsp;1). In detecting variants with AF of 0.1 in bacterial genomes, R9.4.1 flow cells showed a low recall rate of less than 5%, whereas R10.4.1 chemistry achieved a high recall rate of 88.7% at identical depths of 300X. 
Indel discovery was comparable between flow cells, whereas R10.4.1 demonstrated superior recall for insertional variants specifically at higher sequencing depths (Fig.\u0026nbsp;2).\u003c/p\u003e \u003cp\u003e \u003cb\u003e[Phred quality score encoding and reduction]\u003c/b\u003e Analysis of per-base Phred quality score distribution revealed differences between chemistries. Both chemistries had a high modal score, with R9.4.1 at 31 and R10.4.1 at 38. Notably, R10.4.1 reads exhibited a significant number of bases with a score of 50, which would be the true modal score. However, we focus on the scores within the range of 31\u0026ndash;38, as these offer a more robust estimate of central tendency (Fig.\u0026nbsp;3a). Our QvsQ (Fig.\u0026nbsp;3b, 3c) plots showed a slight increase in quality scores starting at Phred scores of 12 for R9.4.1 chemistry, whereas R10.4.1 chemistry began deviating around Phred scores of 20. While R10.4.1 chemistry maintained relatively accurate Phred scores between 40\u0026ndash;50, R9.4.1 showed substantial deviation in this range (Fig.\u0026nbsp;3b). Overall, both chemistries exhibited Phred score deviation from expected, with R10.4.1 showing greater accuracy compared to R9.4.1, indicating a need for adjustment.\u003c/p\u003e \u003cp\u003eWe explored the impact of fitted calibration as well as scalar reductions in Phred scores on the occurrence of false positives. For this, we focused on the R9.4.1 mixed plasmid dataset to assess potential approaches. Scalar reduction demonstrated the most effective suppression of false positives compared to all fitted calibration methods (Table\u0026nbsp;2). Consequently, we aimed to determine the magnitude of Phred score reduction necessary to eliminate false positives in the clonal plasmid datasets. Given that these datasets are clonal, any identified variants could be immediately classified as false positives. 
A quality offset of one led to an immediate reduction in false calls. A Phred reduction of seven completely prevented false calls on R9.4.1 pores, whereas R10.4.1 required a smaller Phred reduction of six to eliminate all false positives (Fig.\u0026nbsp;1). In the unsampled clonal R10.4.1 datasets with average depths of 50,000 (not shown), a Phred reduction of six failed to eliminate all variant calls, necessitating a Phred reduction of seven. As the retention of true positives is of higher importance, we applied further scalar reduction to datasets with mixed populations, as detailed in the following sections.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003e[Effect of true calls in mixed plasmid datasets]\u003c/b\u003e The impact of scalar Phred reduction on false call elimination and true call retention exhibited a non-linear pattern and varied across different chemistries and variant populations. Despite significant Phred reductions of up to ten, variants with an AF of 0.19 in R9.4.1 chemistry remained unaffected, while discovery rates for variants at 0.15 frequency and lower were impacted (Table\u0026nbsp;1). The cumulative recall loss for 0.15 frequency variants was about 9.77% with a Phred reduction of ten, demonstrating a non-linear trend, as reductions between \u0026minus;\u0026thinsp;1 and \u0026minus;\u0026thinsp;7 only caused a 0.57% loss. Unfortunately, we observed a larger recall loss as variant frequency decreased, with variants at AF of 0.1 and 0.05 losing about 19.1% and 36.7% recall, respectively. In contrast, variant discovery for frequencies of 0.1 and above remained unaffected on R10.4.1 flow cells. A reduction of about 1.44% in 0.05 variant recall was observed; however, because the baseline variant discovery was low, this led to only a 0.2-fold reduction in total discovery (Table\u0026nbsp;1). 
Positions with multiple possible variants (described in our methods) were most susceptible to recall loss.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003e[Effect of true calls in mixed bacterial genomic datasets]\u003c/b\u003e Reducing Phred scores by up to eight had a minor impact on the recall of substitutional variants, particularly in R10.4.1 chemistry, since the baseline recall for R9.4.1 was less than 5% (Fig.\u0026nbsp;4).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eConversely, Phred score reductions markedly decreased indel detection in both chemistries. As observed in the plasmid datasets, the reduction in false positives was non-linear, particularly in the R10.4.1 dataset (Fig.\u0026nbsp;5). Regardless, a reduction of up to eight in Phred scores led to a 70\u0026ndash;75% decrease in false positives across both chemistries. However, some false positives persisted on both chemistries even at the maximum Phred score reduction of eight.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eThe analysis of rare and sub-consensus variants in haploid pathogens such as bacteria, fungi, and viruses has a profound impact on our understanding of these organisms [\u003cspan additionalcitationids=\"CR34\" citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e]. Although LoFreq is available for Illumina data, there are currently no adequate solutions for accurate sub-consensus variant calling on ONT reads. Furthermore, the efficacy of LoFreq on long-read data has not been benchmarked, yet it continues to be used in practice [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e, \u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e, \u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e]. 
ONT sequencing is particularly appealing for pathogen whole-genome sequencing due to its portability and low cost. This accessibility and cost-effectiveness have facilitated field-based genome generation during novel pathogen outbreaks [\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e, \u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e]. Furthermore, off the field, nanopore technology retains an advantage in haplotype variant studies because of its longer read lengths [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e, \u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e]. In this study, we developed curated truth sets that allowed us to assess sensitivity and precision of the short-read variant caller, LoFreq, when applied to long-read technology. Our findings indicate that the application of LoFreq to long-read technology is feasible, demonstrating strong recall rates for rare variants. However, high false discovery rates render it unsuitable for immediate use, necessitating further calibration.\u003c/p\u003e \u003cp\u003eOur study showed that the FDR of LoFreq variant calling can be reduced by adjusting the Phred scores of long reads. Commonly used nanopore variant callers employ techniques such as haplotype phasing and windowed read analysis to mitigate errors. However, since none of these methods are specifically designed as rare variant callers like LoFreq, direct comparisons would be inappropriate. By default, LoFreq employs base alignment quality (BAQ) to enhance the specificity of variant calling. This process is critical, particularly in datasets with higher error rates. Unfortunately, BAQ calculations are inherently slow and optimized for Illumina reads, thus rendering them unsuitable for longer reads. 
Therefore, we have chosen to investigate adjustments to Phred quality scores to address the reduced specificity.\u003c/p\u003e \u003cp\u003eAdjustments to quality scores in sequencing reads have been a topic of discussion for several decades, with proposals targeting Phred scores within Sanger sequencing reads [\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e, \u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e]. However, our attempts to calibrate quality scores using conventional methods have resulted in varied outcomes, ranging from a minor reduction in false discovery to a detrimental increase in some cases. We believe that the observed increase in false discoveries is due to underfitting, which can artificially inflate Phred scores in certain ranges. Notably, raising quality scores of bases within the critical Phred range of 10\u0026ndash;25 led to significant increases in false discoveries as noted in the quadratic fit. In contrast, curves with more accurate alignments that only reduced Phred scores resulted in a net decrease in false positive rates (Table\u0026nbsp;2). Unfortunately, we observed that while the curves that fit better were more accurate, they did not lead to significant reductions in false positive occurrences. Overall, the effect of curve fitting methods on reducing Phred scores was too subdued. This was also evident when using the auto-calibration method provided by BBMap: toolkits such as BBMap (BBMap - Bushnell B. - sourceforge.net/projects/bbmap/) offer automatic correction of reported Phred values based on empirical alignment-derived calculations; however, we noted only minor improvements in false positive reduction. 
Our observations are in line with the findings of Rivara-Espasand\u0026iacute;n et al., who concluded that quantizing quality scores ranging from 0\u0026ndash;93 into four values of 5, 12, 18, and 24 had very little overall effect on variant identification in human and microbial genomes [\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e]. On the other hand, a single scalar reduction of Phred scores proved more effective at reducing FDR than any of the tested fits. Further reduction of Phred scores led to additional decreases in FDR, resulting in the complete removal of false positives in some of our datasets.\u003c/p\u003e \u003cp\u003eMost importantly, our findings demonstrate that while Phred score reduction effectively reduces false positives, true positives are largely retained. The recall of variants at allelic frequencies of 0.1 and above was maintained on both R10.4.1 and R9.4.1 flow cell chemistries, even at aggressive Phred reductions of up to 8. Unfortunately, variants at 5% or less are more adversely affected by Phred score reduction, making this method unsuitable for rare variant identification. Additionally, indel identification is diminished with quality score reduction, making this approach less suitable for applications that rely on accurate indel detection. However, for genomes with little structural variation, such as those of viruses or mutator bacteria, Phred score reduction can be effective. In conclusion, our method offers an alternative approach for utilizing the rare variant caller LoFreq with nanopore data, particularly given that most tools in the nanopore space prioritize base calling accuracy or alignment refinement. We suggest that further exploration of quality score adjustments may enhance the adaptability of existing tools for nanopore sequencing applications. 
However, until a dedicated low-frequency variant caller for nanopore sequencing becomes available, our method remains an effective means of accurately identifying low-frequency variants in nanopore long reads.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cp\u003eAF \u0026nbsp; \u0026nbsp; \u0026nbsp;\u0026nbsp;Allele frequency\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eBAQ\u0026nbsp; \u0026nbsp;\u0026nbsp;Base alignment quality\u003c/p\u003e\n\u003cp\u003eFDR\u0026nbsp; \u0026nbsp;\u0026nbsp;False discovery rate\u003c/p\u003e\n\u003cp\u003eNGS \u0026nbsp; \u0026nbsp;Next Generation Sequencing\u003c/p\u003e\n\u003cp\u003eONT\u0026nbsp; \u0026nbsp;\u0026nbsp;Oxford Nanopore Technologies\u003c/p\u003e\n\u003cp\u003ePCR\u0026nbsp; \u0026nbsp;\u0026nbsp;Polymerase Chain Reaction\u003c/p\u003e\n\u003cp\u003eQvsQ\u0026nbsp;\u0026nbsp;Empirical Phred vs. encoded Phred scores\u003c/p\u003e\n\u003cp\u003eSD \u0026nbsp; \u0026nbsp; \u0026nbsp;\u0026nbsp;Standard deviation\u003c/p\u003e\n\u003cp\u003eSNV\u0026nbsp; \u0026nbsp;\u0026nbsp;Single nucleotide variant\u003c/p\u003e\n\u003cp\u003eSUP \u0026nbsp; \u0026nbsp;\u0026nbsp;Super accurate\u003c/p\u003e\n\u003cp\u003eVCF \u0026nbsp; \u0026nbsp;\u0026nbsp;Variant call file\u003c/p\u003e\n\u003cp\u003eWGS \u0026nbsp; Whole Genome Sequencing\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003eEthics approval and consent to participate\u003c/p\u003e\n\u003cp\u003eNot applicable\u003c/p\u003e\n\u003cp\u003eConsent for publication\u003c/p\u003e\n\u003cp\u003eNot applicable\u003c/p\u003e\n\u003cp\u003eFunding\u003c/p\u003e\n\u003cp\u003eNone\u003c/p\u003e\n\u003cp\u003eAuthors\u0026apos; contributions\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eX.Zair generated the main body manuscript text as well as analysis, interpretation, preparation of figures\u003c/p\u003e\n\u003cp\u003eAW contributed to sustained support in the use of LoFreq on data specific to the manuscript, as well as the development of QUAD and the 
support of its use.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eMCB, CYT, LY and PFdS, authors affiliated with Oxford Nanopore Technologies, assisted with data generation, technical support for data generation on nanopore technology, and formatting and editing of the final manuscript.\u003c/p\u003e\n\u003cp\u003eOMS, CEH, and SM are the corresponding author\u0026apos;s thesis advisors, contributing to project direction, refinement and progress assessment.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eSM is the corresponding author\u0026apos;s main thesis advisor and Principal Investigator.\u003c/p\u003e\n\u003cp\u003eAll authors have reviewed the manuscript.\u003c/p\u003e\n\u003cp\u003eData Availability\u003c/p\u003e\n\u003cp\u003eRaw FASTQ truth set reads of both plasmid and genome sequences were uploaded to the NCBI Sequence Read Archive (SRA) under BioProject PRJNA1245633 at: https://www.ncbi.nlm.nih.gov/bioproject/1245633\u0026nbsp;\u003c/p\u003e\n\n\u003cp\u003eAcknowledgements\u003c/p\u003e\n\u003cp\u003eNone\u003c/p\u003e\n\u003cp\u003eAuthors\u0026apos; information\u003c/p\u003e\n\u003cp\u003eNone\u003cbr\u003e \u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eChen Z, Azman AS, Chen X, Zou J, Tian Y, Sun R, et al. Global landscape of SARS-CoV-2 genomic surveillance and data sharing. Nat Genet. 2022;54:499\u0026ndash;507.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGwinn M, MacCannell D, Armstrong GL. Next Generation Sequencing of Infectious Pathogens. JAMA. 2019;321:893\u0026ndash;4.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVer\u0026oacute;nica Roxana Flores-Vega. Jessica Viridiana Monroy-Molina, Luis Enrique Jim\u0026eacute;nez-Hern\u0026aacute;ndez, Torres AG, Jos\u0026eacute; Ignacio Santos-Preciado, Rosales-Reyes R. SARS-CoV-2: Evolution and Emergence of New Viral Variants. Viruses. 
2022;14.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHu T, Liu X, Juan Li, Li J, Zhou H, Li C-X, et al. Bioinformatics resources for SARS-CoV-2 discovery and surveillance. Brief Bioinform. 2021;22:631\u0026ndash;41.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKamil JP. Virus variants: GISAID policies incentivize surveillance in global south. Nature. 2021;593:341.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYongbing Zhou H, Zhi, Yong Teng. The outbreak of SARS-CoV‐2 Omicron lineages, immune escape and vaccine effectivity. J Med Virol. 2022. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1002/jmv.28138\u003c/span\u003e\u003cspan address=\"10.1002/jmv.28138\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMessali S, Rondina A, Giovanetti M, Bonfanti C, Ciccozzi M, Caruso A, et al. Traceability of SARS-CoV‐2 transmission through quasispecies analysis. J Med Virol. 2023;95:e28848.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCacciabue M, Curr\u0026aacute; A, Carrillo E, K\u0026ouml;nig G, Gismondi MI. A beginner\u0026rsquo;s guide for FMDV quasispecies analysis: sub-consensus variant detection and haplotype reconstruction using next-generation sequencing. Brief Bioinform. 2019;:bbz086.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJames M, Ferguson H, Gamaarachchi T, Nguyen A, Gollon S, Tong C, Aquilina-Reid, et al. Bioinformatics. 2021. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1093/bioinformatics/btab846\u003c/span\u003e\u003cspan address=\"10.1093/bioinformatics/btab846\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. 
InterARTIC: an interactive web application for whole-genome nanopore sequencing analysis of SARS-CoV-2 and other viruses.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYuan Zhou L, Zhang Y-H, Xie, Wu J. Advancements in detection of SARS-CoV-2 infection for confronting COVID-19 pandemics. Lab Invest. 2021;:1\u0026ndash;10.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSović I, Šikić M, Wilm A, Fenlon SN, Chen S, Nagarajan N. Fast and sensitive mapping of nanopore sequencing reads with GraphMap. Nat Commun. 2016;7:11307.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVan Poelvoorde LAE, Delcourt T, Vuylsteke M, De Keersmaecker SCJ, Thomas I, Van Gucht S et al. A general approach to identify low-frequency variants within influenza samples collected during routine surveillance. Microb Genomics. 2022;8.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVan Poelvoorde LAE, Delcourt T, Coucke W, Herman P, De Keersmaecker SCJ, Saelens X, et al. Strategy and Performance Evaluation of Low-Frequency Variant Calling for SARS-CoV-2 Using Targeted Deep Illumina Sequencing. Front Microbiol. 2021;12:747458.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTonkin-Hill G, Martincorena I, Amato R, Lawson AR, Gerstung M, Johnston I, et al. Patterns of within-host genetic diversity in SARS-CoV-2. eLife. 2021;10:e66857.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLin B, Hui J, Mao H. Nanopore Technology and Its Applications in Gene Sequencing. Biosens (Basel). 2021;11:214.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRang FJ, Kloosterman WP, de Ridder J. From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy. Genome Biol. 2018;19:90\u0026ndash;90.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eO\u0026rsquo;Donnell CR, Wang H, Dunbar WB. Error analysis of idealized nanopore sequencing: Nanoanalysis. Electrophoresis. 
2013;34:2137\u0026ndash;44.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBrejov\u0026aacute; B, Boršov\u0026aacute; K, Hodorov\u0026aacute; V, Čabanov\u0026aacute; V, Gafurov A, Fričov\u0026aacute; D, et al. Nanopore sequencing of SARS-CoV-2: Comparison of short and long PCR-tiling amplicon protocols. PLoS ONE. 2021;16:e0259277.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCapraru ID, Romanescu M, Anghel FM, Oancea C, Marian C, Sirbu IO, et al. Identification of Genomic Variants of SARS-CoV-2 Using Nanopore Sequencing. Medicina. 2022;58:1841.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen Z, Erickson DL, Meng J. Polishing the Oxford Nanopore long-read assemblies of bacterial pathogens with Illumina short reads to improve genomic analyses. Genomics. 2021;113:1366\u0026ndash;77.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eStanojević D, Lin D, de Sessions PF, Šikić M. Telomere-to-telomere phased genome assembly using error-corrected Simplex nanopore reads. 2024;:2024.05.18.594796.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLee JY, Kong M, Oh J, Lim J, Chung SH, Kim J-M, et al. Comparative evaluation of Nanopore polishing tools for microbial genome assembly and polishing strategies for downstream analysis. Sci Rep. 2021;11:20740.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBrandt D, Simunovic M, Busche T, Haak M, Belmann P, J\u0026uuml;nemann S, et al. Multiple Occurrences of a 168-Nucleotide Deletion in SARS-CoV-2 ORF8, Unnoticed by Standard Amplicon Sequencing and Variant Calling Pipelines. Viruses. 2021;13:1870.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRowena A, Bull, Adikari TN, Ferguson JM, Hammond JM, Stevanovski I, Beukers AG, et al. Analytical validity of nanopore sequencing for rapid SARS-CoV-2 genome analysis. Nat Commun. 
2020;11:6272.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTan K-T, Slevin MK, Meyerson M, Li H. Identifying and correcting repeat-calling errors in nanopore sequencing of telomeres. Genome Biol. 2022;23:180.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu Y, Rosikiewicz W, Pan Z, Jillette N, Wang P, Taghbalout A, et al. DNA methylation-calling tools for Oxford Nanopore sequencing: a survey and human epigenome-wide evaluation. Genome Biol. 2021;22:295.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWilm A, Aw PPK, Bertrand D, Yeo GHT, Ong SH, Wong CH, et al. LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. 2012;40:11189\u0026ndash;201.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGonz\u0026aacute;lez-Recio O, Monica Gutierrez-Rivas, Guti\u0026eacute;rrez-Rivas M, Peir\u0026oacute; R, Peiro-Pastor R, Aguilera-Sep\u0026uacute;lveda P, et al. Sequencing of SARS-CoV-2 genome using different nanopore chemistries. Appl Microbiol Biotechnol. 2021;105:3225\u0026ndash;34.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu Y, Kearney J, Mahmoud M, Kille B, Sedlazeck FJ, Treangen TJ. Rescuing low frequency variants within intra-host viral populations directly from Oxford Nanopore sequencing data. Nat Commun. 2022;13:1321.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAndreu-S\u0026aacute;nchez S, Chen L, Wang D, Augustijn HE, Zhernakova A, Fu J. A Benchmark of Genetic Variant Calling Pipelines Using Metagenomic Short-Read Sequencing. Front Genet. 2021;12.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi H. Improving SNP discovery by base alignment quality. Bioinformatics. 2011;27:1157\u0026ndash;8.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWilm A. 
QUAD - \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/andreas-wilm/quad\u003c/span\u003e\u003cspan address=\"https://github.com/andreas-wilm/quad\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. 2024.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFisher MC, Alastruey-Izquierdo A, Berman J, Bicanic T, Bignell EM, Bowyer P, et al. Tackling the emerging threat of antifungal resistance to human health. Nat Rev Microbiol. 2022;20:557\u0026ndash;71.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eReisenauer A, Kahng LS, McCollum S, Shapiro L, Bacterial. DNA Methylation: Cell Cycle Regulator? J Bacteriol. 1999;181:5135\u0026ndash;9.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eButler MM, Skow DJ, Stephenson RO, Lyden PT, LaMarr WA, Foster KA. Low Frequencies of Resistance among Staphylococcus and Enterococcus Species to the Bactericidal DNA Polymerase Inhibitor N3-Hydroxybutyl 6-(3Ј-Ethyl-4Ј-Methylanilino) Uracil. Volume 46. ANTIMICROB AGENTS CHEMOTHER; 2002.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBarb\u0026eacute; L, Schaeffer J, Besnard A, Jousse S, Wurtzer S, Moulin L, et al. SARS-CoV-2 Whole-Genome Sequencing Using Oxford Nanopore Technology for Variant Monitoring in Wastewaters. Front Microbiol. 2022;13:889811.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ede Vries EM, Cogan NOI, Gubala AJ, Mee PT, O\u0026rsquo;Riley KJ, Rodoni BC, et al. Rapid, in-field deployable, avian influenza virus haemagglutinin characterisation tool using MinION technology. Sci Rep. 2022;12:11886.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHoenen T, Groseth A, Rosenke K, Fischer RJ, Hoenen A, Judson SD, et al. Nanopore Sequencing as a Rapidly Deployable Ebola Outbreak Tool. Emerg Infect Dis. 2016;22:331\u0026ndash;4.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMajidian S, Kahaei MH, De DR. 
Minimum error correction-based haplotype assembly: Considerations for long read data. PLoS ONE. 2020. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1371/journal.pone.0234470\u003c/span\u003e\u003cspan address=\"10.1371/journal.pone.0234470\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDelahaye C, Nicolas J. Sequencing DNA with nanopores: Troubles and biases. PLoS ONE. 2021;16:e0257521.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi M. Adjust quality scores from alignment and improve sequencing accuracy. Nucleic Acids Res. 2004;32:5183\u0026ndash;91.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRivara-Espasand\u0026iacute;n M, Balestrazzi L, Dufort y \u0026Aacute;lvarez G, Ochoa I, Seroussi G, Smircich P, et al. Nanopore quality score resolution can be reduced with little effect on downstream analysis. Bioinf Adv. 
2022;2:vbac054.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"},{"header":"Tables","content":"\u003cp\u003eTables 1 and 2 are available in the Supplementary Files section.\u003c/p\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-6226988/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6226988/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e \u003cp\u003eNext-generation sequencing (NGS) has become crucial in epidemiology, particularly for tracking viral evolution during outbreaks. While Oxford Nanopore Technologies (ONT) sequencing has gained popularity due to its long-read capabilities and cost-effectiveness, accurately identifying low-frequency variants in long-read data remains challenging. LoFreq, a commonly used variant caller for identifying rare variants in haploid datasets, was developed for short reads. 
This study aims to validate the use of LoFreq on long-read data and propose a calibration method to enhance accuracy.\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e \u003cp\u003eWe constructed truth sets using three plasmids containing SARS-CoV-2 spike genes (7179 bases) with 100 SNVs between them, as well as full-length Escherichia coli genomes. Libraries were sequenced on R9.4.1 and R10.4.1 flow cells. Recall was benchmarked with LoFreq and compared between flow cell chemistries and library size. We also developed a method to adjust base quality (Phred) scores to improve accuracy in long-read datasets.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eLoFreq demonstrated high sensitivity for detecting variants at allelic frequencies as low as 0.1, particularly with R10.4.1 chemistry. However, false discovery rates (FDR) were significant, varying by sequencing depth and chemistry. R10.4.1 showed superior performance in both sensitivity and FDR compared to R9.4.1. We propose a Phred score calibration method that significantly reduced false positives while maintaining recall rates in specific cases. However, it was found to be unsuitable for recalling variants at less than 10% and for structural variant discovery as they suffered significant recall loss.\u003c/p\u003e\u003ch2\u003eConclusion\u003c/h2\u003e \u003cp\u003eWhile LoFreq remains useful for low-frequency variant calling in long-read data, high false discovery rates on either flow cell chemistry make its direct use on long-read data inadvisable. Our proposed quality score adjustment allows for improved detection of sub-consensus variants while reducing false discoveries. 
Though more fine-tuning is required for broader applicability, these findings address the lack of sub-consensus variant calling tools for long-read datasets and provide an adequate workaround for applying LoFreq to nanopore reads, which is crucial for future outbreak surveillance and pathogen evolution studies\u003c/p\u003e","manuscriptTitle":"Sub-consensus haploid variant calling in Long-read sequencing technology","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-06-03 07:49:05","doi":"10.21203/rs.3.rs-6226988/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"a469f5a2-8a04-45d6-9798-4d035161c52a","owner":[],"postedDate":"June 3rd, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2025-07-16T16:08:51+00:00","versionOfRecord":[],"versionCreatedAt":"2025-06-03 07:49:05","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-6226988","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6226988","identity":"rs-6226988","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
