A single-molecule nanopore sequencing platform

doi:10.1101/2024.08.19.608720

2024 · doi:10.1101/2024.08.19.608720

preprint OA: closed CC-BY-NC-ND-4.0

Abstract

Nanopore sequencing, a third-generation sequencing technology, has revolutionized the gene sequencing industry with its advantages of long reads, fast speed, real-time sequencing and analysis, and potential in detecting base modifications. This technology allows researchers to sequence longer DNA fragments in a single read, providing more comprehensive genomic information compared to previous methods. Nanopore sequencing operates on electrical signals generated by a nanopore embedded in a membrane separating two electrolyte-filled chambers. When single-stranded DNA (ssDNA) passes through the nanopore, it creates variations in the current that correspond to different DNA bases. By analyzing these current fluctuations with machine learning algorithms, the DNA sequence can be determined. In this study, we introduced several improvements to nanopore sequencing, including nanopore local chemistry sequencing, novel motor and pore proteins, chip design, and basecalling algorithms. Our new nanopore sequencing platform, CycloneSEQ, demonstrated long-duration sequencing (107 hours) on a single chip with high yield (>50 Gb). In human genomic DNA sequencing, CycloneSEQ produced long reads with an N50 of 33.6 kb and a modal identity of 97.0%. Preliminary findings on human whole-genome de novo assembly, variant calling, metagenomics sequencing, and single-cell RNA sequencing have further highlighted CycloneSEQ’s potential across different areas of genomics.

A PREPRINT - AUGUST 9, 2024. bioRxiv preprint doi: https://doi.org/10.1101/2024.08.19.608720; this version posted August 20, 2024. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

1 Introduction

Nanopore sequencing, which has emerged as a novel sequencing technology in recent years, has revolutionized the gene sequencing industry due to its advantages of long reads, real-time sequencing, portability, and minimal library preparation [1]. This technology enables researchers to sequence longer fragments of DNA in a single read, providing more comprehensive genomic information compared to previous methods. Nanopore sequencing is based on electrical signals [2]. It involves a nanopore, which can be either a protein or a solid-state structure, embedded in a membrane that separates two electrolyte-filled chambers. When a voltage is applied across these chambers, it generates a steady transmembrane current. As molecules, such as single-stranded DNA (ssDNA), enter the nanopore, they obstruct the flow of ions, creating variations in the current known as nanopore signals. The obstruction of the current varies with different DNA bases (adenine, thymine, cytosine, and guanine) as the ssDNA passes through the nanopore. By detecting these current fluctuations and analyzing them using machine learning algorithms, the DNA sequence can be determined [3] [4] [2].

The development of nanopore sequencing can be traced back to the 1980s, encompassing stages of concept validation, technology development, and commercial application. In 1989, scientists George Church, David Deamer and Daniel Branton proposed the concept of using nanopores for DNA sequencing [5]. In 1996, Kasianowicz and colleagues first demonstrated the phenomenon of current blockade by DNA molecules passing through an α-hemolysin protein nanopore, laying the foundation for nanopore sequencing [6]. In 1997, Deamer and Akeson further validated that the current signals produced by single nucleotides passing through a nanopore could be used to distinguish different nucleotides [7]. In 2012, Oxford Nanopore Technologies launched the first portable nanopore sequencer, the MinION, marking the entry of nanopore sequencing into practical application.
In 2016, Oxford Nanopore Technologies introduced the PromethION, a high-throughput nanopore sequencing platform that further enhanced sequencing speed and accuracy. Currently, nanopore sequencing technology is widely applied in genomics, transcriptomics, epigenetics, and clinical diagnostics [5].

In this study, we experimented with several improvements to nanopore sequencing technology, including novel motor and pore proteins, chip design, basecalling algorithms, and a nanopore local chemistry (NLC) sequencing method. We demonstrate that our new nanopore sequencing platform, CycloneSEQ, can perform long-duration (107 h) sequencing on a single chip with high yield (>50 Gb). Whole-genome sequencing of the HG002 cell line produced long reads with an N50 of 33.6 kb and a modal identity of 97.0%. Further data analyses of CycloneSEQ confirmed its capability for high-throughput long-read sequencing and its potential in genomic, metagenomic and epigenomic applications. Preliminary downstream analyses on human whole-genome de novo assembly, variant calling, metagenomics sequencing and single-cell RNA sequencing further demonstrated the potential of CycloneSEQ in various domains of genomics.

2 Results

2.1 Screening of motor and pore proteins

Motor proteins and pore proteins are two key components of a nanopore sequencing system, playing crucial roles in the precise and efficient sequencing of nucleic acids [8]. We selected helicases as our motor proteins for nanopore sequencing due to their inherent ability to unwind dsDNA, a critical function for sequencing applications. Through comprehensive sequence and structural searches within deep-sea metagenomic databases, we identified numerous motor proteins with novel sequences and structures. These newly discovered proteins exhibit low sequence homology (approximately 35%) to known helicases, indicating their unique evolutionary paths and potential for novel functionality.
The structures of these protein-ssDNA complexes predicted by AlphaFold3 [9] show that they possess distinct helicase characteristics and exhibit significant structural novelty compared to known structures (Fig. 1a). Following extensive experimental screening and mutational engineering, we found that most of these motor proteins were well-suited for nanopore sequencing. For example, BCH-X, one of the candidate proteins, demonstrated strong DNA binding and 5’ to 3’ DNA unwinding activity. This activity is essential for maintaining the progression of DNA strands through the nanopore. By screening mutants of BCH-X, we achieved a sequencing speed of approximately 380 bp/s under our sequencing conditions, with high uniformity (Fig. 1b). High speed uniformity is critical in nanopore sequencing because it ensures consistent data output and reduces the likelihood of errors, thereby enhancing the overall accuracy and reliability of the sequencing process.

In tandem with motor proteins, pore proteins are integral to the function of nanopore sequencing systems, as they form the channels through which nucleic acids are translocated and detected. As with the motor proteins, we identified several different families of pore proteins with novel sequences and structures from deep-sea metagenomic databases. These proteins exhibit less than 50% sequence homology to known pore proteins, highlighting their potential for providing new insights and capabilities in sequencing technologies. AlphaFold3 [9] structure prediction and protein preparation results show that they can form a nanoscale channel structure as a homomultimer (Fig. 1a). Using BCP-Y as an example, in pore insertion experiments, BCP-Y can efficiently embed into the membrane and exhibits low-noise open pore currents at different voltages, demonstrating its potential for application in nanopore sequencing (Fig. 1c).
After we screened a large number of mutants (especially in the sensor region) and combined them with the motor protein BCH-X, the pore protein BCP-Y facilitated ssDNA translocation and sequencing with high signal complexity and a good signal-to-noise ratio in the sequencing current signal (Fig. 1d). This ultimately led to a significant improvement in the accuracy of BCP-Y nanopore sequencing. Additionally, novel structural features at the BCP-Y “lip” motif (abundant positive charges) endow it with enhanced nucleic acid capture capabilities, facilitating more efficient DNA threading through the pore and thereby contributing to the overall efficiency and reliability of the sequencing process.

Figure 1: Screening of motor and pore proteins. (a) Schematic diagram of AlphaFold3 structure prediction of candidate helicase (top) and pore (bottom) proteins. (b) Distribution of nanopore sequencing speed of helicase BCH-X. (c) Voltage ramping study of pore protein BCP-Y, with voltages set at 0 V, 0.02 V, 0.04 V, 0.10 V, 0.14 V and 0.18 V. (d) Representative nanopore sequencing current signal of a single DNA strand generated by helicase BCH-X coupled with pore protein BCP-Y. The magnified section displays "current squiggles" caused by different nucleotides translocating through the nanopore.

2.2 Pre-training and fine-tuning of the basecalling algorithm

Existing basecalling models for nanopore sequencing primarily use supervised training, requiring large amounts of labeled sequencing data [10]. This method is costly and involves extensive training cycles.
When data are insufficient, prediction accuracy suffers, so the approach combines high cost with low accuracy [8]. We adopted a pre-training and fine-tuning approach to address these issues (Fig. 11). During pre-training, the model learns from vast amounts of unlabeled data, allowing it to "understand" the data. Fine-tuning then uses pre-trained weights for rapid convergence and enhanced accuracy.

We adapted Facebook’s wav2vec 2.0 [11], a pre-training method for speech tasks, for basecalling. Wav2vec 2.0 uses a large corpus of unlabeled speech data for pre-training and a small amount of labeled data for fine-tuning downstream tasks, demonstrating this approach’s feasibility. Our pre-training utilized a dual loss function: Contrastive Loss to gauge the context network’s predictive capability, and Diversity Loss to enhance the quantization codebooks’ expressiveness. We identified a flaw in wav2vec 2.0’s handling of contrastive loss, where positive and negative samples could map to the same quantized vector. Our improvement penalizes such occurrences, reducing cases where contrastive loss approaches zero and enhancing training effectiveness (see Methods).

After pre-training, the model was fine-tuned for base identification. A linear layer maps the output to categories representing the four nucleobases and a placeholder. We used CTC-Loss for optimization. The fine-tuned model showed marked improvements in error rates and faster convergence compared to models initialized randomly (Fig. 10), highlighting the effectiveness of leveraging pre-trained weights, especially with limited labeled data. Experiments on human and other species data indicated that pre-training enables the model to generalize across species, evidenced by superior performance on species not covered during pre-training. Additionally, non-masked fine-tuning yielded better results, contrary to wav2vec 2.0’s findings [11], suggesting task-specific differences in optimal training strategies. Pre-training on a diverse dataset covering various species reduces error rates and accelerates convergence. Future directions include scaling up the model and training samples for enhanced accuracy and expanding species diversity in training data for broader downstream task support. Additionally, leveraging weakly labeled data, as demonstrated by OpenAI’s Whisper model, could enhance sequencing models’ robustness and utility.

2.3 Nanopore local chemistry (NLC): a novel method for single-molecule sequencing

The local chemical environment within or near the nanopore is a critical determinant of the performance and accuracy of nanopore sequencing technologies. Variations in local ion concentration, pH, and the presence of other molecular species can significantly influence the ionic current, biochemical reactions, and consequently, the detection and discrimination of nucleotides as they translocate through the pore. The local ion concentration, in particular, affects the electrostatic landscape of the nanopore, which can alter the speed and behavior of nucleic acid molecules during sequencing. Understanding and controlling the local chemical conditions are therefore essential for optimizing the sequencing process, reducing error rates, and achieving high-fidelity reads. Ongoing research into the local chemistry of nanopores aims to elucidate the complex interplay between these factors and to develop strategies for maintaining optimal conditions throughout the sequencing run.
Beyond optimizing data quality by manipulating the local chemical environment near the pore, we introduced a novel sequencing method, termed nanopore local chemistry (NLC) sequencing. We first created an asymmetric chemical environment on each side of the nanopore. On the cis side, the sequencing buffer contained no magnesium ions (Mg2+), while the electrolyte on the trans side contained 20 mM magnesium ions (Fig. 2a). DNA helicase requires both magnesium ions and ATP to properly unwind the DNA double helix. Specifically, magnesium ions first bind with ATP to form an Mg-ATP complex. This complex, which is the actual substrate for DNA helicase, can be recognized and utilized by the helicase. When we introduced library molecules (a mixture of dsDNA and helicase) on the cis side, the DNA double helix could not be properly unwound due to the lack of Mg2+ (Fig. 2a). However, after applying the transmembrane potential (U = 180 mV), Mg2+ ions were transported from the trans microwell to the cis side through the nanopore, creating a local concentration gradient of Mg2+ on the cis side near the pore (Fig. 2b). According to the simulation results (Fig. 2c), the Mg2+ concentration was highest near the pore and decayed rapidly with distance. Library molecules captured by the nanopore electric field were pulled near the nanopore entrance and exposed to the Mg2+-rich environment. Magnesium ions near the pore bound with ATP to form an Mg-ATP complex, which was then utilized by the helicase, thus activating sequencing (Fig. 2d).

The ionic current trace produced by NLC sequencing is shown in Fig. 2e and compared with a conventional nanopore sequencing current trace (Fig. 2f). In both cases, we applied a transmembrane potential of 180 mV. The open-pore current (when no DNA is translocating through the pore) for NLC is 206.70 pA, with a standard deviation of 0.51 pA. The open-pore current for conventional nanopore sequencing is 193.91 pA, with a standard deviation of 0.76 pA.
The mean current during sequencing (when DNA is passing through the pore) for NLC is 49.1 pA, with a standard deviation of 7.25 pA and an amplitude of 61.22 pA. The mean current during sequencing for the conventional nanopore sequencing method is 57.17 pA, with a standard deviation of 11.00 pA and an amplitude of 68.05 pA. In summary, both methods exhibit very similar characteristic current values. As shown in Fig. 2e-f, the library DNA molecule (approximately 1.5 kb in total length) contains three repetitive sequences, which are successfully reflected in the current signals of both sequencing methods.

2.4 A nanopore sequencing platform based on improved chip design

The design of the nanopore sequencing chip is a critical factor in the advancement of nanopore sequencing technology, which offers a unique approach to DNA and RNA sequencing by monitoring changes in ionic current as nucleic acids pass through a biological nanopore. This technology relies on a sensor chip designed with arrays of microwells, which support membrane arrays and contain microelectrodes at the bottom of each well. In this setup, biological nanopores are inserted into the membrane arrays that are uniformly formed on the sensor chip. The membranes are self-assembled as bilayers of lipid molecules. Each nanopore is electrically connected to electrodes that precisely measure the ionic current disruptions caused by nucleotide sequences moving through the pore.

The core metrics of a nanopore sequencer are primarily sequencing throughput and accuracy. To enhance sequencing throughput, we employed high-density nanopore arrays on the sequencing chip and optimized the spatial distance between nanopores to maximize the parallel processing of nucleic acid strands. The pitch between microwells is around 200 µm, resulting in a maximum nanopore density of approximately 28.9 per mm² (Fig. 3a).
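
As a quick consistency check on the density quoted above, the following sketch computes the areal density of an ideal hexagonal lattice from its pitch. It assumes the microwells sit on a perfect hexagonal lattice (as the honeycomb arrangement described for Fig. 3a suggests) and ignores chip edges; the function name is illustrative and not part of any CycloneSEQ software.

```python
import math


def hex_lattice_density_per_mm2(pitch_um: float) -> float:
    """Sites per mm^2 for an ideal hexagonal lattice with the given pitch (in µm)."""
    if pitch_um <= 0:
        raise ValueError("pitch must be positive")
    pitch_mm = pitch_um / 1000.0
    # Each lattice site owns a hexagonal cell of area (sqrt(3) / 2) * pitch^2.
    cell_area_mm2 = (math.sqrt(3) / 2.0) * pitch_mm ** 2
    return 1.0 / cell_area_mm2


if __name__ == "__main__":
    print(f"{hex_lattice_density_per_mm2(200.0):.1f} wells per mm^2")
```

For a 200 µm pitch this evaluates to about 28.9 per mm², in line with the stated maximum density.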

Figure 2: The nanopore local chemistry (NLC) sequencing method. (a) Schematic representation with no transmembrane potential applied (U = 0 mV). Magnesium ions are present in the trans microwell, but not in the cis microwell, preventing dsDNA unwinding. (b) When the transmembrane potential is activated (U = 180 mV), magnesium ions translocate from the trans microwell to the cis side through the nanopore, creating a localized concentration of magnesium ions. (c) COMSOL simulation of Mg2+ ion concentration distribution (U = 0 mV). (d) COMSOL simulation of Mg2+ ion concentration distribution when a transmembrane potential is activated (U = 180 mV). (e-f) Sequencing current trace of the nanopore local chemistry (NLC) method (e) and the conventional method (f). The library DNA molecules contain three repetitive sequences, which are successfully reflected in the current signals of both sequencing methods (indicated by red dashed boxes).

Additionally, we engineered the microwell wall structure to maximize the electrolyte buffer volume within each well, leading to an electrochemical system with prolonged stability.

To improve sequencing accuracy, we focused on enhancing the signal-to-noise ratio of the chip. This involved using smaller apertures for the microwells. In our system, minimizing the size of the aperture (with a diameter of ≤76 µm) results in a smaller final membrane area (Fig. 3b). A smaller membrane area leads to lower membrane capacitance (≤20 pF) and reduced noise, which is electrically coupled to the measuring system.
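
To make the link between aperture size and capacitance concrete, the sketch below treats the membrane as a flat disc spanning the full aperture and computes its area, together with the ratio of the two stated upper bounds expressed as a specific capacitance. The flat-disc geometry and the linear scaling of capacitance with area are simplifying assumptions, and the helper names are illustrative; the text gives only the ≤76 µm diameter and ≤20 pF figures.

```python
import math


def disc_area_um2(diameter_um: float) -> float:
    """Area (µm^2) of a flat circular membrane with the given diameter (µm)."""
    radius_um = diameter_um / 2.0
    return math.pi * radius_um ** 2


def implied_specific_capacitance_uF_per_cm2(capacitance_pF: float,
                                            diameter_um: float) -> float:
    """Capacitance per unit area (µF/cm^2), assuming a flat disc membrane."""
    area_cm2 = disc_area_um2(diameter_um) * 1e-8   # 1 µm^2 = 1e-8 cm^2
    capacitance_uF = capacitance_pF * 1e-6         # 1 pF = 1e-6 µF
    return capacitance_uF / area_cm2


if __name__ == "__main__":
    # Upper bounds quoted in the text: 76 µm aperture, 20 pF capacitance.
    print(f"membrane area: {disc_area_um2(76.0):.0f} µm^2")
    ratio = implied_specific_capacitance_uF_per_cm2(20.0, 76.0)
    print(f"ratio of bounds: {ratio:.2f} µF/cm^2")
```

Because area scales with the square of the diameter, halving the aperture diameter would cut the membrane area, and under the linear-scaling assumption the capacitance, by a factor of four.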

Figure 3: Improved chip design for nanopore sequencing. (a) Scanning electron microscope (SEM) micrograph of sensor arrays for nanopore sequencing. Sensor units are spaced approximately 200 µm apart and are arranged in a honeycomb pattern to maximize sensor density. (b) Confocal image of the membrane array formed on the sensor chip. Electrolyte in each microwell is shown in red, while the membrane solution is shown in green. (c) Median open pore current of all effective single nanopores on a sensor chip during a 112-hour sequencing run. (d) Cumulative number of bases sequenced over time in a 107-hour sequencing run on a single flowcell.

These design improvements enabled our sensor chip to support over four days of continuous sequencing with consistent open-pore currents (Fig. 3c). We sequenced the E. coli genome for 107 hours on a single flowcell. This sequencing run cumulatively yielded 53.4 Gb of data that passed the internal basecalling quality criteria (Fig. 3d), demonstrating that a high sequence yield can be achieved by sequencing for a prolonged period of time. Among the 12.6 million reads generated, 12.1 million (95.6%) were mapped to the E. coli reference genome. We note that neither buffer re-flushing nor library washing, both common ways to maintain sequencing speed and accuracy, was employed here.

Based on the novel nanopore sensor chip design described above, we have constructed a nanopore-based single-molecule sequencing platform named CycloneSEQ. As illustrated in Fig. 4b, the flow cell module of CycloneSEQ comprises a microfluidic chip that transports and temporarily stores sample molecules and supports the electrochemical reaction, an arrayed chip containing nanopores, a signal acquisition application-specific integrated circuit (ASIC), and a printed circuit board with surface-mounted components. Cell samples to be sequenced are processed through lysis, nucleic acid extraction, and other methods to extract long-chain DNA molecules. These DNA molecules are then subjected to DNA repair and adapter ligation. Subsequently, we mount flow cells in the socket of the CycloneSEQ sequencer and perform a chip self-check. After the self-check, the system reports whether the chip meets the quality criteria and the number of effective nanopores on each individual chip. We then add the sequencing reagents and the library molecules to be sequenced into the micro-port of the chip in a specific order and initiate the sequencing process through the software. Owing to the characteristics of nanopore single-molecule sequencing, as soon as sequencing starts, the high-performance workstation paired with the sequencer can begin basecalling. The CycloneSEQ sequencer thus supports sequencing and basecalling simultaneously in real time.

Figure 4: A single-molecule nanopore sequencing platform. (a) Schematic representation of a sensor unit constructed with an insulating membrane containing an inserted nanopore, along with cis and trans chambers and corresponding electrodes.
(b) Exploded view diagram of a flow cell. (c) Library preparation process for nanopore sequencing. (d) CycloneSEQ nanopore sequencer. (e) High-performance workstation and operating software for the CycloneSEQ nanopore sequencing platform.

2.5 Error profile of CycloneSEQ

To systematically evaluate the performance of the CycloneSEQ platform, we generated whole-genome sequencing (WGS) data for the thoroughly characterized Genome in a Bottle (GIAB) consortium HG002 lymphoid cell line. Overall, the read lengths of the HG002 WGS data were distributed over a broad range from 50 kb, with a mean read length of 19.2 kb. The N50 value, defined as the length of the longest read that, together with longer reads, contains over 50% of all sequenced bases, was 33.6 kb (Fig. 5a). The mean base quality values were predominantly in the range between 12 and 16, with a small cluster of short reads that have lower quality scores (Fig. 5b). The distribution of quality scores was fairly consistent across relative positions within each read, and dropped only slightly near the 5’ and 3’ ends (Fig. 5c).

We aligned the sequencing reads to the diploid HG002 v1.0.1 reference genome in a haplotype-specific manner (see Methods), and analysed the frequency and types of sequencing errors based on the alignment pattern. After removing reads with mean base quality scores below 10, which accounted for less than 10% of total bases, the accuracy of most reads ranged between 93% and 99%, with a modal value at ~97% (Fig. 5d). The overall per-base error rate was estimated to be 3.94%, with deletions being the most frequent type of error (2.34%), followed by mismatches (0.83%) and insertions (0.77%) (Fig. 5e). For mismatch errors, A-to-G and G-to-A errors were significantly more common than other types of base substitutions, each contributing more than 0.2% to the error rate (Fig. 5f).
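
The quality and error figures above can be related through the standard Phred scale, Q = -10 log10(p), where p is the error probability. The sketch below converts in both directions and checks that the three reported error components add up to the overall rate; the function and variable names are illustrative.

```python
import math


def phred_to_error(q: float) -> float:
    """Error probability corresponding to a Phred quality score."""
    return 10.0 ** (-q / 10.0)


def error_to_phred(p: float) -> float:
    """Phred quality score corresponding to an error probability in (0, 1]."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p)


# Error components reported in the text, in percent.
components_pct = {"deletion": 2.34, "mismatch": 0.83, "insertion": 0.77}
overall_pct = sum(components_pct.values())

if __name__ == "__main__":
    print(f"overall error: {overall_pct:.2f}%")
    print(f"equivalent Phred score: {error_to_phred(overall_pct / 100):.1f}")
    for q in (12, 16):
        print(f"Q{q} corresponds to {100 * phred_to_error(q):.2f}% error")
```

An overall error rate of 3.94% corresponds to roughly Q14, which falls inside the 12 to 16 range reported for mean base quality.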
The enrichment of A-to-G and G-to-A errors was presumably due to the structural similarity between adenine and guanine nucleotides, which leads to similar current signals [12]. Among the insertion and deletion errors, most affected either one or two bases, with fewer than 10% affecting three or more bases (Fig. 5g). Comparing per-read error rates estimated by reference alignment with reported mean base quality scores, we found that the reported quality scores were remarkably close to the actual sequencing accuracy (Fig. 5h). In contrast, systematic biases in quality scores often exist in other long-read sequencing platforms, which may affect downstream applications that rely on quality scores, such as variant calling [12].

We further analyzed the consensus accuracy of CycloneSEQ reads based on the de novo assemblies of the E. coli genome. The Phred-scale quality value (QV) increased with coverage depth and reached 40 (i.e. an error rate of 0.01%) at 40× coverage (Fig. 5i), confirming the possibility of acquiring highly accurate consensus sequences using CycloneSEQ data alone.

Figure 5: Whole-genome sequencing of the HG002 cell line. (a) The distribution of read lengths. The cumulative fraction of bases (y axis) contained in reads longer than a given threshold (x axis) is shown. Gray dotted lines denote N50, which represents the length of the longest read that, together with longer reads, contains over 50% of all sequenced bases. (b) Joint distribution of read lengths (x axis) and read quality scores (y axis). (c) Mean base quality (y axis) along different relative positions of the read (x axis). (d) Distribution of per-read identity. Per-read identity is calculated by aligning reads to the diploid HG002 v1.0.1 reference genome. (e) The overall error rate and the contributions of insertion, deletion and mismatch errors. (f) Contribution of each type of mismatch error to the overall error rate. (g) Distribution of indel sizes for insertion and deletion errors. (h) Base quality calibration curve showing the relationship between per-read error rate (y axis) and per-read mean base quality (x axis). The blue line with dots represents observed results in CycloneSEQ data and the black line represents the expected error rates based on theoretical calculations. (i) Consensus accuracy of CycloneSEQ reads (y axis) plotted against mean coverage depth (x axis). Panels (d)-(g) are based on reads with mean base quality ≥ 10.

2.6 Variant calling and de novo assembly of the HG002 genome

Variant calling and de novo assembly are among the most important applications of long-read sequencing in genomics research. For variant calling, long-read sequencing provides longer sequences that can resolve complex structural variants that are challenging for short-read sequencing. Longer reads also reduce ambiguity in read alignment, eliminating potential alignment errors in complex genomic regions. For de novo assembly, longer reads cover larger genomic regions in single reads, reducing computational complexity, and are more likely to span large repetitive elements, improving assembly contiguity. We observed that the coverage depth of CycloneSEQ reads was highly uniform across the human genome (except in the repeat-rich centromere regions prone to alignment errors), providing solid support for both variant calling and de novo assembly applications (Fig. 13). Here, we generated variant calling and haplotype-resolved de novo assembly results for the HG002 genome using CycloneSEQ data.
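
Since read-length N50 (Fig. 5a) and related contiguity statistics recur in the read and assembly evaluations, a minimal sketch of the computation may help. It follows the N50 definition given above, using the common convention that the threshold is reached once the cumulative length is at least half of the total; the function name and the toy inputs are illustrative and not taken from the CycloneSEQ data.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first (longest) length where the cumulative sum
        # reaches half of the total number of bases.
        if 2 * running >= total:
            return length
    return ordered[-1]


if __name__ == "__main__":
    toy_read_lengths = [2, 2, 2, 3, 3, 4, 8, 8]  # arbitrary example values
    print(n50(toy_read_lengths))
```

The same routine applies to contig lengths; alignment-aware variants such as NGA50 additionally require splitting contigs at misassemblies and normalizing by the reference genome size, which this sketch does not attempt.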
Variant calling was performed using our in-house bioinformatics tools LRAPmut and LRAPsv (see Methods) and compared against the Genome in a Bottle (GIAB) HG002 benchmark [13]. For single-nucleotide polymorphisms (SNPs), we achieved a precision of 0.992 and a recall of 0.990 at a sequencing depth of 30× (Fig. 6). Small insertions and deletions (indels) present a challenge for variant calling due to their similarity to the predominant sequencing errors in CycloneSEQ data (Fig. 6). Utilizing variant imputation based on the 1000 Genomes reference panel (see Methods), we attained a precision of 0.955 and a recall of 0.890 at 30× coverage (Fig. 6). For structural variants (SVs), we observed that increased sequencing coverage had a more pronounced effect on improving precision and recall compared to SNPs and indels. Specifically, we achieved a precision of 0.948 and a recall of 0.954 at 40× coverage (Fig. 6).

Figure 6: Variant calling of HG002. Precision (a), recall (b) and F1-score (c) statistics of HG002 variant calling results based on CycloneSEQ data. The GIAB HG002 variant benchmark dataset was used as the ground truth.

The haplotype-resolved whole-genome de novo assembly for HG002 was generated using the Shasta assembler [14]. This assembly was evaluated against the Telomere-to-Telomere (T2T) consortium HG002 reference sequence (see Methods). We found that most chromosomes were assembled with a high level of completeness, with only the complex, repeat-rich centromere regions missing from the assembly (Fig. 7a).
The short arms of the five acrocentric chromosomes (13, 14, 15, 21, and 22) were assembled with fragmented contigs due to the presence of satellite repeats and high sequence similarity among them. Other parts of the genome were mostly assembled with long, haplotype-resolved contigs, except the two sex chromosomes, likely due to the limited ability of the current Shasta implementation to handle the haploidy of the X and Y chromosomes and the sequence homology between them. The NGA50 value of the assembly was 23.8 Mb (Fig. 7b), indicating that 50% of the genome was assembled with contigs of at least 23.8 Mb in length. The overall error rate of the assembly was estimated to be 0.12%, with deletions contributing the most to the overall error rate, followed by insertions and mismatch errors (Fig. 7c). Further developments in read lengths, accuracy and assembly methods will likely improve assembly contiguity and quality in the future.

2.7 Metagenome sequencing of mock sample

Metagenomic sequencing is an important application of nanopore sequencing. To evaluate the performance of CycloneSEQ in assembling microorganism genomes and estimating their relative abundance from a mixed sample, we generated 7.7 Gb of sequencing data from the ZymoBIOMICS Gut Microbiome Standard mock metagenome sample, which contained a mixture of 17 microorganism species of predefined abundances. By aligning CycloneSEQ reads to the corresponding reference genomes, we were able to accurately quantify the relative DNA abundance of 15 out of the 17 species in the sample based on sequencing depths, including both high-GC and low-GC species (Fig. 8a). Only the two least abundant species could not be identified from sequencing data: Enterococcus faecalis (abundance 0.001%) and Clostridium perfringens (abundance 0.0001%). In addition to reference-based quantification, we performed de novo assembly using the Flye assembler [15] based on CycloneSEQ data.
Among the 17 species in the sample, ten species had relative DNA and genome copy abundances above 1%, all of which were assembled with high levels of genome completeness, with nine out of ten genomes successfully circularized (Fig. 8b and Table 1). Based on the assembled genomes, we were also able to accurately quantify the copy numbers and sequence lengths of the 16S, 5S and 23S rRNA of these ten species (Tables 2, 3, and 4).

Figure 7: Haplotype-resolved whole-genome assembly for HG002. (a) Alignment of the assembled sequences to the T2T HG002 reference sequence. Each row represents a chromosome. Haplotype-resolved contigs and unresolved unitigs are shown in blue and gray, respectively. (b) Assembly contiguity represented as the fraction of genome assembled (y axis) for each minimum contig length (x axis). (c) Overall error rate of the assembly and error rates for deletion, insertion and mismatch errors.

2.8 Single-cell RNA sequencing of GM12878 cell line

Long-read sequencing in single-cell RNA sequencing (scRNA-seq) offers significant advantages, including the ability to capture full-length transcripts, which provides a more comprehensive view of isoform diversity, alternative splicing, and gene fusion events. This detailed transcriptome profiling enhances the understanding of cellular heterogeneity and complex gene regulatory networks. Here, we present preliminary results on applying the CycloneSEQ platform to scRNA-seq using mRNA from the GM12878 cell line, and compare the results with scRNA-seq data generated from the same cDNA libraries using BGI DNBSEQ short-read sequencing.
As most transcripts were less than 10 kb in length, it was possible to obtain full-length coverage from single CycloneSEQ long reads. We found that the mean coverage depth was slightly higher near the 3’ end of each gene and lower near the 5’ end (Fig. 9a), likely due to incomplete reverse transcription and/or degradation of mRNA near 5’ ends. Despite this, the overall coverage depths were highly uniform from the 5’ to the 3’ end of each transcript (Fig. 9a), providing solid support for the discovery of potential novel isoforms. The total number of genes detected in each cell by CycloneSEQ ranged between 300 and 4,000 and showed a strong linear correlation (R² = 0.95) with that of DNBSEQ data. Gene expression quantification results from CycloneSEQ were also highly consistent (R² = 0.93) with those of DNBSEQ short-read sequencing, suggesting that our CycloneSEQ platform was capable of accurate transcript quantification in single-cell sequencing.

3 Discussions

Nanopore sequencing has revolutionized genomics by enabling the real-time analysis of nucleic acids without the need for amplification or chemical labeling. Despite its transformative impact, several limitations, such as relatively high error rate and low throughput, still hinder its broader adoption and effectiveness. In this study, we developed several novel components of nanopore sequencing technology, including motor and pore proteins, chip design, and basecalling algorithms. We show that our new sequencing platform, CycloneSEQ, is capable of generating high-throughput long-read sequencing data useful for downstream applications.

Figure 8: Metagenome sequencing and methylation prediction. (a) Correlation between expected (x axis) and observed (y axis) relative DNA abundance in the mock metagenome sample. Each dot represents a microorganism strain. Dot colors represent the GC content of the genome of the assayed strains. (b) Assembled genome fraction of ten species in the mock metagenome sample. Colors represent the relative DNA abundances.

Figure 9: Single-cell RNA sequencing of GM12878 cell line. (a) Normalized coverage depths (y axis) along different relative positions of the gene body (x axis). Transcripts shorter than 100 bp are not included.
(b) Correlation between the number of genes detected in each cell by the DNBSEQ (x axis) and CycloneSEQ (y axis) platforms. Each dot represents a cell. R represents the Pearson correlation coefficient. (c) Correlation between pseudobulk gene expression levels measured by the DNBSEQ (x axis) and CycloneSEQ (y axis) platforms. Each dot represents a gene detected by both platforms. R represents the Pearson correlation coefficient.

One significant limitation of current nanopore sequencing technology is its accuracy. Although improvements have been made, the error rate remains higher than that of short-read sequencing technologies. This issue can be addressed by screening for novel motor and pore proteins and developing more advanced basecalling algorithms that could enhance the precision and speed of nanopore sequencing [16]. Future development and integration of non-protein pores, such as synthetic nanopores, may provide more stable and consistent results, reducing the error rates and increasing the reliability of the technology [17].

Another area of improvement lies in the membranes used for nanopore sequencing. Novel membranes with enhanced stability and reduced noise could significantly improve the quality of the sequencing data [18]. Advances in materials science could lead to the development of membranes that are more resilient to the harsh conditions often encountered during sequencing, thereby extending their lifespan and efficiency.

Sequencing of modified DNA and RNA bases remains a challenge for nanopore technology. Modified bases play crucial roles in various biological processes, and accurate detection is essential for understanding epigenetics and other regulatory mechanisms. Enhancing the capability of nanopore sequencing to accurately read modified bases would be a significant breakthrough, potentially achieved through the use of advanced bioinformatics algorithms and improved pore chemistry [19].
The potential of nanopore technology extends beyond nucleic acids to the sequencing of amino acids, peptides, and proteins. Protein sequencing using nanopores could revolutionize proteomics, enabling the direct analysis of proteins and their post-translational modifications. Although still in its infancy, this application holds promise for significant advancements in understanding protein structure and function [20].

Clinical applications of nanopore sequencing are vast, ranging from rapid pathogen identification to point-of-care testing. However, the high costs and limited throughput of current systems restrict their widespread use in clinical settings. Reducing costs and improving throughput are critical for the adoption of nanopore sequencing in routine clinical practice [8][21]. Innovations in sequencing chemistry, automation, and data analysis could make nanopore sequencing more accessible and practical for clinical diagnostics. Additionally, nanopore sequencing holds great potential for population cohort studies, providing insights into genetic diversity and disease susceptibility on a large scale. The ability to sequence entire genomes quickly and cost-effectively, and to accurately characterize structural variation, could transform epidemiological studies and public health initiatives. Continued advancements in reducing costs and increasing throughput are essential to fully realize the potential of nanopore sequencing in both clinical and population cohort applications.
4 Methods

4.1 Development of Motor Proteins and Pore Proteins

All protein sequences were derived from a deep-sea metagenomic database, and all mutant designs were based on AlphaFold3 [9] structure predictions. The proteins were overexpressed in BL21(DE3) or similar strains and purified by affinity chromatography, ion exchange chromatography, and size-exclusion chromatography. DNA libraries were prepared from the motor protein BCH-X mutants, Y-shaped adaptors and input DNA. After embedding the pore protein BCP-Y mutants into membranes, sequencing buffer and the test libraries were added, and sequencing was performed at 0.18 V and 30 °C to collect current signals. The sequencing speeds were obtained by dividing the length of the DNA of specific sequence by its translocation time. The open-pore currents of BCP-Y mutants were recorded at different voltages (0 V, 0.02 V, 0.04 V, 0.10 V, 0.14 V, and 0.18 V).

4.2 Training and validation of basecalling models

4.2.1 Model architecture

Our model consists of three key components (Fig. 10): (1) Feature Encoder: a multi-layer 1D convolution network encodes the raw signals into features, capturing relevant information from the sequencing data. It includes multiple blocks, each with a 1D convolution and GELU activation for downsampling and extracting local patterns. (2) Quantization Module: the feature encoder output is discretized using product quantization into a finite representation space, enhancing the model’s self-supervised learning capability. (3) Mask and Context Networks: the feature encoder outputs are processed by a mask module before being fed into the context network, which is composed of multiple transformer layers. This network approximates relative positional encoding and enhances contextual understanding.

4.2.2 Improved definition of the Contrastive Loss

The objective function of self-supervised pre-training comprises two components: Contrastive Loss and Diversity Loss.
These components optimize different aspects of the training process to improve model performance. Contrastive Loss measures the context network’s ability to identify the correct quantized output at masked positions. The Transformer’s output at time t (c_t) should match the quantization module’s output at time t (q_t). Additionally, c_t should differ from the outputs at K other randomly selected positions (distractors) in the sequence. The original Contrastive Loss is defined as:

$$L_m = -\log \frac{\exp(\mathrm{sim}(c_t, q_t)/k)}{\sum_{q' \sim Q_t} \exp(\mathrm{sim}(c_t, q')/k)}$$

where $Q_t = \{q_t, q'_1, q'_2, \ldots, q'_i, \ldots, q'_K\}$, sim() is a predefined similarity function, $q'_i$ is the i-th interference (distractor) vector, and k is a predefined temperature parameter.

We identified an issue where positive and negative samples could map to the same quantized vector, undermining performance. Let q_t be the positive sample’s quantized vector at time t, and divide the quantized vectors of the negative samples into q′, which are not equal to q_t, and p′, which are equal to q_t:

$$L_m = -\log \frac{\exp(\mathrm{sim}(c_t, q_t)/k)}{\exp(\mathrm{sim}(c_t, q_t)/k) + \sum_{q'} \exp(\mathrm{sim}(c_t, q')/k) + \sum_{p'} \exp(\mathrm{sim}(c_t, p')/k)}$$

Negative samples that map to the same quantized vector as the positive sample undermine the contrastive loss. In wav2vec 2.0 [11], such negatives do not contribute to the contrastive loss, making the last denominator term zero. This issue halts the training process when the contrastive loss becomes zero, preventing gradient updates.
We introduced a penalty count(p′) · l for such cases, with l as the penalty coefficient (l = 0.01):

$$L_m = -\log \frac{\exp(\mathrm{sim}(c_t, q_t)/k)}{\exp(\mathrm{sim}(c_t, q_t)/k) + \sum_{q'} \exp(\mathrm{sim}(c_t, q')/k) + \mathrm{count}(p') \cdot l}$$

Our method applies penalties to both Contrastive Loss and Diversity Loss, ensuring negative samples contribute meaningfully. This improvement reduces instances where the contrastive loss reaches zero, enhancing training robustness and effectiveness.

4.2.3 Pre-training and fine-tuning

The pre-training data originated from the BGI Cyclone nanopore sequencing platform and included sequences from various species: 80% human genome and 20% rice, Saccharomyces cerevisiae, and Bacillus subtilis. Signals were sampled at 5 kHz, filtered to remove abnormal signals, and divided into chunks of 5,000 signal points, yielding 300 million pre-training chunks. We used a masking probability of 15% and a mask length of 5. The feature encoder employed a 3-layer 1D convolution with strides of (1, 1, 5), kernel sizes of (5, 5, 19), and dimensions of (4, 16, 768). The context network used a 10-layer transformer with 512 hidden units and 8 heads. Training ran on 64 NVIDIA A100-PCIE-40GB GPUs for 3.5 days with a batch size of 1024, using the Adam optimizer with a learning rate of 0.005 and linear decay. We set α = 0.1 for the diversity loss, K = 100 for negative examples, and G = 2 and V = 320 for the quantization module. Fig. 12 shows training and validation loss over iterations.

Fine-tuning used data from the BGI Cyclone sequencing platform. An existing basecaller (Bonito) predicted base sequences, which were compared with a standard library; those with coverage >95% were selected as labels. After pre-training, the model was fine-tuned on 40 million annotated human samples, with 10,000 evaluation samples, for 1.3 days on 16 NVIDIA A100-PCIE-40GB GPUs with a batch size of 256. Using the Adam optimizer with a warm-up over 200 steps followed by linear decay, we experimented with learning rates of 0.0005 and 0.001 and report the best results. Models were evaluated using the median error rate.
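The penalized contrastive loss of Section 4.2.2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' training code: cosine similarity stands in for the unspecified sim(), the temperature value 0.1 is a placeholder because the text does not report k, and the function names and list-based vectors are hypothetical. Only the penalty coefficient l = 0.01 comes from the text, and the corresponding Diversity Loss penalty is not shown.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity; an assumed stand-in for the paper's sim()."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def penalized_contrastive_loss(c_t, q_t, distractors, temperature=0.1, penalty=0.01):
    """Contrastive loss with a fixed penalty for colliding distractors.

    c_t: context vector at time t; q_t: positive quantized vector;
    distractors: quantized vectors sampled from other positions.
    temperature (k in the text) is an assumed placeholder value;
    penalty is l = 0.01 as stated in the text.
    """
    pos = math.exp(cosine_sim(c_t, q_t) / temperature)
    neg_sum = 0.0   # sum over q': distractors that differ from q_t
    collisions = 0  # count(p'): distractors identical to q_t
    for q in distractors:
        if list(q) == list(q_t):
            collisions += 1
        else:
            neg_sum += math.exp(cosine_sim(c_t, q) / temperature)
    denom = pos + neg_sum + collisions * penalty
    # -log(pos / denom), written as a difference of logs
    return math.log(denom) - math.log(pos)


# Every distractor collides with the positive: the loss stays above zero.
loss_all_collide = penalized_contrastive_loss([1.0, 0.0], [1.0, 0.0], [[1.0, 0.0]] * 3)
# One distinct distractor contributes an ordinary negative term.
loss_one_negative = penalized_contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
```

In the all-collision case, dropping the colliding distractors entirely would leave the denominator equal to the numerator and the loss exactly zero; the count(p′) · l term keeps it slightly positive.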
4.3 Nanopore local chemistry (NLC) sequencing

We established a model in COMSOL to simulate and analyze the concentration distribution of Mg2+ in the cis and trans chambers on both sides of the nanopore. The system model parameters were set as follows: the nanopore opening diameter was 2 nm, the length was 10 nm, and the width and height of both the cis and trans chambers were 100 nm and 52.5 nm, respectively. The cis chamber was filled with a 0.5 M KCl aqueous solution, and the trans chamber was filled with a 0.5 M KCl and 20 mM MgCl2 aqueous solution. In this model, we considered two main transport mechanisms: ion diffusion and ion electrophoresis. Because the nanopore interior was assumed to be neutral, with no charge settings, the effect of electroosmotic flow was neglected. Different trans-membrane potentials were applied to the system, and the steady-state Mg2+ concentration gradient distribution was calculated via COMSOL. As shown in Fig. 2c, the lower chamber was subjected to a boundary condition potential of 0 V and the upper chamber was grounded at 0 V, resulting in no applied potential difference across the nanopore. Under this condition, there was no significant ion electrophoresis guided by an electric field, and the concentration gradient distribution of Mg2+ was primarily determined by free diffusion of ions. As shown in Fig. 2d, when the lower chamber was subjected to a boundary condition potential of 0.18 V and the upper chamber was grounded at 0 V, an applied potential difference of approximately 0.18 V was established across the nanopore.

4.4 Nanopore sensor chip

The nanopore sensor chip was fabricated using standard semiconductor manufacturing techniques, including photolithography, thin film deposition, and patterning. The confocal image was captured using a Nikon C2+ microscope.
The membrane solution was mixed with 2 µM BODIPY PL fluorescent dye, which has an excitation peak at 502 nm and an emission peak at 511 nm. The electrolyte in the microwell was stained with 0.02% wt Sulforhodamine B, which has an excitation peak at 559 nm and an emission peak at 577 nm.

4.5 DNA Extraction and Library Preparation

High molecular weight DNA extraction and optional length sorting were performed on the sample to obtain DNA suitable for long-read nanopore sequencing. The quality of the extracted DNA was assessed by measuring the A260/A280 ratio, which was maintained between 1.8 and 2.0. End repair reagents were used to repair the ends of the DNA fragments and add deoxyadenosine (dA) tails, which facilitated subsequent adaptor ligation. Oligos were annealed in TE buffer to form Y-shaped adaptors. The sequencing library was prepared from motor protein BCH-X, Y-shaped adaptors and input DNA.

4.6 Sequencing error analyses

For sequencing error analyses, low-speed sequencing mode was used to generate the HG002 whole genome sequencing data on the CycloneSEQ platform. We first randomly sampled 20,000 reads from the sequencing data, which were aligned to both haplotypes of the diploid reference genome HG002 v1.0.1 using Minimap2 [22]. Insertion, deletion and mismatch errors were identified from the CIGAR string of the resulting alignments. Error rates were calculated by dividing the total length of errors by the total alignment length.
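The error-rate and read-quality calculations of Section 4.6 can be sketched as follows. This is an illustrative sketch, not the authors' pipeline, and it rests on assumptions not stated in the text: the CIGAR string is taken to use the extended `=`/`X` operators (which minimap2 emits with `--eqx`), the alignment length is taken to be the sum of match, mismatch, insertion and deletion lengths with clipped bases ignored, and both function names are hypothetical.

```python
import math
import re

# One CIGAR element: a length followed by a SAM operator.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_error_rates(cigar):
    """Return mismatch, insertion and deletion rates from an =/X CIGAR string."""
    counts = {"=": 0, "X": 0, "I": 0, "D": 0}
    for length, op in CIGAR_OP.findall(cigar):
        if op == "M":
            # 'M' mixes matches and mismatches, so mismatches cannot be counted.
            raise ValueError("CIGAR must use =/X operators instead of M")
        if op in counts:  # clipping, padding and skips are ignored
            counts[op] += int(length)
    aligned = sum(counts.values())  # assumed definition of alignment length
    if aligned == 0:
        raise ValueError("CIGAR contains no aligned bases")
    return {
        "mismatch": counts["X"] / aligned,
        "insertion": counts["I"] / aligned,
        "deletion": counts["D"] / aligned,
    }


def mean_read_quality(phred_scores):
    """Average per-base error probabilities, then convert back to Phred scale."""
    error_probs = [10 ** (-q / 10) for q in phred_scores]
    mean_error = sum(error_probs) / len(error_probs)
    return -10 * math.log10(mean_error)


rates = cigar_error_rates("5S10=1X5=2I3=1D4=")
read_q = mean_read_quality([10, 20, 30])
```

Averaging in probability space rather than averaging the Phred values directly means that a few low-quality bases dominate the result: the three example scores have an arithmetic mean of 20, but the read quality computed this way is close to 14.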
To compute read quality for each read, we first converted Phred-scale base quality values to error rates, and then calculated the average error rate of all bases in each read. The resulting average error rate was finally converted back to Phred scale to represent the read quality of each read. Visualization was performed using Matplotlib [23] and Seaborn [24] in Python.

4.7 Variant calling

We used minimap2 version 2.24-r1164-dirty to align reads to a reference genome. For small variant calling, we used LRAPmut version v1.0 (https://github.com/Roick-Leo/LRAPmut) with CycloneSEQ data, and for structural variant calling, we used LRAPsv version v1.0 (https://github.com/Roick-Leo/LRAPsv). Haplotype imputation based on the 1000 Genomes reference panel was applied to improve the performance of indel variant calling. HG002 data was removed from the reference panel prior to imputation. To benchmark the variant calls, we assessed the small variant calls against the GIAB truth set using hap.py v0.3.15 (). Additionally, we used Truvari v4.2.2 [25] to produce performance metrics by comparing the predicted structural variants with the benchmark SVs.

4.8 De novo assembly of the HG002 genome

The haplotype-resolved de novo assembly of the HG002 genome was generated using Shasta v0.11.1 [14], based on CycloneSEQ reads at approximately 80× genomic coverage. The following command was used:

    shasta-Linux-0.11.1 \
        --input \
        --assemblyDirectory \
        --config Nanopore-Phased-May2022 \
        --threads 16 \
        --memoryMode filesystem \
        --memoryBacking disk

Assembly evaluation was performed by aligning the assembly to the T2T HG002 reference sequence (https://github.com/marbl/HG002). Each assembly contig was separately aligned to the paternal and maternal haplotypes of the HG002 reference sequence. The alignment with the largest number of matched bases was selected for evaluation.
Deletion, insertion, and mismatch error rates were calculated based on the CIGAR string of the resulting alignment.

4.9 Assembly and quantification of a mock metagenome sample

The ZymoBIOMICS Gut Microbiome Standard (Catalog No. D6331) comprises 18 bacterial strains, 2 fungal strains, and 1 archaeal strain, with a theoretical genomic DNA abundance ranging from 0.0001% to 14%. Reference genomes, along with 16S and 18S rRNA genes, are accessible online [26]. DNA was extracted from the mock samples using the MGIEasy Stool Microbiome DNA Extraction Kit according to the manufacturer’s protocols. Chopper 0.6.0 (https://github.com/wdecoster/chopper) was used to filter out reads with a quality score lower than Q10 or a length shorter than 1,000 base pairs using the following parameters: -q 10 --minlength 1000. The reads were then assembled using Flye 2.8.3-b1695 with the following parameters: --meta --nano-raw. SemiBin2 2.1.0 was applied to generate bins for the mock community using the following parameters: single_easy_bin --environment global --sequencing-type long_read. Assembly quality was assessed using Quast v5.0.2. The rRNA genes were then predicted using barrnap 0.9 with the following parameters: --kingdom -reject 0.01 -evalue 1e-3. tRNA was predicted by tRNAscan-SE 2.0.12 using the following parameter: -B. For quantification and evaluation, plasmid sequences were removed from the reference sequences. Relative DNA abundances of each strain were estimated by aligning reads to the combined reference sequences of all strains, and summarising the total number of aligned bases for each strain.
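The quantification step at the end of Section 4.9 amounts to normalizing per-strain aligned-base totals. A minimal sketch, assuming those totals have already been extracted from the alignments; the function name, strain labels and base counts below are hypothetical and are not data from the paper:

```python
def relative_dna_abundance(aligned_bases):
    """Convert per-strain aligned-base totals into relative DNA abundances (%)."""
    total = sum(aligned_bases.values())
    if total <= 0:
        raise ValueError("no aligned bases to normalize")
    return {strain: 100.0 * bases / total for strain, bases in aligned_bases.items()}


# Hypothetical totals of aligned bases per strain.
example = relative_dna_abundance(
    {"strain_A": 7_500_000, "strain_B": 2_000_000, "strain_C": 500_000}
)
```

Because the totals are in bases rather than reads, longer genomes contribute proportionally more DNA; converting to genome copy abundance would additionally require dividing each total by the strain's genome size before normalizing.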
4.10 Single-cell RNA sequencing of GM12878 cell line

For DNBSEQ sequencing, sample preparation, cell isolation, mRNA extraction, reverse transcription, library preparation, and sequencing were performed according to the manufacturer’s guidelines of the DNASEQ C4 platform. For CycloneSEQ sequencing, the same procedure was followed with the addition of PCR amplification of cDNA libraries. The amplified libraries were sequenced on two CycloneSEQ chips. scRNA-seq data analyses were performed using an in-house pipeline based on GENCODE GRCh38 transcript annotations.

5 Acknowledgements

This work was supported by the "Pioneer" and "Leading Goose" R&D Program of Zhejiang (2024C03004), the Shenzhen Science and Technology Program (KQTD20221101093603011), and the National Key Research and Development Program of China (2022YFF1202103).

6 Competing interests

All authors are employees of the BGI Group. The authors have submitted patent applications related to the methods or results presented in this manuscript.

References

[1] Erwin L Van Dijk, Yan Jaszczyszyn, Delphine Naquin, and Claude Thermes. The third revolution in sequencing technology. Trends in Genetics, 34(9):666–681, 2018.
[2] David Wendell, Peng Jing, Jia Geng, Varuni Subramaniam, Tae Jin Lee, Carlo Montemagno, and Peixuan Guo. Translocation of double-stranded DNA through membrane-adapted phi29 motor protein nanopores. Nature Nanotechnology, 4(11):765–772, 2009.
[3] Elizabeth A Manrao, Ian M Derrington, Andrew H Laszlo, Kyle W Langford, Matthew K Hopper, Nathaniel Gillgren, Mikhail Pavlenok, Michael Niederweis, and Jens H Gundlach. Reading DNA at single-nucleotide resolution with a mutant MspA nanopore and phi29 DNA polymerase. Nature Biotechnology, 30(4):349–353, 2012.
[4] Gerald M Cherf, Kate R Lieberman, Hytham Rashid, Christopher E Lam, Kevin Karplus, and Mark Akeson. Automated forward and reverse ratcheting of DNA in a nanopore at 5-Å precision. Nature Biotechnology, 30(4):344–348, 2012.
[5] David Deamer, Mark Akeson, and Daniel Branton. Three decades of nanopore sequencing. Nature Biotechnology, 34(5):518–524, 2016.
[6] John J Kasianowicz, Eric Brandin, Daniel Branton, and David W Deamer. Characterization of individual polynucleotide molecules using a membrane channel. Proceedings of the National Academy of Sciences, 93(24):13770–13773, 1996.
[7] David W Deamer and Mark Akeson. Nanopores and nucleic acids: prospects for ultrarapid sequencing. Trends in Biotechnology, 18(4):147–151, 2000.
[8] Morgan MacKenzie and Christos Argyropoulos. An introduction to nanopore sequencing: past, present, and future considerations. Micromachines, 14(2):459, 2023.
[9] Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, Lindsay Willmore, Andrew J Ballard, Joshua Bambrick, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, pages 1–3, 2024.
[10] Marc Pagès-Gallego and Jeroen de Ridder. Comprehensive benchmark and architectural analysis of deep learning models for nanopore sequencing basecalling. Genome Biology, 24(1):71, 2023.
[11] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460, 2020.
[12] Clara Delahaye and Jacques Nicolas. Sequencing DNA with nanopores: Troubles and biases. PLoS ONE, 16(10):e0257521, 2021.
[13] Peter Krusche, Len Trigg, Paul C Boutros, Christopher E Mason, Francisco M De La Vega, Benjamin L Moore, Mar Gonzalez-Porta, Michael A Eberle, Zivana Tezak, Samir Lababidi, et al. Best practices for benchmarking germline small-variant calls in human genomes. Nature Biotechnology, 37(5):555–560, 2019.
[14] Kishwar Shafin, Trevor Pesout, Ryan Lorig-Roach, Marina Haukness, Hugh E Olsen, Colleen Bosworth, Joel Armstrong, Kristof Tigyi, Nicholas Maurer, Sergey Koren, et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nature Biotechnology, 38(9):1044–1053, 2020.
[15] Mikhail Kolmogorov, Derek M Bickhart, Bahar Behsaz, Alexey Gurevich, Mikhail Rayko, Sung Bong Shin, Kristen Kuhn, Jeffrey Yuan, Evgeny Polevikov, Timothy PL Smith, et al. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nature Methods, 17(11):1103–1110, 2020.
[16] Franka J Rang, Wigard P Kloosterman, and Jeroen de Ridder. From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy. Genome Biology, 19(1):90, 2018.
[17] Liang Xue, Hirohito Yamazaki, Ren Ren, Meni Wanunu, Aleksandar P Ivanov, and Joshua B Edel. Solid-state nanopore sensors. Nature Reviews Materials, 5(12):931–951, 2020.
[18] Yuhui He, Makusu Tsutsui, Yue Zhou, and Xiang-Shui Miao. Solid-state nanopore systems: from materials to applications. NPG Asia Materials, 13(1):48, 2021.
[19] Liu Xu and Masahide Seki. Recent advances in the detection of base modifications using the nanopore sequencer. Journal of Human Genetics, 65(1):25–33, 2020.
[20] Kefan Wang, Shanyu Zhang, Xiao Zhou, Xian Yang, Xinyue Li, Yuqin Wang, Pingping Fan, Yunqi Xiao, Wen Sun, Panke Zhang, et al. Unambiguous discrimination of all 20 proteinogenic amino acids and their modifications by nanopore. Nature Methods, 21(1):92–101, 2024.
[21] Lauren M Petersen, Isabella W Martin, Wayne E Moschetti, Colleen M Kershaw, and Gregory J Tsongalis. Third-generation sequencing in the clinical laboratory: exploring the advantages and challenges of nanopore sequencing. Journal of Clinical Microbiology, 58(1):10–1128, 2019.
[22] Heng Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094–3100, 2018.
[23] John D Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(03):90–95, 2007.
[24] Michael L Waskom. Seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021, 2021.
[25] Adam C English, Vipin K Menon, Richard A Gibbs, Ginger A Metcalf, and Fritz J Sedlazeck. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biology, 23(1):271, 2022.
[26] https://s3.amazonaws.com/zymo-files/biopool/d6331.refseq.zip.
7 Supplementary information

7.1 Supplementary figures

Figure 10: Architecture of the pre-training model for base-calling

Figure 11: Performance of fine-tuned basecalling model

Figure 12: Training (a) and validation (b) loss in the pre-train stage of the basecalling model.

Figure 13: Coverage depth of CycloneSEQ data in the HG002 genome.
7.2 Supplementary tables

Species                        Relative abundance (%)   Genome size (Mb)   Assembled fraction (%)
Faecalibacterium prausnitzii   14                       2.914              100
Veillonella rogosae            14                       2.158              100
Roseburia hominis              14                       3.463              100
Bacteroides fragilis           14                       5.167              99.988
Prevotella corporis            6                        2.947              99.977
Bifidobacterium adolescentis   6                        2.090              97.067
Fusobacterium nucleatum        6                        2.448              99.96
Lactobacillus fermentum        6                        1.905              99.997
Clostridioides difficile       1.5                      4.209              99.894
Akkermansia muciniphila        1.5                      2.851              100

Table 1: The relative abundance, genome size, and assembled fraction of different species in the mock metagenome sample. Species with relative genomic DNA or genome copy abundances below 1% are excluded.

Species                        16S rRNA CN (truth)   16S rRNA CN (assembly)   16S rRNA length (truth)   16S rRNA length (assembly)
Faecalibacterium prausnitzii   6                     6                        1509                      1508
Veillonella rogosae            4                     4                        1561                      1561
Roseburia hominis              4                     4                        1524                      1528
Bacteroides fragilis           6                     6                        1527                      1525
Prevotella corporis            4                     4                        1528                      1527
Bifidobacterium adolescentis   5                     4                        1528                      1526
Fusobacterium nucleatum        5                     5                        1514                      1514
Lactobacillus fermentum        5                     5                        1573                      1572
Clostridioides difficile       12                    12                       1501                      1500
Akkermansia muciniphila        3                     3                        1509                      1509

Table 2: Statistics of 16S rRNA copy numbers (CN) and lengths of different species in the mock metagenome sample. Species with relative genomic DNA or genome copy abundances below 1% are excluded.
Species                        5S rRNA CN (truth)   5S rRNA CN (assembly)   5S rRNA length (truth)   5S rRNA length (assembly)
Akkermansia muciniphila        3                    3                       108                      108
Bacteroides fragilis           6                    6                       106                      106
Bifidobacterium adolescentis   6                    5                       109                      109
Clostridioides difficile       11                   11                      105                      106
Faecalibacterium prausnitzii   6                    6                       112                      112
Fusobacterium nucleatum        5                    5                       107                      107
Lactobacillus fermentum        5                    5                       112                      112
Prevotella corporis            4                    4                       109                      109
Roseburia hominis              4                    4                       100                      100
Veillonella rogosae            4                    4                       110                      110

Table 3: Statistics of 5S rRNA copy numbers (CN) and lengths of different species in the mock metagenome sample. Species with relative genomic DNA or genome copy abundances below 1% are excluded.

Species                        23S rRNA CN (truth)   23S rRNA CN (assembly)   23S rRNA length (truth)   23S rRNA length (assembly)
Akkermansia muciniphila        3                     3                        2830                      2830
Bacteroides fragilis           6                     6                        2880                      2880
Bifidobacterium adolescentis   4                     5                        3040                      3049
Clostridioides difficile       12                    13                       2893                      2894
Faecalibacterium prausnitzii   6                     6                        2831                      2831
Fusobacterium nucleatum        5                     5                        2892                      2891
Lactobacillus fermentum        5                     5                        2917                      2917
Prevotella corporis            4                     4                        2892                      2894
Roseburia hominis              4                     4                        2884                      2885
Veillonella rogosae            4                     4                        2927                      2928

Table 4: Statistics of 23S rRNA copy numbers (CN) and lengths of different species in the mock metagenome sample. Species with relative genomic DNA or genome copy abundances below 1% are excluded.


Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00
unpaywall: last seen: 2026-05-24T02:00:01.246996+00:00

License: CC-BY-NC-ND-4.0