Genome-Guided Generative Adversarial Learning enables nanopore adaptive sequencing

Research Square, Method Article

Yixiang Zhang, Pingping Sun, Jiarong Zhang, Kechen Fan, Zhiguo Fu, and 4 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-8931691/v1 This work is licensed under a CC BY 4.0 License. Status: Posted. Version 1.

Abstract

Nanopore adaptive sequencing enables real-time target enrichment, yet current deep-learning methods require costly, sample-specific experimental training data. To address this, we developed GANBase, a genome-guided generative adversarial learning framework, which is trained exclusively on reference sequences and incorporates a Monte Carlo Tree Search-based rollout strategy for model training. GANBase demonstrates robust performance in target enrichment and host depletion across diverse scenarios.
In live adaptive sequencing experiments, it remains effective despite significant pore loss or flow cell version updates, providing a data-independent solution that significantly expands the utility of real-time targeted sequencing.

Keywords: Nanopore sequencing, adaptive sequencing, read until, Generative Adversarial Network (GAN), real-time targeted sequencing

Background

Host DNA depletion remains a key issue in pathogen detection and metagenomic sequencing [1–5]. Typical pre-treatment approaches rely on biochemical methods, such as digesting host DNA with specific nucleases [6] or selectively binding it with methyl-CpG binding proteins [7]. These approaches often suffer from limited applicability, increased experimental complexity, and high processing costs [3, 8]. Oxford Nanopore Technologies (ONT) addresses this issue through the ‘Read Until’ interface [9], which allows DNA molecules to be classified in real time as they pass through nanopores [10–13]. Once a molecule is classified as non-target, the system reverses the electrical current, actively ejecting the molecule so that a new strand can be sequenced [14]. This mechanism enables rapid enrichment of target DNA molecules within a short period. Consequently, the development of more efficient DNA sequence classification algorithms tailored for adaptive sequencing has become a central focus of related research. Existing computational methods fall into two groups: (1) alignment-based methods and (2) deep learning-based methods. Alignment-based methods identify target molecules by matching either nanopore signals or basecalled reads against reference genomes. Representative tools include the Dynamic Time Warping (DTW) method proposed by Loose et al. [9], UNCALLED by Kovaka et al. [15], Readfish by Payne et al. [16], and subsequent DTW variants such as sDTW [17] and cwSDTWNano [18].
Although alignment-based methods demonstrate high accuracy and efficiency in practice, they are computationally intensive and have high memory usage. The official documentation [19] also highlights a key limitation of adaptive sequencing: on RAM-limited devices such as the MK1C, it is almost impossible to deplete a large background (more than 125 Mb), primarily due to the computational burden of sequence alignment. To address this issue, researchers have explored end-to-end deep learning approaches for adaptive sequencing. SquiggleNet [20] was the first deep learning method, leveraging ResNet [21] to classify Zymo metagenome versus human host reads. However, supervised models can only classify the specific species they were trained on. To broaden the applicable scenarios, Senanayake et al. [22] addressed the lack of generalizability of SquiggleNet for SARS-CoV-2 and yeast detection with DeepSelectNet, while Danilevsky et al. [23] and Sneddon et al. [24] developed models targeting mitochondrial DNA and non-coding RNA, respectively. To address interpretability and performance limitations, including speed and validation robustness, Lin et al. [25] introduced NanoDeep. More recently, Fan et al. [26] proposed a fast model called ReadCurrent, which combines high accuracy with low computational overhead. Although deep learning methods have demonstrated advantages in speed and accuracy, current models still face key limitations. First, existing models are based on supervised learning frameworks, which are built using labeled data [27]. When the sample contains reads from unknown pathogens that were not seen during training, the model tends to misclassify them [28], thereby hindering target enrichment or host DNA depletion (Fig. 1 a). Second, models trained on nanopore electrical signals are usually tied to a specific version of the flow cell.
When adaptive sequencing is performed on a flow cell of a new version, the model needs to be retrained using new sequencing data, which requires additional sequencing experiments to generate signal data for training. These constraints highlight the limited flexibility and scalability of supervised learning frameworks, posing serious challenges to the broader adoption of adaptive sequencing. Based on the above considerations, we designed a modular neural network architecture for adaptive sequencing, comprising a basecaller module and a classifier module. The basecaller employs the official open-source model provided by ONT, thereby obviating the need for users to retrain the model following flow cell version updates. The classifier is designed to functionally substitute for sequence alignment, having been trained to identify the classification boundary inherent to the target species (Fig. 1 a). To implement this classification capability, we propose GANBase, an unsupervised learning framework for adaptive sequencing comprising a pre-trained generator and a discriminator (Fig. 1 b). By iteratively distinguishing real target sequences from synthetic sequences produced by the generator, the discriminator effectively captures the distribution boundary of target sequences. To train on discrete sequences, we adopted a Rollout Policy [29] based on Monte Carlo Tree Search (MCTS) [30], which estimates the reward value by sampling complete trajectories via rollouts. Overall, GANBase relies solely on reference genome sequences for training and integrates with the corresponding basecaller to enable real-time adaptive sequencing. To validate this framework, we first assessed the feasibility of the architecture on multiple simulated datasets derived from the ZymoBIOMICS High Molecular Weight (HMW) DNA Standard D6322 (referred to as ‘Zymo mock’), demonstrating GANBase's capacity for small genome enrichment.
We then conducted a systematic assessment across diverse host organisms to evaluate performance generalizability. Finally, we deployed GANBase in live nanopore sequencing experiments to verify its efficacy in real-world adaptive sequencing scenarios.

Results

GANBase can accurately classify unseen reads in simulated microbial enrichment experiments

To validate the enrichment ability of GANBase, we conducted a systematic performance assessment on the sequencing data of eight microorganisms from the ZymoBIOMICS HMW DNA Standard D6322. First, we trained a GANBase model for each of the eight species, using the corresponding reference genomes (Supplementary Table S7). Then we assessed the models on balanced (target: background = 14,000:14,000) and imbalanced (target: background = 2,000:14,000) datasets, adopting a One-vs-Rest (OvR) strategy (Fig. 2 a). The results demonstrate that the unsupervised GANBase model has classification ability, with median ROC-AUC, PR-AUC and F1-scores exceeding 0.7 (Fig. 2 b; Supplementary Tables 8–9). As alignment-based methods are the most commonly used in adaptive sequencing, we then compared GANBase with Minimap2. GANBase achieved recall values ranging from 82.81% ( S. cerevisiae ) to 93.03% ( P. aeruginosa ) on the balanced dataset (Fig. 2 c), while Minimap2 delivered recall values ranging from 82.67% ( S. cerevisiae ) to 98.6% ( P. aeruginosa ). A non-parametric permutation test showed no significant difference between the two methods (p > 0.05, Supplementary Method), indicating that GANBase achieves classification performance comparable to Minimap2. At the same time, GANBase demonstrates an advantage in speed (~30-fold improvement, Fig. 2 d). To quantify enrichment performance, we defined the in silico enrichment ratio as the target species' abundance in the enriched dataset divided by its abundance without enrichment.
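The definition of the in silico enrichment ratio given above can be illustrated with a short calculation. This is a minimal sketch, assuming abundance is measured as a fraction of read counts; the function names and the example counts are hypothetical and are not taken from the paper's results.

```python
def target_fraction(n_target: int, n_total: int) -> float:
    """Fraction of reads that belong to the target species."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_target / n_total


def in_silico_enrichment_ratio(tp: int, fp: int, n_target: int, n_background: int) -> float:
    """Target abundance among retained reads divided by target abundance in the input.

    tp: target reads that the classifier keeps
    fp: background reads that the classifier keeps by mistake
    """
    before = target_fraction(n_target, n_target + n_background)
    after = target_fraction(tp, tp + fp)
    return after / before


# Hypothetical counts using the imbalanced design (2,000 target : 14,000 background).
ratio = in_silico_enrichment_ratio(tp=1800, fp=700, n_target=2000, n_background=14000)
print(round(ratio, 2))  # 5.76
```

A ratio above 1 means the target makes up a larger share of the retained reads than of the input; a ratio below 1 means the filter removed proportionally more target than background.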
GANBase demonstrated its capacity for target enrichment: the in silico enrichment ratios for all eight species were greater than 1, and UMAP visualization further shows that GANBase can clearly separate target and non-target sequences (Fig. 2 e and f).

Host DNA depletion in simulated host-pathogen mixed datasets using GANBase

The human genome serves as the predominant host background in pathogen detection [6, 31]. Therefore, we first evaluated GANBase’s performance on human host depletion. We trained GANBase using the human reference genome, Genome Reference Consortium Human Build 38 (GRCh38) [32, 33]. We implemented the adaptive sequencing pipeline by integrating Bonito v4.3 for basecalling and the trained GANBase for classification. For comparison, we chose existing signal-based supervised models, including NanoDeep, SquiggleNet, and DeepSelectNet. To mitigate potential biases from these models' training data (Human and Zymo), we retrained all three models using the corresponding public nanopore sequencing data [34] (Supplementary Methods). For testing data, we conducted sequencing experiments on Zymo mock, Yeast, and SARS-CoV-2 (see Methods). We mixed the publicly available human sequencing reads and the in-house reads at a 4:1 ratio, with a total of 100,000 reads per test set. GANBase achieved the best performance in terms of accuracy, precision, specificity, speed, and in silico enrichment ratio across datasets (Fig. 3 a-c). These simulated experiments suggest that GANBase has a potential advantage over other deep learning methods in human host depletion scenarios. To assess whether the model can tolerate individual genomic differences, we tested GANBase on different hosts (NA12878, NA24385), each mixed separately with different target pathogens (SARS-CoV-1, Ebola, and Phage).
GANBase demonstrated consistently high classification efficacy, with ROC-AUC and PR-AUC values exceeding 88.9% and 87.0%, respectively, across all tested host-pathogen combinations (Fig. 3 c). Notably, the observed in silico enrichment ratios (1.84–1.99) closely approached the theoretical optimum of 2 (Fig. 3 d). The results indicate that GANBase maintains stable classification performance across different individual backgrounds. Given the pivotal role of zoonotic reservoirs in infectious diseases, extending GANBase to non-human hosts is of practical significance [35]. To demonstrate this capability, we trained four distinct GANBase models on the reference genomes of key species, including Anopheles cruzii (NCBI: 68878), Mus musculus (GRCm39, NCBI: 10090), Rhipicephalus microplus (NCBI: 6941), and Drosophila melanogaster (NCBI: 7227). To assess the performance of the models, we generated eight simulated datasets by combining host and pathogen reads in a 1:1 ratio, using data from the SRA database (see Methods). GANBase achieved ROC-AUC values from 52.84% (tick & CCHFV) to 98.38% (mouse & Y. pestis), and PR-AUC values ranging from 54.17% (tick & Zymo mock) to 98.50% (mouse & Y. pestis). Notably, while the tick datasets yielded lower discriminative scores (ROC-AUC and PR-AUC), complementary metrics confirmed that GANBase remained highly effective at depleting host sequences in these samples. In particular, for the tick & CCHFV dataset, GANBase achieved a high specificity of 99.92% and an in silico enrichment ratio of 1.96. For the tick & Zymo mock dataset, it achieved a high negative predictive value (NPV) of 91.00% and an in silico enrichment ratio of 1.56. Across all samples, in silico enrichment ratios ranged from 1.56 to 1.97 (Fig. 3 f), consistently validating GANBase’s ability to effectively deplete host sequences across diverse genomic backgrounds.
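The theoretical optimum of 2 for a 1:1 mixture, and the way a dataset can combine a low ROC-AUC with a high enrichment ratio, both follow from writing the ratio in terms of standard classification quantities. The derivation below is a sketch under the assumption that the ratio is computed on read counts with target reads as positives; the symbols E, \pi, r, and s are introduced here for illustration and do not appear in the original text.

```latex
% P, N : numbers of target and host reads in the input
% \pi = P/(P+N) : target prevalence in the input
% r = TP/P (recall), s = TN/N (specificity), hence TP = rP and FP = (1-s)N
\[
E \;=\; \frac{TP/(TP+FP)}{P/(P+N)}
  \;=\; \frac{\mathrm{Precision}}{\pi}
  \;=\; \frac{r}{r\,\pi + (1-s)(1-\pi)}
  \;\le\; \frac{1}{\pi}.
\]
% For a 1:1 mixture, \pi = 1/2, so E \le 2.
```

Under these assumptions, a ratio of 1.96 in a 1:1 mixture corresponds to a precision of 0.98 among the retained reads. A specificity close to 1 makes the term (1-s)(1-\pi) small, which keeps E near its bound even when recall is modest; this is consistent with the tick datasets showing low ROC-AUC alongside high enrichment ratios.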

Interpretability analysis shows that GANBase has learned effective classification features

To investigate whether the adaptive enrichment capability of GANBase is driven by biologically meaningful sequence features rather than spurious correlations, we analyzed the internal sequence representations learned by the model. t-SNE analysis showed that target and background reads formed clearly separable clusters in the embedding space for the four host-target combinations (Mosquito & Zika, Mouse & Y. pestis, Tick & CCHFV, Fly & Zymo mock; Fig. 4 a-d). Although there was some overlap at the boundary regions, GANBase overall mapped reads from target and non-target species to different regions of the feature space. We then performed motif analysis using WebLogo [36]. The most-attended k-mer motifs revealed that sequences classified as human had a relatively higher A/T proportion (59.90%). This finding is consistent with both our calculated results (59.71%) and the reported human genome average (59.13% [37]), with no significant difference. Conversely, sequences classified as non-human exhibited a relatively lower proportion of A/T (predicted 49.92%, reference 51.30%). The inter-group differences between human and Zymo mock were highly significant ( p < 0.0001). These results indicate that GANBase learned sequence motif patterns that are biologically interpretable and target-specific, and that these patterns form the basis of its adaptive sequencing performance.

Real-world adaptive sequencing experiment for comparison

To evaluate the performance of GANBase in real-time adaptive sequencing, we conducted two wet-lab experiments using flow cells of different versions (R9.4.1 and R10.4.1) (Fig. 5 a). The mixture samples were prepared using NA12878 and Zymo mock (D6322). In the R9 sequencing experiment, we performed sequential sequencing on the same flow cell, running SquiggleNet followed by GANBase (Fig. 5 b).
Despite a 41.41% reduction in active nanopores from the initial run, GANBase remained effective at depleting human DNA. In all barcodes, GANBase outperformed SquiggleNet in terms of in silico enrichment ratio, recall and precision (Fig. 5 d and f, Supplementary Table 29). In the R10 sequencing experiment, sequencing was performed on two separate flow cells, and the performance gap between the two methods widened significantly. Specifically, GANBase achieved in silico enrichment ratios of 1.98-fold, 4.37-fold, and 6.97-fold. Conversely, the in silico enrichment ratio of SquiggleNet dropped below 1.0 (0.65-fold to 0.75-fold), with both low recall and low precision, indicating a failure to enrich (Fig. 5 e and g). This may be because SquiggleNet was trained only on R9 data, resulting in a large number of misclassifications in the R10 tests. In contrast, we observed that GANBase performed better in the R10 test than in the R9 test. GANBase uses Bonito for basecalling, enabling it to adapt to newer flow cell versions. In summary, despite facing more challenging experimental conditions, GANBase performed comparably to, or even better than, SquiggleNet, and is not limited by the version of the sequencing flow cell. This demonstrates the robust performance of GANBase and its ability to focus limited sequencing resources on important target sequences, thereby significantly improving sequencing efficiency.

Discussion

Nanopore adaptive sequencing enriches or depletes target reads by determining the origin of DNA/RNA molecules in real time. Existing learning-based methods use supervised architectures that rely on labeled signal datasets and cannot accurately classify unseen data. Moreover, signal data generated by different flow cell versions are inconsistent.
Once the flow cell version is updated, the corresponding model must be retrained with new sequencing data, which inevitably increases the computational and experimental overhead of data acquisition. Alignment-based methods, in turn, are computationally and memory-intensive, slower than deep learning methods, and poorly suited to handheld devices with limited computing resources. To address these issues, we developed GANBase, an unsupervised GAN-based model that integrates an MCTS reward scoring mechanism to achieve backpropagation on discrete sequences. Experimental cases on 13 species and live adaptive sequencing experiments demonstrate that GANBase, built solely from the reference genome, can effectively classify targets and overcome the limitations mentioned above. Although GANBase’s enrichment capability is slightly inferior to that of alignment-based methods (Supplementary Table 30), its speed and memory advantages still indicate its application potential (see Supplementary Material). In particular, GANBase uses less memory (only 2.6 MB for model parameters) than alignment-based methods (e.g. Minimap2, which requires 7.2 GB to load the index file), and runs faster (0.47 milliseconds per read, compared to 24.14 milliseconds for Minimap2). However, this study has some key limitations. Firstly, the current implementation of the sequencing pipeline, which was developed in Python, lacks extensive engineering optimization. Its computational throughput and latency have yet to reach the performance levels of native C or C++ implementations. Secondly, key operational parameters within the adaptive sequencing pipeline, such as signal extraction intervals, chunking strategies, and batch sizes, require further systematic tuning and optimization, as they directly impact adaptive sequencing performance.
Thirdly, our tests of tolerance to individual genomic differences involved only two individuals, so this result requires further verification.

Conclusion

In summary, GANBase represents a scalable solution for adaptive sequencing, facilitating its translation into clinical and field applications. GANBase’s lightweight nature enables it to be used in a wide range of deployment scenarios, ranging from high-throughput centers to remote field locations. Even in computationally constrained environments, specifically those using portable sequencers (e.g., MinION Mk1C) or edge computing devices, users can execute real-time target enrichment or depletion by simply loading pre-trained reference weights tailored to specific taxa. Consequently, GANBase emerges as a potential solution for the on-site detection of emerging pathogens, offering profound implications for global genomic surveillance and rapid public health response.

Methods

Dataset construction

The ZymoBIOMICS HMW DNA Standard D6322 is a mixture of genomic DNA isolated from pure cultures of seven bacterial strains and one fungal strain: B. subtilis , E. faecalis , E. coli , L. monocytogenes , P. aeruginosa , S. enterica , S. aureus , and S. cerevisiae . To assess the viability of GANBase, we trained multiple classifiers on distinct microbial genomes separately, using the sequences downloaded from https://s3.amazonaws.com/Zymo-files/BioPool/D6322.refseq.zip . Considering that the decision time in adaptive sequencing is within 1 s [38], the sequence input to the model should be shorter than 400 bp (DNA molecules pass through the pore at 450 bp/s). To minimize the decision time while maintaining decision accuracy, we constructed 16 training sets with different model input lengths on S. aureus (Supplementary Table 1).
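The input-length constraint above is simple arithmetic on the translocation speed and the decision window. The sketch below reproduces it; the function names are hypothetical, and the calculation deliberately ignores basecalling and classification latency, which would shorten the usable window further.

```python
def bases_in_window(speed_bp_per_s: float, window_s: float) -> float:
    """Number of bases that pass through the pore during the decision window."""
    return speed_bp_per_s * window_s


def seconds_for_bases(n_bases: float, speed_bp_per_s: float) -> float:
    """Time needed for n_bases to pass through the pore."""
    return n_bases / speed_bp_per_s


SPEED = 450.0  # bp/s, as stated in the text

max_bases = bases_in_window(SPEED, 1.0)   # upper bound for a 1 s decision
t_400 = seconds_for_bases(400, SPEED)     # time to acquire a 400 bp input
t_200 = seconds_for_bases(200, SPEED)     # time to acquire a 200 bp input
print(max_bases, round(t_400, 3), round(t_200, 3))  # 450.0 0.889 0.444
```

At 450 bp/s a 1 s window yields at most 450 bases, so a 400 bp input leaves roughly 0.11 s for computation, while a 200 bp input leaves more than half of the window.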
For the other seven reference genomes, we split the reference sequences into segments using sliding windows with a predefined window length (200 bp) and step size (100 bp), and constructed seven training sets. For all trained models related to the Zymo mock, we built multiple test sets from the in-house Zymo mock sequencing data and conducted a performance evaluation. The sequencing data were generated using MinKNOW v23.11.4 [39] and basecalled using Guppy basecaller v6.2.1 (nanoporetech.com/zh/document/Guppy-protocol). After the adapter and barcode sequences had been trimmed, each read was aligned against the Zymo mock reference using Minimap2 v2.22 [40]. We then took the reads aligned to each species and extracted the first N bp of each read to build the test sets. In particular, we built balanced and imbalanced test sets for the models trained on the eight species of the Zymo mock, containing 28,000 reads and 16,000 reads, respectively. For the depletion of human host DNA, the training data were built from GRCh38. The reference genome sequences of the autosomes and sex chromosomes were divided into segments using a sliding window length of 200 bp and a step size of 200 bp, which corresponds to approximately 1-fold coverage of the entire genome. To evaluate performance, we constructed 33 test sets by combining in-house nanopore sequencing data, including the Zymo mock DNA and SARS-CoV-2 standards from Twist, with publicly available nanopore sequencing data for NA12878 and NA24385 in varying mixing ratios (Supplementary Table 3). The NA12878 and NA24385 datasets were sourced from the Oxford Nanopore Human Reference Dataset [41] and the Human Pangenome Reference Consortium ( https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=NHGRI_UCSC_panel/HG002/nanopore/ ), respectively.
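The sliding-window segmentation used to build the training sets above can be sketched as follows. This is an illustrative version rather than the authors' code: the function name is hypothetical, and dropping a trailing partial window is an assumption, since the text does not say how sequence ends are handled.

```python
from typing import Iterator


def sliding_windows(sequence: str, window: int = 200, step: int = 100) -> Iterator[str]:
    """Yield fixed-length segments; a trailing partial window is dropped (assumption)."""
    for start in range(0, len(sequence) - window + 1, step):
        yield sequence[start:start + window]


toy_reference = "ACGT" * 125  # 500 bp toy sequence, not real genome data

# Window 200 bp, step 100 bp (Zymo mock genomes): overlapping segments, about 2-fold coverage.
overlapping = list(sliding_windows(toy_reference, window=200, step=100))

# Window 200 bp, step 200 bp (GRCh38): non-overlapping tiles, about 1-fold coverage.
tiled = list(sliding_windows(toy_reference, window=200, step=200))

print(len(overlapping), len(tiled))  # 4 2
```

Halving the step doubles the number of segments, which is why the 200 bp step was a natural choice for the much larger human genome.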
For the zoonotic hosts, we trained four sets of GANBase weights on the respective reference genomes: Anopheles cruzii (mosquito, NCBI Taxonomy ID: 68878), Mus musculus (mouse, GRCm39, NCBI ID: 10090), Rhipicephalus microplus (tick, NCBI ID: 6941), and Drosophila melanogaster (fruit fly, NCBI ID: 7227). The simulated host-pathogen mixed datasets were constructed using data from the SRA database, including: (1) Anopheles mosquito (SRA: DRP012751) & Zika virus (SRA: SRP072852), (2) Anopheles mosquito & West Nile Virus (WNV, SRA: ERR6357505), (3) house mouse (SRA: ERS20299361) & Zymo mock, (4) house mouse & Yersinia pestis (Y. pestis, SRA: SRP576427), (5) Dermacentor silvarum (SRA: SRP565110) & Zymo mock, (6) Dermacentor silvarum & Crimean-Congo Hemorrhagic Fever Virus (CCHFV, SRA: ERP130784), (7) fruit fly (sequencing data from ONT) & Zymo mock, and (8) fruit fly & DENV (SRA: SRR36350780). Each test set comprised a 1:1 mixture of host and pathogen reads. For all the datasets mentioned above, we used discrete tokens {0, 1, 2, 3} to represent nucleotides {A, C, G, T}.

Model architecture

GANBase is inspired by SeqGAN [42], a Generative Adversarial Network (GAN) [43] variant developed for generating sequences of discrete tokens. In contrast to SeqGAN, GANBase uses the generative model to guide the training of a discriminative model. For the generative model, we used a five-layer Long Short-Term Memory (LSTM) [44] neural network to generate the probabilities of the four nucleotides at each base position. Before the adversarial training process, the generative model was first pretrained on the training set for 15 epochs using Maximum Likelihood Estimation (MLE). The generator then produced the nucleotide at each position in turn, according to the conditional probability.
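The token encoding and the position-by-position generation described above can be sketched in plain Python. This is a minimal illustration, not the GANBase generator: `toy_next_probs` is a hypothetical stand-in for the LSTM's conditional distribution, and the handling of ambiguous bases such as N is not covered because the text does not describe it.

```python
import random
from typing import Callable, List, Sequence

NUCLEOTIDES = "ACGT"  # token i corresponds to NUCLEOTIDES[i], i.e. {A, C, G, T} -> {0, 1, 2, 3}
TOKEN = {base: i for i, base in enumerate(NUCLEOTIDES)}


def encode(seq: str) -> List[int]:
    """Map a nucleotide string to discrete tokens (raises KeyError on other symbols)."""
    return [TOKEN[base] for base in seq.upper()]


def decode(tokens: Sequence[int]) -> str:
    """Map discrete tokens back to a nucleotide string."""
    return "".join(NUCLEOTIDES[t] for t in tokens)


def sample_sequence(next_probs: Callable[[Sequence[int]], Sequence[float]],
                    length: int, rng: random.Random) -> List[int]:
    """Generate tokens one position at a time from p(x_t | x_1..x_{t-1})."""
    tokens: List[int] = []
    for _ in range(length):
        probs = next_probs(tokens)
        tokens.append(rng.choices(range(4), weights=probs, k=1)[0])
    return tokens


def toy_next_probs(prefix: Sequence[int]) -> List[float]:
    """Hypothetical conditional distribution that favours repeating the last base."""
    if not prefix:
        return [0.25, 0.25, 0.25, 0.25]
    probs = [0.1, 0.1, 0.1, 0.1]
    probs[prefix[-1]] = 0.7
    return probs


generated = sample_sequence(toy_next_probs, length=200, rng=random.Random(0))
print(decode(generated)[:40])
```

In the actual model the conditional distribution would come from the pretrained LSTM; only the sampling loop is the same in spirit.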
For the discriminator model, we used a six-layer Transformer Encoder [45] and a linear layer to determine whether the input sequence is from the training set or not. The discriminator was also pretrained for five epochs, using sequences from the training set and sequences produced by the pretrained generator. The detailed hyperparameters are shown in Supplementary Tables 20–22.

Nanopore sequencing experiment for the collected samples

We performed nanopore sequencing on the collected samples using a MinION sequencer (MK1B, Oxford Nanopore Technologies, ONT). Libraries were prepared separately for Zymo mock DNA, the fungal component of the Zymo mock (S. cerevisiae), and the SARS-CoV-2 standard (from Twist Bioscience). MinION sequencing was conducted on R9.4.1 flow cells (FLO-MIN106, ONT) according to the manufacturer’s protocol. Basecalling was performed using Guppy (v6.2.1), and the resulting sequences were aligned to their corresponding reference genomes using minimap2 (v2.22).

Model training and evaluation

In our experiments, we used a reinforcement learning method to train each model on the training dataset. The batch size was set to 280, and training ran for 50 iterations. To prevent overfitting, we implemented an early stopping strategy, which halted training if the loss did not decrease for five consecutive rounds. For optimization, we used the Adam algorithm for the generator and stochastic gradient descent (SGD) for the discriminator. GANBase is built using PyTorch (v1.10.0) and Python 3.6. Model training and testing were performed on an Ubuntu 20.04.6 system powered by an Intel(R) Xeon(R) Gold 6126T CPU @ 2.60GHz and an NVIDIA RTX A5000 GPU. In the training process, the generator constructs each position of the sequence sequentially, using a Monte Carlo Tree Search to sample possible sequences for the subsequent positions.
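The rollout idea behind this reward estimation can be sketched as follows: a partial sequence is completed several times by sampling from the generator, each completion is scored by the discriminator, and the scores are averaged. This is a simplified illustration using plain Monte Carlo rollouts in the style of SeqGAN, not a tree search and not the authors' implementation; `uniform_policy` and `gc_scorer` are hypothetical stand-ins for the LSTM generator and the Transformer discriminator.

```python
import random
from typing import Callable, List, Sequence

Policy = Callable[[Sequence[int]], Sequence[float]]  # prefix -> probabilities over {0, 1, 2, 3}
Scorer = Callable[[Sequence[int]], float]            # full sequence -> reward in [0, 1]


def complete(prefix: Sequence[int], policy: Policy, length: int, rng: random.Random) -> List[int]:
    """Extend a partial sequence to full length by sampling from the policy."""
    tokens = list(prefix)
    while len(tokens) < length:
        tokens.append(rng.choices(range(4), weights=policy(tokens), k=1)[0])
    return tokens


def rollout_reward(prefix: Sequence[int], policy: Policy, scorer: Scorer,
                   length: int, n_rollouts: int, rng: random.Random) -> float:
    """Estimate the expected reward of a prefix by averaging scores of sampled completions."""
    if len(prefix) >= length:
        return scorer(prefix)  # a complete sequence is scored directly
    total = 0.0
    for _ in range(n_rollouts):
        total += scorer(complete(prefix, policy, length, rng))
    return total / n_rollouts


def uniform_policy(prefix: Sequence[int]) -> List[float]:
    """Stand-in generator: every nucleotide is equally likely."""
    return [0.25, 0.25, 0.25, 0.25]


def gc_scorer(tokens: Sequence[int]) -> float:
    """Stand-in discriminator: fraction of C/G tokens (1 and 2)."""
    return sum(1 for t in tokens if t in (1, 2)) / len(tokens)


rng = random.Random(0)
prefix = [2] * 10  # ten G's already generated, ten positions still open
reward = rollout_reward(prefix, uniform_policy, gc_scorer, length=20, n_rollouts=200, rng=rng)
print(round(reward, 2))  # close to the expected value of 0.75
```

The averaged reward gives a training signal for an intermediate position even though the discriminator only scores complete sequences; in a policy-gradient update, it would weight the log-probability of the token chosen at that position.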
The sampled sequences are then passed to the discriminator, which calculates a score that serves as the reward. The generator is then updated using a policy gradient method. The configuration was designed to balance training efficiency and model performance. The adaptive sequencing pipeline is shown in Supplementary Fig. 10. The experiments were performed on a workstation equipped with an NVIDIA GeForce RTX 4090 graphics card and an Ubuntu 20.04 system. The experiments also used the MinION MK1B nanopore sequencer (ONT) and MinKNOW software (version 23.11.4). We show the model size and training time on different reference genomes in Supplementary Tables 25–26.

Adaptive sequencing of the depletion of host human DNA in the case study

We mixed the human DNA standard (NA12878 standard) and the microbial DNA standard (Zymo mock) at different DNA mass ratios and constructed libraries for adaptive sequencing experiments. First, the DNA of the NA12878 standard and the Zymo mock was sheared using a g-TUBE, with a target fragment size of 6 kbp. Pippin HT (Sage Science) was then used to size-select the sheared DNA, retaining fragments longer than 6 kbp. After the samples were quantified using the Qubit 4.0 nucleic acid quantifier and the Qsep100 biological fragment analyzer, the NA12878 standard and the Zymo mock were mixed at ratios of 1:1 (200 ng:200 ng), 4:1 (320 ng:80 ng), and 9:1 (360 ng:40 ng) to obtain microbial samples containing human DNA. All three mixed DNA samples were then used to construct nanopore sequencing libraries.
The reagents used included NEBNext Ultra II End Repair/dA-tailing Module (New England Biolabs, NEB, USA), Native Barcoding Kit (Oxford Nanopore Technologies, ONT, UK), NEB Blunt/TA Ligase Master Mix (NEB, USA), ligation sequencing kit LSK110 (ONT, UK), NEBNext Quick Ligation Module (NEB, USA), ligation sequencing kit LSK114 (ONT, UK), R9.4.1 Flow Cells, and R10.4.1 Flow Cells. Library preparation was carried out according to the library construction instructions. For adaptive sequencing, we ran an adaptive sequencing script built on the Read Until API provided by ONT and the GANBase model. Considering the input-length requirements of GANBase, we set the interval between API calls to 0.85 s. To speed up real-time classification, we used the Bonito basecaller instead of the Guppy basecall server originally required by the API. Specifically, we implemented Bonito as a local function, avoiding the need to call additional processes. With this step, the detected electrical signals can be passed directly to the Bonito basecaller and converted to sequences, significantly reducing data-processing time. During sequencing, the adaptive sequencing script took the initial portion of the signal of each DNA molecule as input and analyzed it with the basecaller and the GANBase host-depletion model. The script sent the result to the MinION sequencer (MK1B, ONT), which decided whether to continue sequencing the DNA molecule. If GANBase classifies a DNA molecule as a non-host read, the MinION allows it to pass through the pore and be sequenced completely. Conversely, if the molecule is classified as a host read, the MinION terminates its sequencing and ejects it from the pore.
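The per-read decision described above (basecall the first chunk of signal, classify it, then either eject or keep sequencing) can be sketched as a small function. This is a schematic illustration only: it does not call the Read Until API or Bonito, the `basecall` and `host_score` callables are hypothetical stand-ins, and the assumption that a higher score means "more host-like" with ejection at or above the threshold is ours, since the text does not specify the orientation of the score.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Decision:
    read_id: str
    action: str   # "unblock" (eject the molecule) or "continue" (sequence it fully)
    score: float


def decide(read_id: str,
           signal_chunk: Sequence[float],
           basecall: Callable[[Sequence[float]], str],
           host_score: Callable[[str], float],
           threshold: float) -> Decision:
    """Basecall the start of a read, score it, and choose an action for the sequencer."""
    sequence = basecall(signal_chunk)
    score = host_score(sequence)
    # Assumption: reads scoring at or above the threshold are treated as host and ejected.
    action = "unblock" if score >= threshold else "continue"
    return Decision(read_id=read_id, action=action, score=score)


# Stand-ins for illustration; real signal, basecalling, and scoring are far more involved.
fake_basecall = lambda chunk: "ACGT" * 50
host_like = decide("read-1", [0.0] * 4000, fake_basecall, lambda seq: 0.90, threshold=0.1)
target_like = decide("read-2", [0.0] * 4000, fake_basecall, lambda seq: 0.05, threshold=0.1)
print(host_like.action, target_like.action)  # unblock continue
```

Keeping the basecaller and the classifier as interchangeable callables mirrors the modular design described in the text, where the basecaller can be swapped when the flow cell version changes.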
It is worth noting that although MinKNOW's adaptive sequencing method uses a sequence alignment strategy, its processing time for each sequence may be shorter than that of the AI-based method, which suggests that MinKNOW's adaptive sequencing has undergone substantial engineering optimization.

Evaluation metrics

In this study, we used classification metrics commonly used in deep learning, together with the in silico enrichment ratio, relative enrichment ratio, and absolute enrichment ratio, to perform the assessment. In the genome enrichment task, the target reads were treated as positive samples, while in the genome depletion task, the host DNA was treated as negative samples. We therefore used true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) to calculate the ACC, ROC-AUC, PR-AUC, Precision, Recall, Matthews correlation coefficient (MCC), and F1 score. The detailed calculation formulas are shown in the Supplementary Materials. The in silico enrichment ratio was calculated as the percentage of target reads obtained with adaptive sequencing divided by the percentage obtained without it.

Threshold choice

Since classification performance (for example, accuracy and in silico enrichment ratio) depends on the threshold, we evaluated the impact of the threshold on classification results under 11 different mixing ratios of hosts and pathogens to select an appropriate classification threshold. These ratios ranged from moderately imbalanced (5:1 to 10k:1) to extremely imbalanced (100k:1). Analysis of the MCC and F1-score curves under different thresholds revealed that for host removal tasks, the model performs better in the lower threshold range (Supplementary Fig. 5). Moreover, in highly imbalanced scenarios, overly aggressive threshold settings can be counterproductive, as maintaining recall is crucial for preserving rare target sequences.
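A threshold sweep of the kind described above can be sketched with the confusion-matrix definitions of F1 and MCC. The scores and labels below are hypothetical toy values, and the convention that a read is predicted positive (target) when its score is at or above the threshold is an assumption made for this illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def confusion(scores: Sequence[float], labels: Sequence[int], threshold: float) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) when a read is predicted positive if score >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def f1_score(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def sweep(scores: Sequence[float], labels: Sequence[int],
          thresholds: Sequence[float]) -> List[Dict[str, float]]:
    """Compute F1 and MCC at each candidate threshold."""
    rows = []
    for t in thresholds:
        tp, fp, tn, fn = confusion(scores, labels, t)
        rows.append({"threshold": t, "f1": f1_score(tp, fp, fn), "mcc": mcc(tp, fp, tn, fn)})
    return rows


toy_scores = [0.95, 0.80, 0.30, 0.05, 0.20, 0.60]  # hypothetical classifier outputs
toy_labels = [1, 1, 1, 0, 0, 0]                    # 1 = target read, 0 = host read
for row in sweep(toy_scores, toy_labels, [0.1, 0.5]):
    print(row["threshold"], round(row["f1"], 3), round(row["mcc"], 3))
```

On this toy data the lower threshold keeps every target read (higher recall and F1) at the cost of more false positives, which is the trade-off the threshold analysis weighs.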
Considering both the in silico enrichment ratio heatmap and the changes in the classification metrics with the threshold, we found that a threshold of 0.1 gives the best overall performance across the different mixing ratios. At this threshold, even under extreme dilution conditions of 100k:1, the model can still balance the detection limit and false positive rate, thus ensuring the reliability of subsequent analyses.

Declarations

Ethics declarations

Ethics approval and consent to participate: No ethical approval was required for this study.

Consent for publication: Not applicable.

Competing interests: The authors declare no competing interests.

Funding: Not applicable.

Author Contribution

Y.Z. conceived the project, designed and performed the in-silico experiments, conducted all data analysis, and drafted the manuscript. P.S. guided the work and revised the manuscript. J.Z. conducted wet laboratory experiments, including DNA extraction, amplification, nanopore sequencing, and adaptive sequencing. K.F. performed the in-silico part of the adaptive sequencing wet lab experiment. Z.F. and X.B. revised the manuscript. M.N. offered advice and guidance on the study and revised the manuscript. Z.R. conceived the project, designed the in-silico experiments, and drafted the manuscript. All authors contributed to the article and approved the submitted version.

Data Availability

Genome Reference Consortium Human Build 38 can be obtained from https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.26/ . The reference genomes of the mouse, fruit fly, mosquito, and tick are from https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001635.27/, https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001215.4, https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_943734635.1/, and https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_013339725.1/, respectively.
The sequencing data of NA24385 can be obtained from https://github.com/marbl/HG002/blob/main/Sequencing_data.md. The sequencing data of NA12878 can be obtained from https://github.com/nanopore-wgs-consortium/NA12878/blob/master/Genome.md. The fruit fly sequencing data were obtained from the Oxford Nanopore Open Data Project (source: s3://ont-open-data/contrib/melanogaster_bkim_2023.01/flowcells/D.melanogaster.R1041.400bps/). The remaining sequencing data were obtained from the following SRA entries: mosquito (DRP012751), mouse (ERS20299361), tick (SRP565110), Zika virus (SRP072852), WNV (ERR6357505), Yersinia pestis (SRP576427), and CCHFV (ERP130784).

Code availability

The GANBase software is available at https://github.com/renzilin/GANBase.

References

1. Marotz CA, et al. Improving saliva shotgun metagenomics by chemical host DNA depletion. Microbiome. 2018;6:42.
2. Heravi FS, Zakrzewski M, Vickery K, Hu H. Host DNA depletion efficiency of microbiome DNA enrichment methods in infected tissue samples. J Microbiol Methods. 2020;170:105856.
3. Ganda E, et al. DNA Extraction and Host Depletion Methods Significantly Impact and Potentially Bias Bacterial Detection in a Biological Fluid. mSystems. 2021;6. doi:10.1128/msystems.00619-21.
4. Chen Y-C, et al. Optimization of Metagenomic Next-Generation Sequencing Workflow with a Novel Host Depletion Method for Enhanced Pathogen Detection. Mol Diagn Ther. 2025;29:689–99.
5. Wang C, et al. Benefits and challenges of host depletion methods in profiling the upper and lower respiratory microbiome. npj Biofilms Microbiomes. 2025;11:130.
6. Charalampous T, et al. Nanopore metagenomics enables rapid clinical diagnosis of bacterial lower respiratory infection. Nat Biotechnol. 2019;37:783–92.
7. Miller S, et al. Laboratory validation of a clinical metagenomic sequencing assay for pathogen detection in cerebrospinal fluid. Genome Res. 2019;29:831–42.
8. Hasan MR, et al. Depletion of Human DNA in Spiked Clinical Specimens for Improvement of Sensitivity of Pathogen Detection by Next-Generation Sequencing. J Clin Microbiol. 2016;54:919–27.
9. Loose M, Malla S, Stout M. Real-time selective sequencing using nanopore technology. Nat Methods. 2016;13:751–4.
10. Deamer DW, Akeson M. Nanopores and nucleic acids: prospects for ultrarapid sequencing. Trends Biotechnol. 2000;18:147–51.
11. Restrepo-Pérez L, Joo C, Dekker C. Paving the way to single-molecule protein sequencing. Nat Nanotech. 2018;13:786–96.
12. Marquet M, et al. Evaluation of microbiome enrichment and host DNA depletion in human vaginal samples using Oxford Nanopore’s adaptive sequencing. Sci Rep. 2022;12:4000.
13. Meyer D, et al. Unlocking the full potential of nanopore sequencing: tips, tricks, and advanced data analysis techniques. Nucleic Acids Res. 2026;54:gkag023.
14. Loose M, Malla S, Stout M. Real-time selective sequencing using nanopore technology. Nat Methods. 2016;13:751–4.
15. Kovaka S, Fan Y, Ni B, Timp W, Schatz MC. Targeted nanopore sequencing by real-time mapping of raw electrical signal with UNCALLED. Nat Biotechnol. 2021;39:431–41.
16. Payne A, et al. Readfish enables targeted nanopore sequencing of gigabase-sized genomes. Nat Biotechnol. 2021;39:442–50.
17. Shih PJ, Saadat H, Parameswaran S, Gamaarachchi H. Efficient real-time selective genome sequencing on resource-constrained devices. Gigascience. 2022;12:giad046.
18. Han R, Wang S, Gao X. Novel algorithms for efficient subsequence searching and mapping in nanopore raw signals towards targeted sequencing. Bioinformatics. 2020;36:1333–43.
19. Adaptive sampling. Oxford Nanopore Technologies. https://nanoporetech.com/document/adaptive-sampling (2020).
20. Bao Y, et al. SquiggleNet: real-time, direct classification of nanopore signals. Genome Biol. 2021;22:298.
21. He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016. p. 770–778. doi:10.1109/CVPR.2016.90.
22. Senanayake A, Gamaarachchi H, Herath D, Ragel R. DeepSelectNet: deep neural network based selective sequencing for oxford nanopore sequencing. BMC Bioinformatics. 2023;24:31.
23. Danilevsky A, Polsky AL, Shomron N. Adaptive sequencing using nanopores and deep learning of mitochondrial DNA. Brief Bioinform. 2022;23:bbac251.
24. Sneddon A, et al. Biochemical-free enrichment or depletion of RNA classes in real-time during direct RNA sequencing with RISER. Preprint at https://doi.org/10.1101/2022.11.29.518281 (2024).
25. Lin Y, et al. NanoDeep: a deep learning framework for nanopore adaptive sampling on microbial sequencing. Brief Bioinform. 2023;25:bbad499.
26. Fan K, et al. ReadCurrent: a VDCNN-based tool for fast and accurate nanopore selective sequencing. Brief Bioinform. 2024;25:bbae435.
27. Shetty SH, Shetty S, Singh C, Rao A. Supervised Machine Learning: Algorithms and Applications. In: Fundamentals and Methods of Machine and Deep Learning. John Wiley & Sons, Ltd; 2022. p. 1–16. doi:10.1002/9781119821908.ch1.
28. Zhou Z-H. A brief introduction to weakly supervised learning. Natl Sci Rev. 2018;5:44–53.
29. Bertsekas D. Reinforcement learning and optimal control. https://faculty.engineering.asu.edu/bertsekas/books/reinforcement-learning-and-optimal-control/
30. Świechowski M, Godlewski K, Sawicki B, Mańdziuk J. Monte Carlo Tree Search: A Review of Recent Modifications and Applications. Artif Intell Rev. 2023;56:2497–562.
31. Shi Y, Wang G, Lau HC-H, Yu J. Metagenomic Sequencing for Microbial DNA in Human Samples: Emerging Technological Advances. Int J Mol Sci. 2022;23:2181.
32. Guo Y, et al. Improvements and impacts of GRCh38 human reference on high throughput sequencing data analysis. Genomics. 2017;109:83–90.
33. Human genome reference builds - GRCh38 or hg38 - b37 - hg19. GATK. https://gatk.broadinstitute.org/hc/en-us/articles/360035890951-Human-genome-reference-builds-GRCh38-or-hg38-b37-hg19 (2024).
34. nanopore-wgs-consortium/NA12878. nanopore-wgs-consortium. (2026).
35. Vashisht V, et al. Genomics for Emerging Pathogen Identification and Monitoring: Prospects and Obstacles. BioMedInformatics. 2023;3:1145–77.
36. Crooks GE, Hon G, Chandonia J-M, Brenner SE. WebLogo: A Sequence Logo Generator. Genome Res. 2004;14:1188–90.
37. Piovesan A, et al. On the length, weight and GC content of the human genome. BMC Res Notes. 2019;12:106.
38. Edwards HS, et al. Real-Time Selective Sequencing with RUBRIC: Read Until with Basecall and Reference-Informed Criteria. Sci Rep. 2019;9:1–11.
39. Jain M, Olsen HE, Paten B, Akeson M. The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community. Genome Biol. 2016;17:239.
40. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–100.
41. Jain M, et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol. 2018;36:338–45.
42. Yu L, Zhang W, Wang J, Yu Y. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. Preprint at https://doi.org/10.48550/arXiv.1609.05473 (2017).
43. Goodfellow IJ, et al. Generative adversarial nets. In: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. Cambridge, MA, USA: MIT Press; 2014. p. 2672–2680.
44. Hochreiter S, Schmidhuber J. Long Short-term Memory. Neural Comput. 1997;9:1735–80.
45. Vaswani A, et al. Attention Is All You Need. Preprint at http://arxiv.org/abs/1706.03762 (2017).

Additional Declarations

No competing interests reported.

Supplementary Files

Supplementary.docx
Authors and affiliations

Yixiang Zhang (Northeast Normal University), Pingping Sun (Northeast Normal University), Jiarong Zhang (Shanxi Medical University), Kechen Fan (Academy of Military Medical Sciences), Zhiguo Fu (Northeast Normal University), Xiaochen Bo (Academy of Military Medical Sciences), Di Guan (Beijing Academy of Science and Technology, Beijing Center for Physical and Chemical Analysis), Ming Ni (Academy of Military Medical Sciences), and Zilin Ren (Northeast Normal University; corresponding author).

Figure 1. Motivation and unsupervised GANBase model. (a) Nanopore adaptive sequencing with host depletion based on a classification neural network. Existing methods are mainly based on supervised learning, in which a neural network learns the classification boundary between samples. However, when the model encounters sequences from unseen species, it loses the ability to classify them. This study proposes GANBase, an unsupervised learning model based on the generative adversarial network (GAN) framework. GANBase uses a generator to randomly sample sequences and a discriminator to distinguish between target and generated sequences. During adversarial learning, this process gradually fits the classification boundary of the target species sequences. (b) GANBase is trained on the reference genome of the target species; the model input consists of base sequences produced by sliding-window processing. The generator is first pre-trained on real samples and then generates simulated sequences from random noise. The discriminator distinguishes between real and generated sequences.

Figure 2. Performance of GANBase in enriching target sequences for Zymo mock species. (a) Model training and evaluation workflow. The reference genomes of the eight Zymo mock species are used as input to train eight target enrichment models. The models are evaluated on two types of simulated datasets: balanced data (1:1 ratio, 28,000 reads in total) and imbalanced data (1:7 ratio, 16,000 reads in total). Evaluation metrics include ROC-AUC, PR-AUC, F1 score, recall, simulated enrichment rate, and decision speed. (b) Box plots of the distribution of the classification performance metrics of the eight models. The center line represents the median, and the cross symbol (×) represents the mean. (c) Recall of the classification models built for the eight target species (i.e., the proportion of target species sequences in the test set that the model predicts correctly). The horizontal axis shows the classification models of the eight target species, and the vertical axis shows the recall. Blue represents sequence alignment, and red represents GANBase classification. (d) Comparison of average sequence processing time. We measured the decision time under different batch sizes for a test set of 2,000 sequences from each species (sequence length 200 bp). Red represents the average time for GANBase to classify each sequence, blue represents Minimap2 with the default number of threads (n = 3), and green represents Minimap2 with 32 threads. (e) In silico enrichment factor across the eight Zymo species. The red dashed line marks the baseline enrichment ratio (y = 1). (f-g) UMAP visualization of sequence embeddings from the target species and the other seven Zymo mock species. Red represents the target species, (f) S. aureus and (g) E. coli, while blue represents the other seven Zymo mock species. There are 2,000 target species sequences and 2,000 non-target species sequences (positive:negative = 1:1).

Figure 3. Performance evaluation and generality validation of host genome removal based on simulated datasets. (a-b) Comparison of the host genome removal performance of different methods on simulated datasets. (a) Radar chart comparing GANBase with NanoDeep, SquiggleNet, and DeepSelectNet on three simulated datasets (NA12878 mixed with Zymo, yeast, and SARS-CoV-2). Evaluation metrics include accuracy (ACC), ROC-AUC, PR-AUC, precision, specificity, and negative predictive value (NPV). GANBase (red line) shows excellent and balanced performance across all metrics. (b) The upper panel shows the in silico enrichment ratio of each method; the lower panel shows the decision speed (ms) per read. (c-d) Model performance on human and key pathogen datasets. (c) Box plot of the performance stability of GANBase on mixed data from two human hosts (NA12878, NA24385) and multiple pathogens (SARS-CoV-1, Phage, Ebola). (d) Bar chart of the enrichment ratio and decision speed for different combinations of human hosts and pathogens. The dashed line represents the theoretical maximum enrichment ratio. (e-f) Model rebuilding and generalization results on zoonotic disease hosts and other key disease datasets. (e) Performance of models rebuilt for hosts such as mosquitoes, ticks, mice, and flies, in combination with key pathogens such as Zika virus, West Nile virus (WNV), and Crimean-Congo hemorrhagic fever virus (CCHFV). (f) In silico enrichment for the above zoonotic disease-related combinations.

Figure 4. GANBase can learn representation separability and sequence-level motif interpretation. (a-d) Two-dimensional t-SNE projections of high-dimensional sequence representations learned by different models or under different training settings. Each point corresponds to an individual sequence, colored by class label (blue, target; orange, host). Panels show representative cases with varying degrees of class separability, illustrating how the proposed model progressively organizes sequences into more discriminative latent manifolds. (e-f) After host removal from the human-Zymo dataset, GANBase identified sequence motifs enriched in the target categories. The horizontal bar chart shows the relative importance scores of the top-ranking k-mer motifs (6-mer in e, 12-mer in f), while the sequence motif plot visualizes the corresponding nucleotide composition. Green represents the model's predicted weight for human sequences, blue represents the predicted weight for non-human (Zymo) sequences, and gray represents the actual proportion of human and non-human sequences. (g) Alignment analysis of the distribution of A/T ratios between the human and non-human groups. The violin plot shows the distribution of the predicted (Pred, red) and reference (Ref, blue) A/T ratios.

Figure 5. Performance comparison of GANBase and SquiggleNet in a single flow cell. (a) Overall workflow of the wet-lab experiment. NA12878 and Zymo mock were mixed at ratios of 1:1 (barcode01), 4:1 (barcode02), and 9:1 (barcode03). Sequencing was performed using a MinION Mk1B sequencer, an R9.4.1 flow cell, and two R10.4.1 flow cells, with channels 1-256 set as the experimental group and channels 257-512 as the control group. The adaptive sequencing protocol comprised two phases: (1) an initial 3-hour phase of sequencing with SquiggleNet, followed by (2) a 3-hour phase using Bonito basecalling coupled with GANBase analysis. (b-c) Heatmaps of output reads per active channel for SquiggleNet and GANBase at 30-minute intervals. (d-f) Proportion of Zymo mock reads in routine mode and in adaptive sequencing mode using GANBase and SquiggleNet, for three samples with different mixing ratios: 1:1 (d), 4:1 (e), and 9:1 (f). The light blue bars represent the control group, and the dark blue bars represent the GANBase and SquiggleNet adaptive sequencing experimental groups. (g-i) Absolute and relative enrichment ratios for both sequencing modes. The horizontal axis shows the two enrichment ratio metrics, and the vertical axis shows their values. The red bars represent the adaptive sequencing experimental group using GANBase, and the blue bars represent the group using SquiggleNet. (j-l) Recall and precision under identical conditions (mixing ratios: 1:1 (j), 4:1 (k), 9:1 (l)). The horizontal axis shows recall and precision, and the vertical axis shows their values. Red represents the adaptive sequencing experimental group using GANBase, and blue represents the group using SquiggleNet.
Typical pre-treatment approaches employ biochemical experiments, such as digesting host DNA with specific nucleases\u003csup\u003e\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u003c/sup\u003e or using methyl-CpG binding proteins for selective binding\u003csup\u003e\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e. These often suffer from limited applicability, increased experimental complexity, and high processing costs\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e,\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u003c/sup\u003e. Oxford Nanopore Technologies (ONT) addresses this issue through \u0026lsquo;Read Until\u0026rsquo; interface\u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e, which allows DNA molecules to be classified in real time as they pass through nanopores\u003csup\u003e\u003cspan additionalcitationids=\"CR11 CR12\" citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u003c/sup\u003e. Once the non-target molecule is classified, the system reverses the electrical current, actively ejecting the molecules and then sequencing a new strand\u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e. This mechanism facilitates the rapid enrichment of target DNA molecules within a short period. Thus, the development of more efficient DNA sequence classification algorithms tailored for adaptive sequencing has become a central focus of related research.\u003c/p\u003e \u003cp\u003eExisting computational methods can be categorized into two groups: (1) alignment-based methods and (2) deep learning-based methods. 
Alignment-based methods identify target molecules by matching either nanopore signals or basecalled reads against reference genomes. Representative tools include the method proposed by Loose et al.\u003csup\u003e9\u003c/sup\u003e using the Dynamic Time Warping (DTW) algorithm, UNCALLED by Kovaka et al.\u003csup\u003e15\u003c/sup\u003e, Readfish by Payne et al.\u003csup\u003e16\u003c/sup\u003e, and subsequent DTW variants such as sDTW\u003csup\u003e\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u003c/sup\u003e and cwSDTWNano\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e. Although alignment-based methods demonstrate high accuracy and efficiency in practice, such methods face computationally intensive and high memory usage challenges. The official documentation\u003csup\u003e\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e also highlights the key limitation of adaptive sequencing: on RAM-limited devices such as the MK1C, it\u0026rsquo;s almost impossible to perform large background depletion (more than 125 Mb), primarily due to the computational burden of sequence alignment. To address this issue, researchers have explored end-to-end deep learning approaches for adaptive sequencing. SquiggleNet\u003csup\u003e\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u003c/sup\u003e is the first deep learning method, leveraging ResNet\u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e to classify Zymo metagenome versus human host reads. 
However, since supervised models can only perform classification on specific species, to expand the available scenarios, Senanayake et al.\u003csup\u003e22\u003c/sup\u003e addressed the lack of generalizability of SquiggleNet for SARS-CoV-2 and yeast detection with DeepSelectNet, while Danilevsky et al.\u003csup\u003e23\u003c/sup\u003e and Sneddon et al.\u003csup\u003e24\u003c/sup\u003e focused on model development targeting mitochondrial DNA and non-coding RNA, respectively. Regarding the interpretability and performance limitations\u0026mdash;including speed and validation robustness\u0026mdash;Lin et al.\u003csup\u003e25\u003c/sup\u003e introduced the NanoDeep. More recently, Fan et al.\u003csup\u003e26\u003c/sup\u003e proposed a swift model called ReadCurrent, which combines high accuracy with low computational overhead.\u003c/p\u003e \u003cp\u003eAlthough deep learning methods have demonstrated advantages in speed and accuracy, current models still face some key limitations that cannot be ignored. First, existing models are based on supervised learning frameworks, which are built using labeled data\u003csup\u003e\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e\u003c/sup\u003e. When encountering unseen reads from unknown pathogens in the sample, the model would misclassify these reads\u003csup\u003e\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e\u003c/sup\u003e, thereby hindering target enrichment or host DNA depletion efficacy (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ea). Second, models trained on nanopore electrical signals are usually linked to a specific version of the flow cell. When adaptive sequencing is performed on a new version flow cell, the model needs to be retrained using new sequencing data. This requires additional sequencing experiments to generate signal data as training data. 
These constraints highlight the limited flexibility and scalability of supervised learning frameworks, posing serious challenges to the broader adoption of adaptive sequencing.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eBased on the above considerations, we design a modular neural network architecture for adaptive sequencing, comprising a basecaller module and a classifier module. The basecaller employs the official open-source model provided by ONT, thereby obviating the need for users to retrain the model following flow cell version updates. The classifier is designed to functionally substitute for sequence alignment, having been trained to identify the classification boundary inherent to the target species (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ea). To implement this classification capability, we proposed GANBase, an unsupervised learning framework for adaptive sequencing comprising a pre-trained generator and a discriminator (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eb). By iteratively distinguishing real target sequences from synthetic sequences generated by the generator, the discriminator effectively captures the distribution boundary of target sequences. We adopted a Rollout Policy\u003csup\u003e\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e\u003c/sup\u003e based on Monte Carlo Tree Search (MCTS) \u003csup\u003e\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e\u003c/sup\u003e for discrete sequence backpropagation, which estimates the reward value by sampling complete trajectories via rollouts.\u003c/p\u003e \u003cp\u003eIn general, GANBase relies solely on reference genome sequences for training and integrates with the corresponding basecaller to facilitate real-time adaptive sequencing. 
To validate this framework, we first assessed the feasibility of the architecture on multiple simulated datasets derived from the ZymoBIOMICS High Molecular Weight (HMW) DNA Standard D6322 (referred to as \u0026lsquo;Zymo mock\u0026rsquo;), demonstrating GANBase's capacity for small genome enrichment. We then conducted a systematic assessment across diverse host organisms to evaluate performance generalizability. Finally, we deployed GANBase in live nanopore sequencing experiments to verify its efficacy in real-world adaptive sequencing scenarios.\u003c/p\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eGANBase can accurately classify unseen reads in simulated microbial enrichment experiments\u003c/h2\u003e \u003cp\u003eTo validate the enrichment ability of GANBase, we conducted a systematic performance assessment on the sequencing data of eight microorganisms from the ZymoBIOMICS HMW DNA Standard D6322. First, we trained a GANBase model for each of the eight species, using the corresponding reference genomes (Supplementary Table S7). Then we assessed models on the balanced (target: background\u0026thinsp;=\u0026thinsp;14,000:14,000) and imbalanced (target: background\u0026thinsp;=\u0026thinsp;2,000:14,000) datasets, adopting a One-vs-Rest (OvR) strategy (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea). The results demonstrate that the unsupervised model GANBase has classification ability, with median ROC-AUC, PR-AUC and F1-scores exceeding 0.7 (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eb; Supplementary Tables\u0026nbsp;8\u0026ndash;9).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eAs the alignment-based methods are the most commonly used in adaptive sequencing, we then compared GANBase with Minimap2. GANBase achieved recall values ranging from 82.81% (\u003cem\u003eS. 
cerevisiae\u003c/em\u003e) to 93.03% (\u003cem\u003eP. aeruginosa\u003c/em\u003e) on the balanced dataset (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ec), while Minimap2 delivered recall values ranging from 82.67% (\u003cem\u003eS. cerevisiae\u003c/em\u003e) to 98.6% (\u003cem\u003eP. aeruginosa\u003c/em\u003e). A non-parametric permutation test showed no significant difference between the two methods (p\u0026thinsp;\u0026gt;\u0026thinsp;0.05, Supplementary Method), indicating that GANBase achieves classification performance comparable to that of Minimap2. At the same time, GANBase demonstrates an advantage in speed (~30-fold improvement, Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ed).\u003c/p\u003e \u003cp\u003eTo quantify enrichment performance, we defined the \u003cem\u003ein silico\u003c/em\u003e enrichment ratio as the quotient of the target species' abundance in the enriched dataset and its abundance without enrichment. GANBase demonstrated its capacity for target enrichment: the \u003cem\u003ein silico\u003c/em\u003e enrichment ratios for all eight species were greater than 1, and UMAP visualization further shows that GANBase can clearly separate target and non-target sequences (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ee and f).\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eHost DNA depletion in simulated host-pathogen mixed datasets using GANBase\u003c/h3\u003e\n\u003cp\u003eThe human genome serves as the predominant host background in pathogen detection\u003csup\u003e\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e,\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e\u003c/sup\u003e. Therefore, we first evaluated GANBase\u0026rsquo;s performance on human host depletion. 
We trained GANBase using the human reference genome, Genome Reference Consortium Human Build 38 (GRCh38)\u003csup\u003e\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e,\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e\u003c/sup\u003e. We implemented the adaptive sequencing pipeline by integrating Bonito v4.3 for basecalling and the trained GANBase for classification. For comparison, we chose existing signal-based supervised models, including NanoDeep, SquiggleNet, and DeepSelectNet. To mitigate potential biases in these models\u0026rsquo; training data (Human and Zymo), we retrained all three models using the corresponding public nanopore sequencing data\u003csup\u003e\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e\u003c/sup\u003e (Supplementary Methods). For testing data, we conducted sequencing experiments on Zymo mock, Yeast, and SARS-CoV-2 (see Methods). We mixed the publicly available human sequencing reads and the in-house reads at a 4:1 ratio, with a total of 100,000 reads per test set. GANBase achieved the best performance in terms of accuracy, precision, specificity, speed, and \u003cem\u003ein silico\u003c/em\u003e enrichment ratio across datasets (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ea-c). These simulated experiments indicate that GANBase has a potential advantage over other deep learning methods in human host depletion scenarios.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eTo assess whether the model can tolerate individual genomic differences, we tested GANBase on different hosts (NA12878, NA24385), each mixed separately with different target pathogens (SARS-CoV-1, Ebola, and Phage). 
GANBase demonstrated consistently high classification efficacy, with ROC-AUC and PR-AUC values exceeding 88.9% and 87.0% across all tested host-pathogen combinations (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ec). Notably, the observed \u003cem\u003ein silico\u003c/em\u003e enrichment ratios (1.84\u0026ndash;1.99) closely approached the theoretical optimum of 2 (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ed). The results indicate that GANBase maintains stable classification performance across different individual backgrounds.\u003c/p\u003e \u003cp\u003eGiven the pivotal role of zoonotic reservoirs in infectious diseases, extending GANBase to non-human hosts is of practical significance\u003csup\u003e\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e\u003c/sup\u003e. To demonstrate this capability, we trained four distinct GANBase models on the reference genomes of key species, including \u003cem\u003eAnopheles cruzii\u003c/em\u003e (NCBI: 68878), \u003cem\u003eMus musculus\u003c/em\u003e (GRCm39, NCBI: 10090), \u003cem\u003eRhipicephalus microplus\u003c/em\u003e (NCBI: 6941), and \u003cem\u003eDrosophila melanogaster\u003c/em\u003e (NCBI: 7227). To assess the performance of the models, we generated eight simulated datasets by combining host and pathogen reads in a 1:1 ratio, using data from the SRA database (see Methods).\u003c/p\u003e \u003cp\u003eGANBase achieved ROC-AUC values from 52.84% (tick \u0026amp; CCHFV) to 98.38% (mouse \u0026amp; Y. pestis), and PR-AUC values ranging from 54.17% (tick \u0026amp; Zymo mock) to 98.50% (mouse \u0026amp; Y. pestis). Notably, while the tick datasets yielded lower discriminative scores (ROC-AUC and PR-AUC), complementary metrics confirmed that GANBase remained highly effective at depleting host sequences in these samples. 
In particular, for the tick \u0026amp; CCHFV dataset, GANBase achieved a high specificity of 99.92% and an \u003cem\u003ein silico\u003c/em\u003e enrichment ratio of 1.96. For the tick \u0026amp; Zymo mock dataset, it achieved a high negative predictive value (NPV) of 91.00% and an \u003cem\u003ein silico\u003c/em\u003e enrichment ratio of 1.56. Across all samples, \u003cem\u003ein silico\u003c/em\u003e enrichment ratios ranged from 1.56 to 1.97 (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ef), consistently validating GANBase\u0026rsquo;s ability to effectively deplete host sequences across diverse genomic backgrounds.\u003c/p\u003e\n\u003ch3\u003eInterpretability analysis shows that GANBase has learned effective classification features\u003c/h3\u003e\n\u003cp\u003eTo investigate whether the adaptive enrichment capability of GANBase is driven by biologically meaningful sequence features rather than spurious correlations, we analyzed the internal sequence representations learned by the model.\u003c/p\u003e \u003cp\u003et-SNE analysis showed that target and background reads formed clearly separable clusters in the embedding space in the four host-target combinations (Mosquito \u0026amp; Zika, Mouse \u0026amp; Y. pestis, Tick \u0026amp; CCHFV, Fly \u0026amp; Zymo mock; Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ea-d). Although there was some overlap at the boundary regions, overall, GANBase successfully mapped reads from target and non-target species to different regions of the feature space.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eWe then performed motif analysis using WebLogo\u003csup\u003e\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e\u003c/sup\u003e. The most-attended k-mer motifs revealed that sequences classified as human had a relatively higher A/T proportion (59.90%). 
This finding is consistent with both our calculated results (59.71%) and the reported human genome average (59.13%\u003csup\u003e37\u003c/sup\u003e), with no significant difference. Conversely, sequences classified as non-human exhibited a relatively lower proportion of A/T (predicted 49.92%, reference 51.30%). The inter-group differences between the human and Zymo mock sequences were highly significant (\u003cem\u003ep\u003c/em\u003e\u0026thinsp;\u0026lt;\u0026thinsp;0.0001).\u003c/p\u003e \u003cp\u003eThese results indicate that GANBase learned sequence motif patterns that are biologically interpretable and target-specific, which form the basis of its adaptive sequencing performance.\u003c/p\u003e\n\u003ch3\u003eReal-world adaptive sequencing experiment for comparison\u003c/h3\u003e\n\u003cp\u003eTo evaluate the performance of GANBase in real-time adaptive sequencing, we conducted two wet-lab experiments, using flow cells of different versions (R9.4.1 and R10.4.1) (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ea). The mixture samples were prepared using NA12878 and Zymo mock (D6322).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eIn the R9 sequencing experiment, we performed sequential sequencing on the same flow cell, running SquiggleNet followed by GANBase (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eb). Despite a 41.41% reduction in active nanopores from the initial run, GANBase remained effective at depleting human DNA. 
In all barcodes, GANBase outperformed SquiggleNet in terms of \u003cem\u003ein silico\u003c/em\u003e enrichment ratio, recall and precision (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ed and f, Supplementary Table\u0026nbsp;29).\u003c/p\u003e \u003cp\u003eIn the R10 sequencing experiment, sequencing was performed on two separate flow cells, and the performance gap between the two methods widened substantially. Specifically, GANBase achieved \u003cem\u003ein silico\u003c/em\u003e enrichment ratios of 1.98-fold, 4.37-fold, and 6.97-fold. Conversely, the \u003cem\u003ein silico\u003c/em\u003e enrichment ratio of SquiggleNet dropped below 1.0 (0.65-fold to 0.75-fold), with both low recall and low precision, indicating a failure to enrich (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ee and g). This may be because SquiggleNet was only trained on R9 data, resulting in a large number of misclassifications in R10 tests. In contrast, we observed that GANBase performed better in the R10 test than in the R9 test. GANBase utilizes Bonito for basecalling, enabling it to adapt to newer flow cell versions.\u003c/p\u003e \u003cp\u003eIn summary, despite facing more challenging experimental conditions, GANBase performed comparably to, or even better than, SquiggleNet, and is not limited by the version of the sequencing flow cell. This demonstrates the robust performance of GANBase and its ability to focus limited sequencing resources on important target sequences, thereby improving sequencing efficiency.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eNanopore adaptive sequencing enriches or depletes target reads by determining the origin of DNA/RNA molecules in real time. Existing learning-based methods use supervised architectures that rely on labeled signal datasets and are unable to classify unseen data accurately. 
Moreover, signal data generated by different flow cell versions is inconsistent. Once the flow cell version is updated, the corresponding model must be retrained with new sequencing data, which inevitably increases the computational and experimental overhead for data acquisition. Alignment-based methods, in turn, are computationally and memory intensive, which makes them slower than deep learning methods and poorly suited to handheld devices with limited computing resources.\u003c/p\u003e \u003cp\u003eTo address these issues, we developed GANBase, an unsupervised GAN-based model that integrates an MCTS reward scoring mechanism to achieve backpropagation on discrete sequences. Experimental cases on 13 species and live adaptive sequencing experiments demonstrate that GANBase, built solely on the reference genome, can effectively classify targets and overcome the limitations mentioned above. Although GANBase\u0026rsquo;s enrichment capability is slightly inferior to that of alignment-based methods (Supplementary Table\u0026nbsp;30), its speed and memory advantages still indicate its application potential (see Supplementary Material). In particular, GANBase uses less memory (only 2.6 MB for model parameters) than alignment-based methods (e.g. Minimap2, which requires 7.2 GB to load the index file), and runs faster (0.47 milliseconds per read, compared to 24.14 milliseconds for Minimap2).\u003c/p\u003e \u003cp\u003eHowever, there are some key limitations in this study. Firstly, the current implementation of the sequencing pipeline, which was developed in Python, lacks extensive engineering optimization. Its computational throughput and latency have yet to reach the performance levels of native C or C++ implementations. 
Secondly, key operational parameters within the adaptive sequencing pipeline, such as signal extraction intervals, chunking strategies, and batch sizes, require further systematic tuning and optimization, as these factors directly impact adaptive sequencing performance. Thirdly, our tests of tolerance to inter-individual genomic differences involved only two individuals, so this result requires further verification.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eIn summary, GANBase represents a scalable solution for adaptive sequencing, facilitating its translation into clinical and field applications. GANBase\u0026rsquo;s lightweight nature enables it to be used in a wide range of deployment scenarios, from high-throughput centers to remote field locations. Even in computationally constrained environments, specifically those utilizing portable sequencers (e.g., MinION Mk1C) or edge computing devices, users can execute real-time target enrichment or depletion by simply loading pre-trained reference weights tailored to specific taxa. Consequently, GANBase emerges as a potential solution for the on-site detection of emerging pathogens, with implications for global genomic surveillance and rapid public health response.\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003eDataset construction\u003c/h2\u003e \u003cp\u003eThe ZymoBIOMICS HMW DNA Standard D6322 is a mixture of genomic DNA isolated from pure cultures of seven bacterial strains and one fungal strain, including \u003cem\u003eB. subtilis\u003c/em\u003e, \u003cem\u003eE. faecalis\u003c/em\u003e, \u003cem\u003eE. coli\u003c/em\u003e, \u003cem\u003eL. monocytogenes\u003c/em\u003e, \u003cem\u003eP. aeruginosa\u003c/em\u003e, \u003cem\u003eS. enterica\u003c/em\u003e, \u003cem\u003eS. aureus\u003c/em\u003e, and \u003cem\u003eS. cerevisiae\u003c/em\u003e. 
To assess the feasibility of GANBase, we trained a separate classifier on each microbial genome, using the sequences downloaded from \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://s3.amazonaws.com/Zymo-files/BioPool/D6322.refseq.zip\u003c/span\u003e\u003cspan address=\"https://s3.amazonaws.com/Zymo-files/BioPool/D6322.refseq.zip\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. Because the decision time in adaptive sequencing is within 1 s\u003csup\u003e38\u003c/sup\u003e, the sequence input to the model should be shorter than 400 bp (DNA molecules pass through the pore at 450 bp/s). To minimize the decision time while maintaining decision accuracy, we constructed 16 training sets with different model input lengths on \u003cem\u003eS. aureus\u003c/em\u003e (Supplementary Table\u0026nbsp;1). For the other seven reference genomes, we split the reference sequences into segments using sliding windows with a predefined window length (200 bp) and step size (100 bp), and constructed seven training sets.\u003c/p\u003e \u003cp\u003eFor all trained models related to the Zymo mock, we built multiple test sets from the in-house Zymo mock sequencing data and conducted a performance evaluation. The sequencing data were generated using MinKNOW v23.11.4\u003csup\u003e39\u003c/sup\u003e and basecalled using Guppy basecaller v6.2.1 (nanoporetech.com/zh/document/Guppy-protocol). Once the adapter and barcode sequences had been trimmed, each read was aligned against the Zymo mock reference using Minimap2 v2.22\u003csup\u003e40\u003c/sup\u003e. We then built the test sets by extracting the first N bp of the reads aligned to each species. 
In particular, we built balanced test sets and imbalanced test sets for models trained on eight species of Zymo mock, which contain 28,000 reads and 16,000 reads, respectively.\u003c/p\u003e \u003cp\u003eFor the depletion of human host DNA, the training data was built using GRCh38. The reference genome sequences of autosomes and sex chromosomes were divided into segments using a sliding window length of 200 bp and a step size of 200 bp, which corresponds to approximately 1-fold coverage of the entire genome. To evaluate performance, we constructed 33 test sets by manually combining in-house nanopore sequencing data, including the Zymo mock DNA and SARS-CoV-2 standards from Twist, with publicly available nanopore sequencing data for NA12878 and NA24385 in varying mixing ratios (Supplementary Table\u0026nbsp;3). The NA12878 and NA24385 datasets were sourced from the Oxford Nanopore Human Reference Dataset\u003csup\u003e\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e\u003c/sup\u003e and the Human Pangenome Reference Consortium (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=NHGRI_UCSC_panel/HG002/nanopore/\u003c/span\u003e\u003cspan address=\"https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=NHGRI_UCSC_panel/HG002/nanopore/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e), respectively.\u003c/p\u003e \u003cp\u003eFor the zoonotic hosts, we trained four GANBase models on the reference genomes of Anopheles cruzii (mosquito, NCBI Taxonomy ID: 68878), Mus musculus (mouse, GRCm39, NCBI ID: 10090), Rhipicephalus microplus (tick, NCBI ID: 6941), and Drosophila melanogaster (fruit fly, NCBI ID: 7227), respectively. 
The simulated host-pathogen mixed datasets were constructed using data from the SRA database, including: (1) Anopheles mosquito (SRA: DRP012751) \u0026amp; Zika virus (SRA: SRP072852), (2) Anopheles mosquito \u0026amp; West Nile Virus (WNV, SRA: ERR6357505), (3) house mouse (SRA: ERS20299361) \u0026amp; Zymo mock, (4) house mouse \u0026amp; Yersinia pestis (Y. pestis, SRA: SRP576427), (5) Dermacentor silvarum (SRA: SRP565110) \u0026amp; Zymo mock, (6) Dermacentor silvarum \u0026amp; Crimean-Congo Hemorrhagic Fever Virus (CCHFV, SRA: ERP130784), (7) fruit fly (sequencing data from ONT) \u0026amp; Zymo mock, and (8) fruit fly \u0026amp; dengue virus (DENV, SRA: SRR36350780). Each test set comprised a 1:1 mixture of host and pathogen reads.\u003c/p\u003e \u003cp\u003eFor all the datasets mentioned above, we used discrete tokens {0, 1, 2, 3} to represent nucleotides {A, C, G, T}.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eModel architecture\u003c/h2\u003e \u003cp\u003eGANBase is inspired by SeqGAN\u003csup\u003e\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e\u003c/sup\u003e, which extends the Generative Adversarial Network (GAN)\u003csup\u003e\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e\u003c/sup\u003e, originally developed for generating real-valued data, to discrete sequences. In contrast to SeqGAN, GANBase uses the generative model to guide the training of a discriminative model. For the generative model, we used a five-layer Long Short-Term Memory (LSTM)\u003csup\u003e\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e\u003c/sup\u003e neural network to generate the probabilities of the four nucleotides at each base position. Before the adversarial training process, the generative model was first pretrained using the Maximum Likelihood Estimation (MLE) method. The pretraining was conducted for 15 epochs using the training set. 
The generator then generated the nucleotide at each position in turn, according to the conditional probability. For the discriminator model, we used a six-layer Transformer Encoder\u003csup\u003e\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e\u003c/sup\u003e and a linear layer to determine whether the input sequence is from the training set or not. The discriminator model was also pretrained for five epochs using the sequences from the training set and sequences produced by the pretrained generator. The detailed hyperparameters are shown in Supplementary Tables\u0026nbsp;20\u0026ndash;22.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003eNanopore sequencing experiment for the collected samples\u003c/h2\u003e \u003cp\u003eWe performed nanopore sequencing on the collected samples using a MinION sequencer (MK1B, Oxford Nanopore Technologies, ONT). Libraries were prepared separately for Zymo mock DNA, the fungal component of the Zymo mock (S. cerevisiae), and the SARS-CoV-2 standard (from Twist Bioscience). MinION sequencing was conducted on R9.4.1 flow cells (FLO-MIN106, ONT) according to the manufacturer\u0026rsquo;s protocol. Basecalling was performed using Guppy (v6.2.1), and the resulting sequences were aligned to their corresponding reference genomes using minimap2 (v2.22).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003eModel training and evaluation\u003c/h2\u003e \u003cp\u003eIn our experiments, we used the reinforcement learning method to train each model on the training dataset. The specific settings were as follows: the batch size was set to 280, meaning that 280 samples were processed in each batch, and the training ran for 50 iterations. To prevent overfitting, we implemented an early stopping strategy, which halted training if the loss did not decrease for five consecutive rounds. 
For optimization, we used the Adam algorithm for the generator and stochastic gradient descent (SGD) for the discriminator. GANBase is built using PyTorch (v1.10.0) and Python 3.6. Model training and testing were performed on an Ubuntu 20.04.6 system powered by an Intel(R) Xeon(R) Gold 6126T CPU @ 2.60GHz and an NVIDIA RTX A5000 GPU.\u003c/p\u003e \u003cp\u003eIn the training process, the generator constructs each position of the sequence sequentially, using a Monte Carlo Tree Search to sample possible sequences for the subsequent positions. These sequences are then passed to the discriminator, which calculates a score as a reward. The generator is then updated using a policy gradient method. The configuration was designed to balance training efficiency and model performance.\u003c/p\u003e \u003cp\u003eThe adaptive sequencing pipeline is shown in Supplementary Fig.\u0026nbsp;10. The experiments were performed on a workstation equipped with an NVIDIA GeForce RTX 4090 graphics card and an Ubuntu 20.04 system. The experiments also used the nanopore sequencer MinION MK1B (ONT) and MinKNOW software (version 23.11.4). We show the model size and training time on different reference genomes in Supplementary Tables\u0026nbsp;25\u0026ndash;26.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003eAdaptive sequencing of the depletion of host human DNA in the case study\u003c/h2\u003e \u003cp\u003eWe mixed the human DNA standard (NA12878 standard) and microbial DNA standard (Zymo mock) at different DNA mass ratios and constructed libraries for adaptive sequencing experiments. First, the DNA fragments of the NA12878 standard and the Zymo mock were sheared using gtube, with the target fragment size set to 6 kbp. Pippin HT (Sage Science) was used to size-select the sheared DNA fragments, and DNA fragments longer than 6 kbp were retained. 
After the samples were quantified using the Qubit 4.0 nucleic acid quantifier and the Qsep100 biological fragment analyzer, the NA12878 standard and the Zymo mock were mixed in ratios of 1:1 (200 ng:200 ng), 4:1 (320 ng:80 ng), and 9:1 (360 ng:40 ng) to obtain microbial samples containing human DNA. All three mixed DNA samples were then used to construct nanopore sequencing libraries. The reagents used included NEBNext Ultra II End Repair/dA-tailing Module (New England Biolabs, NEB, USA), Native Barcoding Kit (Oxford Nanopore Technologies, ONT, UK), NEB Blunt/TA Ligase Master Mix (NEB, USA), ligation sequencing kit LSK110 (ONT, UK), NEBNext Quick Ligation Module (NEB, USA), ligation sequencing kit LSK114 (ONT, UK), R9.4.1 Flow Cells, and R10.4.1 Flow Cells. The experimental steps were carried out according to the library construction instructions.\u003c/p\u003e \u003cp\u003eFor the adaptive sequencing, we used the Read Until API provided by ONT and the GANBase model to run the adaptive sequencing script. Considering the parameter requirements of GANBase for input sequences, we set the interval between API calls to 0.85 s. To give GANBase a faster processing speed during the genotyping process, we used the Bonito basecaller instead of the Guppy basecaller server originally required by the API. Specifically, we implemented Bonito as a local function, avoiding the need to call additional processes. With this step, we were able to directly import the detected electrical signals and convert them to sequences with the Bonito basecaller, reducing the time lost to data processing.\u003c/p\u003e \u003cp\u003eDuring the sequencing process, the adaptive sequencing script used the initial portion of the signal of each DNA molecule as input and then analyzed the signal with the basecaller and the host-depletion model GANBase. 
The script sent the analysis results to the MinION sequencer (MK1B, ONT) to decide whether to continue sequencing the DNA molecule. If GANBase classifies a DNA molecule as a non-host read, MinION allows it to pass through the pore and be sequenced completely. Conversely, if the DNA molecule is classified as a host read, MinION terminates its sequencing and ejects the molecule from the pore.\u003c/p\u003e \u003cp\u003eIt is worth noting that although MinKNOW's adaptive sequencing method uses a sequence alignment strategy, its processing time for each sequence may be shorter than that of the AI-based method, which suggests that MinKNOW's adaptive sequencing has undergone substantial engineering optimization.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003eEvaluation metrics\u003c/h2\u003e \u003cp\u003eIn this study, we used deep learning-based metrics, \u003cem\u003ein silico\u003c/em\u003e enrichment ratio, relative enrichment ratio, and absolute enrichment ratio to perform the assessment. In the task of genome enrichment, the target reads were classified as positive samples, while the host DNA was categorized as negative samples in the task of genome depletion. Therefore, we used true positive (TP), false positive (FP), true negative (TN), and false negative (FN) to calculate the ACC, ROC-AUC, PR-AUC, Precision, Recall, MCC, and F1 score.\u003c/p\u003e \u003cp\u003eThe detailed calculation formulas are shown in the Supplementary Materials. 
The \u003cem\u003ein silico\u003c/em\u003e enrichment ratio was calculated as the ratio of the percentage of target reads in experiments conducted with adaptive sequencing to the percentage in experiments conducted without it.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003eThreshold choice\u003c/h2\u003e \u003cp\u003eSince classification performance, such as accuracy and \u003cem\u003ein silico\u003c/em\u003e enrichment ratio, is affected by the threshold, we evaluated the impact of the threshold on classification results under 11 different mixing ratios of hosts and pathogens to select an appropriate classification threshold. These ratios ranged from moderately imbalanced (5:1 to 10k:1) to extremely imbalanced (100k:1). Analysis of the Matthews correlation coefficient (MCC) and F1-score curves under different thresholds revealed that for host removal tasks, the model exhibits stronger performance in the lower threshold range (Supplementary Fig.\u0026nbsp;5). Moreover, in highly imbalanced scenarios, overly aggressive threshold settings can be counterproductive, as maintaining recall is crucial for preserving rare target sequences. Considering both the \u003cem\u003ein silico\u003c/em\u003e enrichment ratio heatmap and the changes in various classification metrics with the threshold, we found that when the threshold is set to 0.1, the model maintains the best overall performance across different mixing ratios. 
At this threshold, even under extreme dilution conditions of 100k:1, the model can still balance the detection limit and false positive rate, thus ensuring the reliability of subsequent analyses.\u003c/p\u003e \u003c/div\u003e"},{"header":"Declarations","content":"\u003ch2\u003e \u003cb\u003eEthics declarations\u003c/b\u003e \u003c/h2\u003e \u003cp\u003e \u003cstrong\u003eEthics approval and consent to participate\u003c/strong\u003e \u003cp\u003eNo ethical approval was required for this study.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eConsent for publication\u003c/strong\u003e \u003cp\u003eNot applicable.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eCompeting interests\u003c/strong\u003e \u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e \u003c/p\u003e\u003ch2\u003eFunding\u003c/h2\u003e \u003cp\u003eNot applicable.\u003c/p\u003e\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eY.Z. conceived the project, designed and performed the in-silico experiments, conducted all data analysis, and drafted the manuscript. P.S. guided the work and revised the manuscript. J.Z. conducted wet laboratory experiments, including DNA extraction, amplification, Nanopore sequencing, and adaptive sequencing. K.F. performed the in-silico part of the adaptive sequencing wet lab experiment. Z.F. and X.B. revised the manuscript. M.N. offered advice and guidance on the study and revised the manuscript. Z.R. conceived the project, designed the in-silico experiments, and drafted the manuscript. All authors contributed to the article and approved the submitted version.\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eGenome Reference Consortium Human Build 38 can be obtained from https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.26/. 
The reference genomes of the mouse, fruit fly, mosquito, and tick are from https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001635.27/, https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001215.4, https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_943734635.1/, and https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_013339725.1/, respectively. The sequencing data of NA24385 can be obtained from https://github.com/marbl/HG002/blob/main/Sequencing_data.md. The sequencing data of NA12878 can be obtained from https://github.com/nanopore-wgs-consortium/NA12878/blob/master/Genome.md. The fruit fly sequencing data were obtained from the Oxford Nanopore Open Data Project (source: s3://ont-open-data/contrib/melanogaster_bkim_2023.01/flowcells/D.melanogaster.R1041.400bps/). The remaining sequencing data were obtained from the following SRA entries: mosquito (DRP012751), mouse (ERS20299361), tick (SRP565110), Zika virus (SRP072852), WNV (ERR6357505), Yersinia pestis (SRP576427), and CCHFV (ERP130784).\u003c/p\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003eCode availability\u003c/h2\u003e \u003cp\u003eThe GANBase software is available at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/renzilin/GANBase\u003c/span\u003e\u003cspan address=\"https://github.com/renzilin/GANBase\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e \u003c/div\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eMarotz CA, et al. Improving saliva shotgun metagenomics by chemical host DNA depletion. Microbiome. 2018;6:42.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHeravi FS, Zakrzewski M, Vickery K, Hu H. Host DNA depletion efficiency of microbiome DNA enrichment methods in infected tissue samples. 
J Microbiol Methods. 2020;170:105856.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGanda E et al. DNA Extraction and Host Depletion Methods Significantly Impact and Potentially Bias Bacterial Detection in a Biological Fluid. \u003cem\u003emSystems\u003c/em\u003e 6. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1128/msystems.00619-21\u003c/span\u003e\u003cspan address=\"10.1128/msystems.00619-21\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2021).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen Y-C, et al. Optimization of Metagenomic Next-Generation Sequencing Workflow with a Novel Host Depletion Method for Enhanced Pathogen Detection. Mol Diagn Ther. 2025;29:689\u0026ndash;99.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang C, et al. Benefits and challenges of host depletion methods in profiling the upper and lower respiratory microbiome. npj Biofilms Microbiomes. 2025;11:130.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCharalampous T, et al. Nanopore metagenomics enables rapid clinical diagnosis of bacterial lower respiratory infection. Nat Biotechnol. 2019;37:783\u0026ndash;92.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMiller S, et al. Laboratory validation of a clinical metagenomic sequencing assay for pathogen detection in cerebrospinal fluid. Genome Res. 2019;29:831\u0026ndash;42.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHasan MR, et al. Depletion of Human DNA in Spiked Clinical Specimens for Improvement of Sensitivity of Pathogen Detection by Next-Generation Sequencing. J Clin Microbiol. 2016;54:919\u0026ndash;27.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLoose M, Malla S, Stout M. Real-time selective sequencing using nanopore technology. Nat Methods. 
2016;13:751\u0026ndash;4.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDeamer DW, Akeson M. Nanopores and nucleic acids: prospects for ultrarapid sequencing. Trends Biotechnol. 2000;18:147\u0026ndash;51.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRestrepo-P\u0026eacute;rez L, Joo C, Dekker C. Paving the way to single-molecule protein sequencing. Nat Nanotech. 2018;13:786\u0026ndash;96.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMarquet M, et al. Evaluation of microbiome enrichment and host DNA depletion in human vaginal samples using Oxford Nanopore\u0026rsquo;s adaptive sequencing. Sci Rep. 2022;12:4000.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMeyer D, et al. Unlocking the full potential of nanopore sequencing: tips, tricks, and advanced data analysis techniques. Nucleic Acids Res. 2026;54:gkag023.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLoose M, Malla S, Stout M. Real-time selective sequencing using nanopore technology. Nat Methods. 2016;13:751\u0026ndash;4.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKovaka S, Fan Y, Ni B, Timp W, Schatz MC. Targeted nanopore sequencing by real-time mapping of raw electrical signal with UNCALLED. Nat Biotechnol. 2021;39:431\u0026ndash;41.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePayne A, et al. Readfish enables targeted nanopore sequencing of gigabase-sized genomes. Nat Biotechnol. 2021;39:442\u0026ndash;50.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShih PJ, Saadat H, Parameswaran S, Gamaarachchi H. Efficient real-time selective genome sequencing on resource-constrained devices. Gigascience. 2022;12:giad046.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHan R, Wang S, Gao X. Novel algorithms for efficient subsequence searching and mapping in nanopore raw signals towards targeted sequencing. Bioinformatics. 
2020;36:1333\u0026ndash;43.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAdaptive sampling. \u003cem\u003eOxford Nanopore Technologies\u003c/em\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://nanoporetech.com/document/adaptive-sampling\u003c/span\u003e\u003cspan address=\"https://nanoporetech.com/document/adaptive-sampling\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2020).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBao Y, et al. SquiggleNet: real-time, direct classification of nanopore signals. Genome Biol. 2021;22:298.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHe K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. in 2016 \u003cem\u003eIEEE Conference on Computer Vision and Pattern Recognition (CVPR)\u003c/em\u003e 770\u0026ndash;778 (2016). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1109/CVPR.2016.90\u003c/span\u003e\u003cspan address=\"10.1109/CVPR.2016.90\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSenanayake A, Gamaarachchi H, Herath D, Ragel R. DeepSelectNet: deep neural network based selective sequencing for oxford nanopore sequencing. BMC Bioinformatics. 2023;24:31.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDanilevsky A, Polsky AL, Shomron N. Adaptive sequencing using nanopores and deep learning of mitochondrial DNA. Brief Bioinform. 2022;23:bbac251.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSneddon A et al. Biochemical-free enrichment or depletion of RNA classes in real-time during direct RNA sequencing with RISER. 
2022.11.29.518281 Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1101/2022.11.29.518281\u003c/span\u003e\u003cspan address=\"10.1101/2022.11.29.518281\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLin Y, et al. NanoDeep: a deep learning framework for nanopore adaptive sampling on microbial sequencing. Brief Bioinform. 2023;25:bbad499.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFan K, et al. ReadCurrent: a VDCNN-based tool for fast and accurate nanopore selective sequencing. Brief Bioinform. 2024;25:bbae435.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShetty SH, Shetty S, Singh C, Rao A. Supervised Machine Learning: Algorithms and Applications. in \u003cem\u003eFundamentals and Methods of Machine and Deep Learning\u003c/em\u003e 1\u0026ndash;16 (John Wiley \u0026amp; Sons, Ltd, 2022). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1002/9781119821908.ch1\u003c/span\u003e\u003cspan address=\"10.1002/9781119821908.ch1\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhou Z-H. A brief introduction to weakly supervised learning. Natl Sci Rev. 2018;5:44\u0026ndash;53.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eReinforcement learning and optimal control | Dimitri Bertsekas. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://faculty.engineering.asu.edu/bertsekas/books/reinforcement-learning-and-optimal-control/\u003c/span\u003e\u003cspan address=\"https://faculty.engineering.asu.edu/bertsekas/books/reinforcement-learning-and-optimal-control/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eŚwiechowski M, Godlewski K, Sawicki B, Mańdziuk J. Monte Carlo Tree Search: A Review of Recent Modifications and Applications. Artif Intell Rev. 2023;56:2497\u0026ndash;562.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShi Y, Wang G, Lau HC-H, Yu J. Metagenomic Sequencing for Microbial DNA in Human Samples: Emerging Technological Advances. Int J Mol Sci. 2022;23:2181.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGuo Y, et al. Improvements and impacts of GRCh38 human reference on high throughput sequencing data analysis. Genomics. 2017;109:83\u0026ndash;90.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHuman genome reference builds -. GRCh38 or hg38 - b37 - hg19. \u003cem\u003eGATK\u003c/em\u003e \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://gatk.broadinstitute.org/hc/en-us/articles/360035890951-Human-genome-reference-builds-GRCh38-or-hg38-b37-hg19\u003c/span\u003e\u003cspan address=\"https://gatk.broadinstitute.org/hc/en-us/articles/360035890951-Human-genome-reference-builds-GRCh38-or-hg38-b37-hg19\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2024).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003enanopore-wgs-consortium/NA12878. nanopore-wgs-consortium. (2026).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVashisht V et al. Genomics for Emerging Pathogen Identification and Monitoring: Prospects and Obstacles. 
\u003cem\u003eBioMedInformatics\u003c/em\u003e 3, 1145\u0026ndash;1177 (2023).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCrooks GE, Hon G, Chandonia J-M, Brenner SE. WebLogo: A Sequence Logo Generator. Genome Res. 2004;14:1188\u0026ndash;90.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePiovesan A, et al. On the length, weight and GC content of the human genome. BMC Res Notes. 2019;12:106.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eEdwards HS, et al. Real-Time Selective Sequencing with RUBRIC: Read Until with Basecall and Reference-Informed Criteria. Sci Rep. 2019;9:1\u0026ndash;11.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJain M, Olsen HE, Paten B, Akeson M. The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community. Genome Biol. 2016;17:239.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094\u0026ndash;100.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJain M, et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol. 2018;36:338\u0026ndash;45.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYu L, Zhang W, Wang J, Yu Y. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/arXiv.1609.05473\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.1609.05473\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2017).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGoodfellow IJ et al. Generative adversarial nets. 
in \u003cem\u003eProceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2\u003c/em\u003e 2672\u0026ndash;2680 (MIT Press, Cambridge, MA, USA, 2014).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHochreiter S, Schmidhuber J. Long Short-term Memory. Neural Comput. 1997;9:1735\u0026ndash;80.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVaswani A et al. Attention Is All You Need. Preprint at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://arxiv.org/abs/1706.03762\u003c/span\u003e\u003cspan address=\"http://arxiv.org/abs/1706.03762\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2017).\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Nanopore sequencing, adaptive sequencing, read until, Generative Adversarial Network (GAN), real-time targeted sequencing","lastPublishedDoi":"10.21203/rs.3.rs-8931691/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8931691/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eNanopore adaptive sequencing enables real-time target enrichment, yet current deep-learning methods require costly, sample-specific experimental training data. To address this, we developed GANBase, a genome-guided generative adversarial learning framework, which is trained exclusively on reference sequences and incorporates Monte Carlo Tree Search-based Rollout strategy for model training. GANBase demonstrates robust performance in target enrichment and host depletion across diverse scenarios. In live adaptive sequencing experiments, it remains effective despite significant pore loss or flow cell version updates, providing a data-independent solution that significantly expands the utility of real-time targeted sequencing.\u003c/p\u003e","manuscriptTitle":"Genome-Guided Generative Adversarial Learning enables nanopore adaptive sequencing","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-02-27 02:35:19","doi":"10.21203/rs.3.rs-8931691/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"f1dde18f-cfdd-488a-8908-d4e7b353c45d","owner":[],"postedDate":"February 27th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2026-03-06T14:26:10+00:00","versionOfRecord":[],"versionCreatedAt":"2026-02-27 02:35:19","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8931691","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8931691","identity":"rs-8931691","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}