opjMap: A Sensitive Mapper for Repetitive Structural Variations in Long Noisy Reads Based on Orthogonal Projection

Research Square

Research Article

Xing-Guo Fan, Xiao-Dan Zhang, Cheng-Song Hu, Jie-Jie Zeng, Shu-Rui Li, and 1 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-7929852/v1 This work is licensed under a CC BY 4.0 License. Status: Under Revision. Version 1 posted 10

Abstract

Background

The continuous advancements in single-molecule sequencing (SMS) technologies, including PacBio Single Molecule Real-Time and Oxford Nanopore Technologies (ONT), have led to a significant increase in read lengths. This has unlocked tremendous potential for a wide range of cutting-edge genomic applications.
However, these long reads suffer from higher sequencing error rates and contain repetitive segments, making it challenging for most existing alignment tools to effectively map these repetitive regions. Given the crucial role that repetitive variations play in biological evolution, we introduce opjMap, an alignment tool based on orthogonal projection localization, which is specifically designed to align long, noisy SMS reads to a reference sequence while also accommodating repetitive structural variations (SVs).

Results

Through exhaustive benchmark experiments on both simulated and real SMS datasets, we demonstrate that opjMap exhibits higher sensitivity compared to other mainstream alignment tools like minimap2, NGMLR, and Winnowmap2, enabling it to align more reads and bases to the reference genome. Furthermore, opjMap produces a greater number of alignment results under challenging conditions of high error rates and short repetitive segments.

Conclusions

opjMap provides a robust and highly sensitive solution for mapping noisy long reads containing repetitive structural variations. opjMap supports multi-threaded alignment. The source code is publicly available for download at https://github.com/FanXingGuo/opjMap.

Keywords: high error rate, long-read alignment, orthogonal projection, repetitive variations, segmental duplication

Background

Sequence alignment is a fundamental technique in bioinformatics and serves as the cornerstone for subsequent biological sequence analyses[ 1 – 5 ]. Biological sequences, typically obtained through sequencing technologies, are continuous chains of nucleotides (such as DNA, RNA) or amino acids (such as proteins)[ 6 ]. These sequences encode the complete genetic information of an organism and are crucial for essential life activities, including growth, development, and metabolism.
The primary purpose of sequence alignment is to determine the similarity between biological sequences, which in turn facilitates the study of species homology and evolutionary processes[ 7 – 11 ]. Third-generation sequencing technologies are characterized by a high error rate, which makes it challenging to directly and accurately align the reads to a reference genome[ 12 ]. For this reason, most state-of-the-art third-generation sequence alignment algorithms utilize a seed-and-extend approach[ 13 ]. When sequencing errors are present, the overall read may not perfectly match a local region of the reference genome, but the two sequences will share numerous identical short substrings (seeds)[ 14 ]. A key principle is that a reference region containing more shared seeds is more likely to be the correct mapping location for the read. Based on the seeding strategy employed, existing third-generation alignment methods can be broadly categorized into two types: dynamic programming-based alignment and voting-based alignment. Dynamic programming-based alignment methods select regions with a high density of collinear seeds, effectively filtering out irrelevant seeds (noise) to enhance accuracy. These collinear seeds can serve as a basic skeleton for base-to-base alignment, which is why this approach is widely adopted. Several notable algorithms exemplify this strategy. GraphMap[ 15 ] utilizes a hash-based indexing technique and a conservative, stepwise filtering strategy for candidate regions, achieving high sensitivity and speed. Minimap2[ 16 ] pioneered the use of minimizers—seeds with the smallest hash values within a given window—to construct its index. This approach has demonstrated superior performance in aligning long reads with high error rates. NGMLR[ 17 ] employs a convex gap-scoring model to handle gaps between skeleton segments, enabling it to effectively align sequences with minor insertions, deletions, and large-scale structural variations. 
kngmap[ 18 ] focuses on identifying the maximum number of collinear seeds for localization, which allows it to align a greater number of reads and bases. Furthermore, it leverages gap lengths between skeleton seeds to identify structural variation types, demonstrating a capability to align a range of structural variants. Voting-based alignment methods, in contrast, statistically rank candidate windows by counting the number of shared seeds and selecting the top m windows for further analysis. This approach has lower computational complexity than dynamic programming and provides a more holistic view by considering multiple overlapping regions, but it is more susceptible to including noise. rHAT[ 19 ] improves alignment speed and quality by using overlapping windows on the reference genome and extracting k -mers from the read for efficient lookup. lordFAST[ 20 ] enhances this approach by considering not only the number of seeds during localization but also their length. By combining hash indexing with the FM-index[ 21 ], lordFAST achieves superior performance in both alignment speed and memory utilization. Despite the development of two main third-generation sequencing alignment approaches—dynamic programming-based and voting-based—to effectively handle long reads with high error rates, these methods still face limitations when confronted with duplication, such as interspersed repeats and segmental duplications. Specifically, dynamic programming algorithms struggle to process overlapping variant skeletons effectively, while voting-based methods are susceptible to noise, leading to reduced alignment sensitivity and quality for duplication. Duplication plays a crucial role in significant genomic structural changes and is fundamental to biological evolution[ 22 ]. However, the complex structure of duplication often compromises the sensitivity of existing third-generation alignment tools in detecting and aligning these variations[ 23 ]. 
Consequently, the development of specialized tools that can leverage the unique characteristics of repeat variation has become a pressing issue in the advancement of third-generation sequencing technologies[ 24 ]. To address this, we developed opjMap, a highly sensitive alignment tool based on orthogonal projection. opjMap is capable of aligning more bases and reads under high error rate conditions while also identifying a greater number of repetitive variations. The opjMap workflow primarily consists of five steps. First, an index of minimizers is built for the reference genome. Next, minimizers are extracted from the read, used to query the index to obtain matching anchors, and the positions of reverse-oriented anchors are recalculated. Subsequently, orthogonal projection is used to project the alignment skeleton onto a straight line, which is then partitioned into windows for a voting process. Different types of repetitive variations are then selected using windows of varying sizes. Finally, after further processing for each type of repetitive variation, a detailed alignment is performed to produce the complete alignment result. Experimental results demonstrate that opjMap exhibits higher sensitivity when aligning sequences with moderate-to-high error rates in both PacBio and ONT platform simulations of real-life data, while also being capable of aligning a greater number of repetitive variations.

Methods

Overview

opjMap employs an orthogonal projection method to align repetitive variations by projecting matched anchor points onto a straight line, followed by a window-based voting approach. The overall process, as illustrated in Fig. 1, consists of five key steps:

(a) Reference Genome Indexing: A hash-based index is constructed for the reference genome to enable efficient lookup of seeds.
(b) Generation of Minimizer Anchor Graph: Minimizers are extracted from the query read and used to construct an anchor graph.
(c) Orthogonal Projection and Voting: The anchors from the graph are orthogonally projected onto a straight line. This line is then partitioned into windows, and a voting strategy is applied to identify regions with a high density of projected anchors, which serve as alignment candidates.
(d) Localization of Repetitive and Non-Repetitive Regions: opjMap employs a refined localization strategy that uses two distinct window sizes, l and pl, to identify repetitive and non-repetitive regions.
(e) Refined Alignment and Result Merging: The candidate regions identified in step (d) are further refined by pruning the anchor skeleton. A detailed alignment is then performed based on this skeleton. Finally, the alignment results for both the seeded and non-seeded regions are merged to produce a complete alignment result.

Reference Genome Indexing

To facilitate rapid lookup, a hash-based index is constructed for the reference genome[ 25 ]. This process involves extracting minimizers from the reference and storing each minimizer along with its corresponding position in a hash table[ 26 ].

Generation of Minimizer Anchor Graph

Following index construction, minimizers are extracted from the read and are used to query the reference index. Each match is recorded as an anchor tuple m_i = (x_i, y_i, d_i), where x_i is the minimizer's position on the reference, y_i is its position on the read, and d_i represents its orientation (0 for forward, 1 for reverse). For reverse-oriented minimizers (as shown in Fig. 1b), the position y_i on the read is recalculated from the read length len(r) of the read r according to (1). The recalculated positions for reverse-oriented anchors are illustrated in Fig. 1c.
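The indexing and anchor lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not opjMap's implementation: the function names, the parameters k and w, and the use of lexicographic order in place of a hash function are all hypothetical, and only forward-strand matches (d = 0) are produced.

```python
from collections import defaultdict


def minimizers(seq, k=5, w=4):
    """Return sorted (position, k-mer) pairs, one minimizer per window of w k-mers.

    Assumption: the "smallest" k-mer is chosen lexicographically, leftmost on ties.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(max(len(kmers) - w + 1, 0)):
        window = kmers[start:start + w]
        best = min(range(w), key=lambda j: (window[j], j))
        picked.add((start + best, window[best]))
    return sorted(picked)


def build_index(reference, k=5, w=4):
    """Hash table mapping each reference minimizer to its positions."""
    index = defaultdict(list)
    for pos, kmer in minimizers(reference, k, w):
        index[kmer].append(pos)
    return index


def find_anchors(read, index, k=5, w=4):
    """Return anchors (x, y, d): reference position, read position, orientation.

    Simplification: only forward matches are reported, so d is always 0.
    """
    anchors = []
    for y, kmer in minimizers(read, k, w):
        for x in index.get(kmer, []):
            anchors.append((x, y, 0))
    return anchors
```

For an error-free read copied from the reference at offset 6, every read minimizer at position y yields an anchor at reference position y + 6, which is the collinear skeleton that the later steps try to recover.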
$$y_i \leftarrow \begin{cases} y_i, & d_i = 0 \\ len(r) - y_i, & d_i = 1 \end{cases} \tag{1}$$

Orthogonal Projection and Voting

After recalculating the positions of reverse-oriented anchors, all collinear skeleton anchors are positioned along a 45-degree upward-sloping line on the anchor graph. To facilitate the voting process, anchors are orthogonally projected onto a 45-degree downward-sloping line using (2). The projected anchors are recorded as proj_i = (projx_i, projy_i, d_i).

$$\begin{bmatrix} projx_i \\ projy_i \end{bmatrix} = \begin{bmatrix} \frac{1}{2} & -\frac{1}{2} \\ -\frac{1}{2} & \frac{1}{2} \end{bmatrix} \begin{bmatrix} x_i \\ y_i \end{bmatrix} \tag{2}$$

Anchors on the same 45-degree upward-sloping line share the same value of x_i - y_i, so after orthogonal projection a linear skeleton collapses to approximately a single point. The difference in anchor count between windows containing a skeleton and those without one is therefore significantly greater after orthogonal projection than it would be without it. This key characteristic allows us to effectively identify the skeleton-containing windows using non-overlapping windows. The projected line is then partitioned into windows of a fixed length l (default 1000 bp). The number of anchors falling into each window is counted, as illustrated in Fig. 1c, which forms the basis for the subsequent voting strategy.

Localization of Repetitive and Non-Repetitive Regions

opjMap employs a refined localization strategy that uses two distinct window sizes, l and pl (where p is a hyperparameter), to identify different types of anchor regions, as shown in Fig. 1d. This process avoids the limitations of a standard sliding window by first ranking all windows on the projected line based on their anchor count.
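The position recalculation in (1), the projection in (2), and the per-window anchor count can be sketched together as below. The function names are hypothetical, and binning on the projected x coordinate is an assumption, since the text does not state which coordinate of the projected line is partitioned into windows of length l.

```python
from collections import Counter


def recalc_read_pos(y, d, read_len):
    """Eq. (1): keep y for forward anchors, use read_len - y for reverse anchors."""
    return y if d == 0 else read_len - y


def project(x, y):
    """Eq. (2): orthogonal projection onto the line through the origin with slope -1."""
    return (0.5 * x - 0.5 * y, -0.5 * x + 0.5 * y)


def vote(anchors, read_len, window=1000):
    """Count projected anchors in non-overlapping windows of length `window`.

    Assumption: windows are laid out along the projected x coordinate.
    """
    votes = Counter()
    for x, y, d in anchors:
        y = recalc_read_pos(y, d, read_len)
        projx, _ = project(x, y)
        votes[int(projx // window)] += 1
    return votes


# Four anchors share x - y = 5000 after recalculation; one anchor is noise.
anchors = [(5000, 0, 0), (5100, 100, 0), (5200, 200, 0), (5300, 700, 1), (9000, 50, 0)]
print(vote(anchors, read_len=1000))
```

In this toy input the four collinear anchors all project to the same point and fall into one window, while the noise anchor lands in a different window, which is the contrast the voting step relies on.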
The strategy then selects the top m_1 windows of size l as candidates for repeats across distinct reference regions; these correspond to either non-repetitive alignments or interspersed repeats in which the duplicated segment lies outside the read's mapping region on the reference. Subsequently, after removing these selected windows, the algorithm re-evaluates the remaining regions using a larger window size of pl. The top m_2 windows of this size are then chosen as candidates for repeats within a single reference region, which are characteristic of complex structures like segmental duplications. This two-stage, multi-size window selection process effectively distinguishes unique alignment regions from those associated with duplication structural variations.

Refining the Alignment Skeleton and Detailed Alignment

Orthogonal projection and voting are initially used to identify regions containing linear skeletons. Since these initial skeletons often contain noise and can be incomplete, we address this by employing a dynamic programming algorithm to further process these regions and construct a refined alignment skeleton[ 27 ]. As shown in Fig. 2, this dynamic scoring algorithm can effectively: (a) remove irrelevant anchors (noise), (b) merge skeletons that span multiple windows to form a complete skeleton, and (c) construct skeletons for segmental repeats, which facilitates subsequent detailed alignment. Overlapping skeleton structures can be complex, and with relatively short fragments in the read, the skeleton information is often sparse, which can compromise the alignment quality. Therefore, opjMap extracts shorter minimizers of length 9 from these regions to construct a more informative and complete skeleton for detailed alignment. After a high-quality alignment skeleton is obtained, opjMap extends it at both ends to ensure the completeness of the alignment region.
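The two-stage window selection from the localization step earlier in this section can be sketched as follows. This is an illustrative reading of the description rather than opjMap's code: the function name and the default values of p, m1, and m2 are hypothetical (only l = 1000 is given in the text), and "removing the selected windows" is interpreted here as discarding the anchors that fell into them.

```python
from collections import Counter


def two_stage_select(coords, l=1000, p=3, m1=2, m2=1):
    """Select candidate windows from projected anchor coordinates.

    Stage 1: rank windows of size l by anchor count and keep the top m1.
    Stage 2: drop anchors in those windows, re-bin the rest with size p * l,
             and keep the top m2 larger windows.
    """
    small = Counter(int(c // l) for c in coords)
    stage1 = [win for win, _ in small.most_common(m1)]
    taken = set(stage1)

    remaining = [c for c in coords if int(c // l) not in taken]
    large = Counter(int(c // (p * l)) for c in remaining)
    stage2 = [win for win, _ in large.most_common(m2)]
    return stage1, stage2


# Two dense diagonals, one diffuse cluster spread over several small windows, one noise anchor.
coords = [2500] * 10 + [7100] * 6 + [12100, 12900, 13500, 14200, 20500]
print(two_stage_select(coords))
```

The diffuse cluster never wins a vote at window size l because its anchors are split across neighbouring windows, but it is recovered once the window is widened to pl, which matches the stated motivation for using two window sizes.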
Finally, a basic alignment algorithm is used for a detailed alignment of the non-seed regions[ 28 ], and the results are merged with those of the seed regions to produce a complete alignment (as shown in Fig. 1e).

Results

Overview

To evaluate the performance of opjMap, we conducted a comparative analysis against widely used long-read aligners: minimap2[ 29 ], NGMLR[ 30 ] and Winnowmap2[ 31 ]. All alignment methods were tested on both simulated and real single-molecule sequencing datasets. The experiments were performed on a server running the Ubuntu 22.04 operating system, equipped with 189 GB of RAM and two Intel Xeon E5-2686 v4 processors (2.30 GHz, 16 cores, 32 threads each).

Simulated Data Experiments

Alignment Evaluation for Non-Structural Variation Reads

To evaluate alignment performance, we used PBSIM2[ 32 ] to generate simulated reads with known reference positions, enabling a precise comparison of alignment quality. We generated four sets of simulated reads with varying error rates: 10% and 15% to mimic the PacBio platform, and 20% and 30% to represent the ONT platform. The reads were generated from the chromosome 1 sequence of H. sapiens. The commands used for generating these datasets are provided in Supplementary Table S1. Given the inherent error rate of sequencing reads, a base is considered correctly aligned if its mapped position on the reference genome differs from its true simulated position by no more than w bases (where w = 5). A read is considered correctly aligned if more than 90% of its bases are correctly mapped[ 33 ]. Base-level accuracy is defined as the ratio of correctly aligned bases to the total number of aligned bases[ 34 ], while sensitivity is the ratio of correctly aligned bases to the total number of bases in the simulated dataset.
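The correctness criterion and the base-level metrics defined above can be written down directly. The sketch below is illustrative only: the function names and the input layout (one pair of mapped and true positions per aligned base) are assumptions, not the evaluation scripts used in the study.

```python
def count_correct(aligned, w=5):
    """Number of aligned bases whose mapped position is within w of the true position."""
    return sum(1 for mapped, true in aligned if abs(mapped - true) <= w)


def base_level_metrics(aligned, total_bases, w=5):
    """Return (accuracy, sensitivity) at the base level.

    accuracy    = correct bases / aligned bases
    sensitivity = correct bases / all bases in the simulated dataset
    """
    aligned = list(aligned)
    correct = count_correct(aligned, w)
    accuracy = correct / len(aligned) if aligned else 0.0
    sensitivity = correct / total_bases if total_bases else 0.0
    return accuracy, sensitivity


def read_is_correct(aligned, read_len, w=5, min_fraction=0.9):
    """A read counts as correct if more than min_fraction of its bases are correct."""
    return count_correct(list(aligned), w) / read_len > min_fraction


# Four aligned bases out of five simulated ones; two lie within w = 5 of the truth.
pairs = [(100, 100), (105, 100), (106, 100), (200, 250)]
print(base_level_metrics(pairs, total_bases=5))
```

Because accuracy is normalised by aligned bases and sensitivity by all simulated bases, a tool that aligns fewer bases can score higher accuracy while scoring lower sensitivity.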
Similarly, read-level accuracy is the ratio of correctly aligned reads to the total number of aligned reads, and sensitivity is the ratio of correctly aligned reads to the total number of reads in the simulated dataset. The specific commands used for aligning with each tool are provided in Supplementary Table S2. The resulting alignment data are presented in Table 1. The values in parentheses in the Accuracy and Sensitivity columns indicate the difference, in percentage points, relative to opjMap. For example, minimap2's base-level accuracy at a 10% error rate is listed as 95.40(-0.13), where -0.13 signifies that it is 0.13 percentage points lower than that of opjMap. At a simulated error rate of 10%, opjMap demonstrated higher sensitivity in both base-level and read-level alignments compared to all other tools, with the exception of minimap2, to which it was slightly inferior. For error rates of 15%, 20% and 30%, opjMap consistently exhibited superior sensitivity at both base and read levels compared to the other aligners. Although Winnowmap2 and NGMLR achieved higher accuracy at the base level, their read-level alignment accuracy was consistently lower than that of opjMap. These results collectively suggest that the orthogonal projection-based opjMap offers high sensitivity under moderate to high error rate conditions, enabling it to align a greater number of bases and accurately map more reads.

Alignment Evaluation for Duplications Across Distinct Reference Regions

We evaluated the tools' ability to detect interspersed repeats located outside the read's corresponding gene. We generated sequences containing repeats using a custom script, randomly selecting the strand for each fragment. Unlike PBSIM2, Badread[ 35 ] can introduce sequencing errors into a short sequence, simulating its output under various error rates.
Using Badread, we added sequencing errors to the fragments (see Supplementary Table S3 for specific commands) and then used a script to select simulated sequences with repetitive variations that met our criteria. Due to the random nature of the simulation, the number of reads in each dataset varied. To select an appropriate error rate for comparison, we first tested the sensitivity and accuracy of different methods for aligning variations in 1000 bp sequences. The results are shown in Supplementary Fig. S1. opjMap demonstrated a significant lead in both accuracy and sensitivity under high error rates, with this gap only narrowing when the error rate dropped to between 15% and 10%.

Table 1 Results of different methods on simulated dataset

| Error Rate (Number of Reads) | Alignment Tool | Base Level: Number of Alignments (M) | Base Level: Correct Alignments (M) | Base Level: Accuracy (%) | Base Level: Sensitivity (%) | Read Level: Number of Alignments | Read Level: Correct Alignments | Read Level: Accuracy (%) | Read Level: Sensitivity (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10% (241144) | opjMap | 2,347 | 2,242 | 95.53 | 89.94 | 217354 | 216563 | 99.64 | 89.81 |
| 10% (241144) | minimap2 | 2,350 | 2,242 | 95.40(-0.13) | 89.96(+0.02) | 217318 | 216660 | 99.70(-0.06) | 89.85(+0.04) |
| 10% (241144) | Winnowmap2 | 2,273 | 2,231 | 98.14(+2.61) | 89.50(-0.44) | 215791 | 214798 | 99.54(-0.10) | 89.07(-0.74) |
| 10% (241144) | NGMLR | 2,234 | 2,225 | 99.57(+4.04) | 89.25(-0.69) | 216443 | 214058 | 98.90(-0.74) | 88.77(-1.04) |
| 15% (239772) | opjMap | 2,320 | 2,232 | 96.22 | 89.55 | 215443 | 213962 | 99.31 | 89.24 |
| 15% (239772) | minimap2 | 2,313 | 2,212 | 95.61(-0.61) | 88.74(-0.81) | 214111 | 211273 | 98.67(-0.64) | 88.11(-1.13) |
| 15% (239772) | Winnowmap2 | 2,112 | 2,064 | 97.72(+1.50) | 82.81(-6.74) | 198830 | 195513 | 98.33(-0.98) | 81.54(-7.70) |
| 15% (239772) | NGMLR | 2,182 | 2,170 | 99.45(+3.23) | 87.07(-2.48) | 212526 | 205323 | 96.61(-2.70) | 85.63(-3.61) |
| 20% (127587) | opjMap | 2,322 | 2,233 | 96.14 | 89.58 | 114860 | 114074 | 99.32 | 89.41 |
| 20% (127587) | minimap2 | 2,326 | 2,233 | 96.03(-0.11) | 89.60(-0.34) | 114850 | 114045 | 99.30(-0.02) | 89.39(-0.02) |
| 20% (127587) | Winnowmap2 | 2,224 | 2,205 | 99.17(+3.03) | 88.48(-1.46) | 113582 | 111987 | 98.60(-0.72) | 87.77(-1.64) |
| 20% (127587) | NGMLR | 2,221 | 2,206 | 99.35(+3.21) | 88.52(-1.42) | 114446 | 111828 | 97.71(-1.61) | 87.65(-1.76) |
| 30% (118334) | opjMap | 2,258 | 2,203 | 97.57 | 88.40 | 105809 | 104154 | 98.44 | 88.02 |
| 30% (118334) | minimap2 | 2,216 | 2,142 | 96.64(-0.93) | 85.93(-2.47) | 103884 | 98286 | 94.61(-3.83) | 83.06(-4.96) |
| 30% (118334) | Winnowmap2 | 1,386 | 1,357 | 97.94(+0.37) | 54.46(-33.94) | 69471 | 57197 | 82.33(-16.11) | 48.34(-39.68) |
| 30% (118334) | NGMLR | 2,052 | 2,036 | 99.25(+1.68) | 81.70(-6.70) | 101117 | 90927 | 89.92(-8.52) | 76.84(-11.18) |

We chose an error rate of 15% to test the alignment of repeats of different lengths. For this test, we set five different lengths for external repetitive variations: 100 bp, 500 bp, 1000 bp, 2500 bp, and 5000 bp. The average sequence length was 10,000 bp, with 3,000 sequences in each length group. For more detailed information on these two sets of reads with different error rates and lengths, please refer to Supplementary Tables S4 and S5. Due to the presence of base-level errors in sequencing reads, the position of repeat variations within the reads is affected. The experiment determined whether a repeat variation was detected by verifying if the read's corresponding variation position on the reference genome was aligned multiple times[ 36 ]. Specifically, we defined the position and orientation of an aligned read on the reference genome as G = (G_st, G_ed, G_d), and the true position of the variation on the reference as T = (T_st, T_ed, T_d). The alignment was considered correct if the number of non-empty intersections between G and T was greater than or equal to n + 1 and the alignment orientation was identical, where n is the number of repeats (n = 1). Figure 3 illustrates the accuracy and sensitivity of this detection at different lengths (for specific commands, see Supplementary Table S6, and for numerical values, see Supplementary Table S7). As the length of the repetitive variation region increases, the accuracy and sensitivity of the alignment tools also increase. Because the simulated repeat sequences were all fragments extracted from the reference genome, and their length was around 10,000 bp, most alignment tools were able to align the entire sequence.
This resulted in the sensitivity and accuracy of many results being identical. Throughout the length-based experiments, opjMap consistently outperformed other tools, achieving 100% sensitivity and accuracy in detecting repetitive variations when the repeat length was 5000 bp. Overall, opjMap maintained high accuracy and sensitivity in detecting duplications across distinct reference regions, regardless of variations in length or error rate. This indicates that opjMap is capable of identifying a greater number of inter-regional repetitive variations, even under conditions of high error rates and short variation lengths.

Alignment Evaluation for Duplications within a Single Reference Region

The experiments also tested the detection of repetitive regions located within the reads, with two distinct types of variations: interspersed repeats with a single duplication event and contiguous segmental duplications with multiple repeats.

Single Duplication Event

We evaluated the tools' detection capabilities by fixing the repeat fragment length at 1000 bp and varying the sequencing error rates. The detailed results are shown in Supplementary Fig. S2. At high sequencing error rates, opjMap maintained a high level of performance. As the error rate decreased, the alignment results of other tools began to approach those of opjMap. We then chose an error rate of 15% for the subsequent experiment, which was designed to test the alignment of repeat fragments of different lengths. For this, we generated sequences with this error rate, containing internal repeats of 100 bp, 500 bp, 1000 bp, 2500 bp, and 5000 bp. Detailed information on these two datasets with varying error rates and lengths can be found in Supplementary Tables S8 and S9. Figure 4 illustrates the accuracy and sensitivity at different fragment lengths, with specific numerical values available in Supplementary Table S10. opjMap showed higher alignment sensitivity and accuracy when the repeat fragments were short.
As the length of the repeat fragments increased, the performance of other tools approached that of opjMap. This indicates that opjMap is suitable for detecting interspersed repeats in a wide range of scenarios. Contiguous Segmental Duplication For the comparison of segmental duplication detection, a custom script was used to generate sequence fragments of five different lengths (100 bp, 250 bp, 500 bp, 750 bp, and 1000 bp), with each length repeated 10 times. We then introduced sequencing errors at a rate of 15% using the Badread tool. Detailed read information can be found in Supplementary Table S11. Given the high number of repeats, we fixed the fragment length at 1000 bp and initially tested the sensitivity for repeat judgment thresholds ( n ) of 3, 5, 7, and 10. The results are shown in Fig. 5 , with specific values available in Supplementary Table S12. From these results, it can be seen that Winnowmap2 is not well-suited for aligning segmental repeats. In contrast, opjMap maintained high sensitivity as the threshold n increased. A repeat judgment threshold ( n ) of 10 was selected to evaluate the performance of alignment tools on repeats. The results are shown in Fig. 6 , with specific numerical values available in Supplementary Table S13. As the figure illustrates, opjMap surpassed the other aligners in both accuracy and sensitivity for detecting segmental duplications. opjMap achieves this by extracting shorter sub-fragments of length 9 from overlapping regions. This demonstrates that constructing the alignment skeleton with shorter fragment information can effectively enhance the detection of repeat region information. Figure 7 presents a comparison of opjMap with three other tools, visualized using the IGV alignment visualization tool. Figure 7 a shows the true skeleton anchor graph for a segmental tandem repeat of 500 bp, repeated 10 times. From this, it can be seen that 6 of the repeats are in the forward direction and 4 are in the reverse direction. 
Figure 7b shows the alignment results for this sequence from all four tools. We can observe that opjMap successfully aligned 4 reverse-oriented and 5 forward-oriented segmental repeats. In comparison, NGMLR aligned 2 reverse-oriented and 4 forward-oriented repeats. Winnowmap2 failed to recognize this repeat region, and minimap2 produced only a small number of alignment results. These findings demonstrate that opjMap is capable of identifying a greater number of segmental repeats, yielding more comprehensive alignment results. This indicates that opjMap possesses a superior ability to align segmental repeat variations.

Real Data Experiments

Evaluation on Datasets Without Segmental Repeats

A comparison of alignment performance on real-world datasets was conducted using sequencing data from two platforms: PacBio and ONT. The PacBio dataset, from A. thaliana, contained 300,000 sequences, while the ONT dataset, from E. coli, contained 60,000 sequences. All experiments were run using 64 threads, and the alignment results are presented in the table below. As shown in Table 2, opjMap aligns a greater number of bases and reads on both the PacBio and ONT platforms, with CPU time, wall time, and peak memory comparable to those of minimap2 and well below those of Winnowmap2 and NGMLR. minimap2's performance is close to opjMap's, whereas NGMLR consumes significantly more resources.

Table 2 Results of different methods on real dataset.
| DataSet (Read number) | Aligner | Mapped bases | Mapped reads | CPU time (seconds) | Wall time (seconds) | Peak Memory (GB) |
| --- | --- | --- | --- | --- | --- | --- |
| PacBio (304718) | opjMap | 5492704427 | 292604 | 48032 | 990 | 26.3 |
| PacBio (304718) | minimap2 | 5456246662 | 290099 | 67325 | 1150 | 25.4 |
| PacBio (304718) | Winnowmap2 | 5251174215 | 280013 | 80924 | 1353 | 40.1 |
| PacBio (304718) | NGMLR | 4362237072 | 255632 | 321851 | 5134 | 39.2 |
| ONT (62094) | opjMap | 413018134 | 53917 | 691 | 15 | 13.8 |
| ONT (62094) | minimap2 | 412818972 | 53665 | 495 | 13 | 15.5 |
| ONT (62094) | Winnowmap2 | 403409572 | 52908 | 1119 | 26 | 28.4 |
| ONT (62094) | NGMLR | 365331379 | 49943 | 19117 | 395 | 39.0 |

Evaluation on Datasets With Segmental Repeats

To compare the performance of different alignment tools on segmental repeat variations in real-world sequencing data, we used long-read sequencing datasets from the human genomes T2T-CHM13 and HG002[ 37 ]. T2T-CHM13, considered the first complete and gapless human reference genome, serves as an ideal benchmark for evaluating and improving genomic alignment and variant calling algorithms. The HG002 dataset, on the other hand, consists of high-quality sequencing data from a real human sample. As existing structural variation benchmark sets lack sufficient information on segmental repetitive variations, we programmatically inserted 2,300 segmental repeat sequences into the T2T-CHM13 reference genome at regions corresponding to the original reads. The length distribution is shown in Supplementary Fig. S3.

Table 3 Comparison of Mappers for Segmental Repeat Detection on a Reference Genome

| Aligner | opjMap | minimap2 | NGMLR | Winnowmap2 |
| --- | --- | --- | --- | --- |
| Total | 2300 | 2297 | 1705 | 2140 |
| Correct | 1893 | 1878 | 44 | 1450 |
| Acc (%) | 82.3% | 81.76% | 2.58% | 67.46% |
| Sen (%) | 82.3% | 81.65% | 1.91% | 63.04% |

As shown in Table 3, opjMap achieved both an accuracy and sensitivity of 82.3%, outperforming all other alignment tools. minimap2 followed closely behind, Winnowmap2 trailed by a wide margin, and NGMLR performed poorly in aligning segmental repetitive variations. Notably, segmental repeats occurring within the reference genome are more challenging to detect than those in the reads.
Due to its orthogonal projection-based approach, opjMap exhibits higher sensitivity when dealing with a reference genome containing segmental repeats, allowing it to identify a greater number of variations. Discussion Alignment of repetitive structural variations in long reads with high error rates presents a significant challenge. When aligning such reads to a reference genome, the high error rate often leads to overlapping alignment skeletons, which many existing tools struggle to handle effectively. To overcome this issue, we propose opjMap, an alignment tool based on orthogonal projection. opjMap projects the linear alignment skeleton onto a straight line, enabling highly sensitive localization of the skeleton. This method allows opjMap to identify a greater number of reads on the reference genome. After locating the skeleton, opjMap extracts shorter minimizers from the repetitive regions to gather more detailed alignment information, thereby aligning a greater number of bases and improving overall alignment quality. opjMap achieves high localization sensitivity while maintaining a low computational complexity. Unlike dynamic programming algorithms, which perform scoring and backtracking on window anchors to select collinear seeds—with an optimized time complexity approaching O ( n log n ), where n is the number of anchors—opjMap's approach is more efficient. Because the number of windows is significantly smaller than the number of anchors, our method primarily focuses on projecting and counting each anchor, resulting in a time complexity closer to O ( n ). After the projection and voting step, opjMap utilizes radix sort to count the anchors within each window, selecting windows with a high vote count as alignment candidates. However, due to sequencing errors, two linear alignment skeletons within a read can become misaligned, which might lead to them being incorrectly projected into separate windows, thereby reducing read alignment sensitivity. 
To keep such misaligned skeletons together, opjMap's projection step increases the window length so that they fall within a single window. While this enhances read detection sensitivity, it makes the specific structural variation information within the window harder to identify, which lowers the sensitivity for detecting internal variations. In future work, we plan to develop targeted processing methods for the alignment skeletons within these voted windows to further improve the sensitivity of structural variation alignment.

Conclusions

In this work, we propose a novel orthogonal projection-based voting localization method. This approach avoids introducing excessive noise during candidate region selection, thereby satisfying the requirement for selecting collinear seeds. The method significantly reduces computational time complexity, and its use of orthogonal projection filters out noise, which benefits subsequent skeleton construction and detailed alignment. Experimental results demonstrate that our method can align a greater number of reads and bases under moderate-to-high sequencing error rates. It is also capable of aligning a higher number of repetitive variations, confirming its robustness and effectiveness.

Abbreviations

SMS: single-molecule sequencing
SMRT: Single Molecule Real-Time
ONT: Oxford Nanopore Technologies
SVs: Structural variations
RHT: Regional hash table
FM-index: Full-text Minute-space index

Declarations

Ethics approval and consent to participate: Not applicable.

Consent for publication: Not applicable.

Availability of data and material: All data in this paper are available in the supplementary file or from the corresponding author on reasonable request.

Competing interests: Not applicable.
Funding: This work was supported in part by the Scientific Research General Project of Wuhan Technology And Business University under Grant A2025044 and was also supported by the Special Fund of Advantageous and Characteristic Disciplines (Group) of Hubei Province.

Availability of data and materials: The datasets used in this study, along with the corresponding reference genomes, are publicly available from the NCBI and EBI repositories.

Real Datasets: Raw reads from Escherichia coli (ONT platform), Arabidopsis thaliana (PacBio platform), and Homo sapiens (PacBio platform) were obtained from the following sources:
E. coli: https://www.ncbi.nlm.nih.gov/sra/?term=SRR34757056%2F
A. thaliana: https://www.ncbi.nlm.nih.gov/sra/?term=ERR15092965
H. sapiens: https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/PacBio_CCS_15kb_20kb_chemistry2/reads/

Reference Genomes: The reference genomes for E. coli, A. thaliana, and H. sapiens can be accessed through these links:
E. coli: https://www.ebi.ac.uk/ena/browser/view/ERX987748
A. thaliana: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001735.4/
H. sapiens: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.40/

Authors’ contributions: Xing-Guo Fan developed the programming of the alignment tool and drafted the manuscript. Xiao-Dan Zhang carried out the revision of the manuscript. Cheng-Song Hu conducted the analysis of the experimental results. Jie-Jie Zeng and Shu-Rui Li executed the testing of the tool. Ze-Gang Wei provided the reference genome, reads, and computational infrastructure. All authors contributed to the conception and design of the study, discussed the results, and read, edited, and approved the final manuscript.

Acknowledgements: Not applicable.

References

1. Beran P, et al. KEC: unique sequence search by k-mer exclusion. Bioinf (Oxford England). 2021;37(19):btab196.
2. Charalampous T, et al. Nanopore metagenomics enables rapid clinical diagnosis of bacterial lower respiratory infection. Nat Biotechnol. 2019;37(7):783–92.
3. Wei Z-G, Zhang S-W. DMclust, a density-based modularity method for accurate OTU picking of 16S rRNA sequences. Mol Inf. 2017;36(12):1600059.
4. Wei Z-G, Zhang S-W. MtHc: a motif-based hierarchical method for clustering massive 16S rRNA sequences into OTUs. Mol BioSyst. 2015;11(7):1907–13.
5. Smith AD, Xuan Z, Zhang MQ. Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics. 2008;9(1):128.
6. Hedges DJ, et al. Evidence of novel fine-scale structural variation at autism spectrum disorder candidate loci. Mol Autism. 2012;3:1–11.
7. Pan B, et al. Similarities and differences between variants called with human reference genome HG19 or HG38. BMC Bioinformatics. 2019;20:17–29.
8. Ferragina P, Manzini G. Opportunistic data structures with applications. In: Symposium on Foundations of Computer Science; 2000.
9. Li H, Homer N. A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform. 2010;11(5):473.
10. Zhang H, et al. Fast and efficient short read mapping based on a succinct hash index. BMC Bioinformatics. 2018;19(1):92.
11. Kaur H, Chand L. Biological sequence alignment using varied optimization algorithms. International Conference on Inventive Computation Technologies. Berlin: Springer; 2016. pp. 1–5.
12. Xu X, et al. SLPal: Accelerating long sequence alignment on many-core and multi-core architectures. 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE; 2020. pp. 2242–2249.
13. Poerba YS, Martanti D. Genetic variability of Amorphophallus muelleri Blume in Java based on random amplified polymorphic DNA. Biodiversitas J Biol Divers. 2008;9(4).
14. Savage DG, et al. Clinical features at diagnosis in 430 patients with chronic myeloid leukaemia seen at a referral centre over a 16-year period. Br J Haematol. 1997;96(1):111–6.
15. Sović I, et al. Fast and sensitive mapping of nanopore sequencing reads with GraphMap. Nat Commun. 2016;7:11307.
16. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34(18):3094–100.
17. Sedlazeck FJ, et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods. 2018;15(6).
18. Wei ZG, et al. kngMap: sensitive and fast mapping algorithm for noisy long reads based on the K-Mer neighborhood graph. Front Genet. 2022;13:890651.
19. Liu B, et al. rHAT: fast alignment of noisy long reads with regional hashing. Bioinformatics. 2016;32(11):1625–31.
20. Haghshenas E, et al. lordFAST: sensitive and fast alignment search tool for long noisy read sequencing data. Bioinformatics. 2019;35(1):20–7.
21. Lippert RA. Space-efficient whole genome comparisons with Burrows–Wheeler transforms. J Comput Biol. 2005;12(4):407–15.
22. Takahashi KK, Innan H. Duplication with structural modification through extrachromosomal circular and lariat DNA in the human genome. Sci Rep. 2020;10(1):7150.
23. Rasko DA, et al. Origins of the E. coli strain causing an outbreak of hemolytic–uremic syndrome in Germany. N Engl J Med. 2011;365(8):709–17.
24. Murray IA, et al. The methylomes of six bacteria. Nucleic Acids Res. 2012;40(22):11450–62.
25. Ning Z, et al. SSAHA: a fast search method for large DNA databases. Genome Res. 2001;11(10):1725–9.
26. Roberts M, et al. Reducing storage requirements for biological sequence comparison. Bioinformatics. 2004;20:3363–9.
27. Liu B, et al. deBGA: read alignment with de Bruijn graph-based seed and extension. Bioinformatics. 2016;32(21):3224–32.
28. Wei ZG, et al. invMap: a sensitive mapping tool for long noisy reads with inversion structural variants. Bioinformatics. 2023;39(12):btad726.
29. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34(18):3094–100.
30. Sedlazeck FJ, et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods. 2018;15(6).
31. Jain C, Rhie A, Hansen NF, et al. Long-read mapping to repetitive reference sequences using Winnowmap2. 2022;19:705–10.
32. Ono Y, Asai K, Hamada M. PBSIM2: a simulator for long-read sequencers with a novel generative model of quality scores. 2020.
33. Wei Z-G, Zhang S-W. NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model. BMC Bioinformatics. 2018;19(1):177.
34. Wei Z-G, Zhang S-W, Liu F. smsMap: mapping single molecule sequencing reads by locating the alignment starting positions. BMC Bioinformatics. 2020;21(1):341.
35. Wick RR. Badread: simulation of error-prone long reads. J Open Source Softw. 2019;4(36):1316.
36. Wei ZG, et al. invMap: a sensitive mapping tool for long noisy reads with inversion structural variants. Bioinformatics. 2023;39(12):btad726.
37. Vollger MR, et al. Segmental duplications and their variation in a complete human genome. Science. 2022;376:eabj6965.

Additional Declarations: No competing interests reported.

Supplementary Files: supplymentaryfile.docx

Status: Under Revision
Version 1 posted
Editorial decision: Revision requested, 05 Jan, 2026
Reviews received at journal: 27 Dec, 2025
Reviews received at journal: 20 Dec, 2025
Reviewers agreed at journal: 26 Nov, 2025
Reviewers agreed at journal: 18 Nov, 2025
Reviewers invited by journal: 17 Nov, 2025
Editor assigned by journal: 06 Nov, 2025
Editor invited by journal: 06 Nov, 2025
Submission checks completed at journal: 06 Nov, 2025
First submitted to journal: 05 Nov, 2025
Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-7929852","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":549571258,"identity":"783738de-200f-4a96-bd42-616b473aa801","order_by":0,"name":"Xing-Guo Fan","email":"","orcid":"","institution":"Wuhan Technology and Business University","correspondingAuthor":false,"prefix":"","firstName":"Xing-Guo","middleName":"","lastName":"Fan","suffix":""},{"id":549571259,"identity":"f5743ec4-f5bc-437f-a414-5ae83934e930","order_by":1,"name":"Xiao-Dan Zhang","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABAklEQVRIie3QsWrDMBCAYQWDvFyT9UrS5hUuBEwHQ16gD3EmoMlDIJClGQwGe8zq1yiFzgkCTXmAQDx08tTB2QrN0MYeW5l066B/uEHwoZOEcLn+YQPP2xITgvTT3ZsQ28shdZLbPONFvQhHfTBzEnwFof2eTkWtwnt8DPAqIg48eQHSIFGqp5uPMkr89BXFurSKXsHzaUOGqTkCV1ECZoXCVFbiIZuWjIw6xqyjBOMAe4m2EolR/tkuFgfLhozfuwmAFpOCVEO89hboJuhngmoKQX5/8vCsqmkGavnAxk5melATn3E2ztPdqQjLu42vnw/12k5+ed1l8B+Ay+VyuX72BWHGV4HrSIVlAAAAAElFTkSuQmCC","orcid":"","institution":"Baoji University of Arts and Sciences","correspondingAuthor":true,"prefix":"","firstName":"Xiao-Dan","middleName":"","lastName":"Zhang","suffix":""},{"id":549571260,"identity":"e457d462-91e9-4792-a568-737b4a12a20f","order_by":2,"name":"Cheng-Song Hu","email":"","orcid":"","institution":"Wuhan Technology and Business 
University","correspondingAuthor":false,"prefix":"","firstName":"Cheng-Song","middleName":"","lastName":"Hu","suffix":""},{"id":549571261,"identity":"4c65c53b-a63e-4d05-a79f-4a17bc653812","order_by":3,"name":"Jie-Jie Zeng","email":"","orcid":"","institution":"Wuhan Technology and Business University","correspondingAuthor":false,"prefix":"","firstName":"Jie-Jie","middleName":"","lastName":"Zeng","suffix":""},{"id":549571262,"identity":"2613a35b-7749-4163-a60f-02d43fed1f0a","order_by":4,"name":"Shu-Rui Li","email":"","orcid":"","institution":"Wuhan Technology and Business University","correspondingAuthor":false,"prefix":"","firstName":"Shu-Rui","middleName":"","lastName":"Li","suffix":""},{"id":549571263,"identity":"ef5571f2-2c3e-498f-92f1-195d8fcae491","order_by":5,"name":"Ze-Gang Wei","email":"","orcid":"","institution":"Baoji University of Arts and Sciences","correspondingAuthor":false,"prefix":"","firstName":"Ze-Gang","middleName":"","lastName":"Wei","suffix":""}],"badges":[],"createdAt":"2025-10-23 08:08:23","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-7929852/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-7929852/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":96806617,"identity":"66e6921d-e704-46a3-8f2e-3ba84534a544","added_by":"auto","created_at":"2025-11-26 09:19:47","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":2301839,"visible":true,"origin":"","legend":"","description":"","filename":"opjMapBMC20251031.docx","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/88e02b1b03490cfce8183a85.docx"},{"id":96918178,"identity":"c9253fce-c87a-47db-a575-acf47324900e","added_by":"auto","created_at":"2025-11-27 
14:11:15","extension":"json","order_by":1,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":8370,"visible":true,"origin":"","legend":"","description":"","filename":"7465cff4d5b444239f4de212326ab028.json","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/12cc66423270f5858d279105.json"},{"id":96806613,"identity":"3c754e13-fe9b-4109-ba04-ea0b5c99d475","added_by":"auto","created_at":"2025-11-26 09:19:47","extension":"docx","order_by":2,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":1308431,"visible":true,"origin":"","legend":"","description":"","filename":"supplymentaryfile.docx","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/1d3d2eb6be668648d0d351e7.docx"},{"id":96918650,"identity":"e0f44e75-a3fd-4a71-88c0-0aa188c7e82b","added_by":"auto","created_at":"2025-11-27 14:12:15","extension":"xml","order_by":3,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":107611,"visible":true,"origin":"","legend":"","description":"","filename":"7465cff4d5b444239f4de212326ab0281enriched.xml","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/cea633f78d41ac043c283be3.xml"},{"id":96917499,"identity":"d603c41b-320a-4136-9987-4e15e48a4fe3","added_by":"auto","created_at":"2025-11-27 14:09:53","extension":"emf","order_by":5,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":3198228,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage1.emf","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/6b1d8ebdc5525ec2faa143c2.emf"},{"id":96918254,"identity":"32368e86-d788-4a9d-9c3c-924a93cd1abe","added_by":"auto","created_at":"2025-11-27 
14:11:32","extension":"emf","order_by":6,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":1665784,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage2.emf","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/d8b0f049841791f08a6a519d.emf"},{"id":96806618,"identity":"3d590171-b2b0-4fa8-be3a-4c6b5069ea48","added_by":"auto","created_at":"2025-11-26 09:19:47","extension":"png","order_by":7,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":188219,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/f0d00bb11a9359137e8412bd.png"},{"id":96806624,"identity":"013be218-1226-44de-b6a5-9774490607ee","added_by":"auto","created_at":"2025-11-26 09:19:47","extension":"png","order_by":8,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":182230,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/2fdb48c4e484cd09f177b4e9.png"},{"id":96806628,"identity":"909c7660-ff67-4dc4-ad42-87e0be6428b5","added_by":"auto","created_at":"2025-11-26 09:19:47","extension":"png","order_by":9,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":183025,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/719d9f57cda64614b929b185.png"},{"id":96806627,"identity":"a25c2e19-26fe-45a3-aa58-b40fdb7f1b18","added_by":"auto","created_at":"2025-11-26 
09:19:47","extension":"emf","order_by":10,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":1577124,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage6.emf","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/e144ba1378ae5ca02e085b4a.emf"},{"id":96917353,"identity":"30d8170f-295a-41a2-a73a-f6b9ddcf7482","added_by":"auto","created_at":"2025-11-27 14:09:35","extension":"png","order_by":11,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":42474,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/8887dfc42be181bdb35c13e8.png"},{"id":96916980,"identity":"f1475504-2517-480c-8935-6a26e030ef3b","added_by":"auto","created_at":"2025-11-27 14:09:06","extension":"png","order_by":12,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":20132,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/95957c5691640f090a0c2712.png"},{"id":96918682,"identity":"386872c3-d2f7-4ab5-babf-e7e36e2fcfa4","added_by":"auto","created_at":"2025-11-27 14:12:20","extension":"png","order_by":13,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":70023,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/9a238a383fb0262c792653c3.png"},{"id":96806620,"identity":"c54a6103-f4d4-4c07-90f7-52e94179f7f7","added_by":"auto","created_at":"2025-11-26 
09:19:47","extension":"png","order_by":14,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":69072,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/9eef0d306c06260ee47cd618.png"},{"id":96806621,"identity":"4e02b448-fcb2-4a3e-9e3e-6d172dcf4748","added_by":"auto","created_at":"2025-11-26 09:19:47","extension":"png","order_by":15,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":72044,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/59c22eb91867db8c2cf405bb.png"},{"id":96806625,"identity":"7bad6ac5-7c9f-4f22-b6a4-3ccf14c06907","added_by":"auto","created_at":"2025-11-26 09:19:47","extension":"png","order_by":16,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":48710,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage6.png","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/d8d6ef80613374694444e66a.png"},{"id":96918028,"identity":"7a6467de-f162-448c-8f3e-6765fe9543cb","added_by":"auto","created_at":"2025-11-27 14:11:01","extension":"xml","order_by":17,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":107973,"visible":true,"origin":"","legend":"","description":"","filename":"7465cff4d5b444239f4de212326ab0281structuring.xml","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/48ab83b58430dae8681d397b.xml"},{"id":96917963,"identity":"79593823-6e16-48bd-a360-2ace0a5d629c","added_by":"auto","created_at":"2025-11-27 
14:10:52","extension":"html","order_by":18,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":116061,"visible":true,"origin":"","legend":"","description":"","filename":"earlyproof.html","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/84d01d1973708c5641411dd2.html"},{"id":96806604,"identity":"a5be5344-f781-4f6f-99d0-e18dfc34cac7","added_by":"auto","created_at":"2025-11-26 09:19:46","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":126299,"visible":true,"origin":"","legend":"\u003cp\u003eSchematic representation of stages in opjMap. (a) Building a searching index of the reference genome. (b) Generation of a minimizer anchor graph. Anchors in the forward direction are shown in gray, while reverse-oriented anchors are shown in red. The positions of all reverse-oriented anchors are then recalculated to align them to a common axis. (c) Orthogonal Projection and Voting: After position recalculation, all anchors are orthogonally projected onto a single line. This line is partitioned into windows, and a voting process is performed to identify high-density anchor regions, as shown by the bar chart. (d) Localization of Repetitive and Non-Repetitive Regions: Based on the voting results, opjMap differentiates between non-repetitive regions and two types of repetitive regions. This step yields the raw alignment skeleton. 
(e) Refinement of the Alignment Skeleton and\u003c/p\u003e\n\u003cp\u003eDetailed Alignment.\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/148b825f400c94848aad3947.png"},{"id":96918145,"identity":"efe4b8e2-f5e2-422f-82ca-b276a4cd6217","added_by":"auto","created_at":"2025-11-27 14:11:12","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":59017,"visible":true,"origin":"","legend":"\u003cp\u003ePruning the raw skeleton using a dynamic programming algorithm. The blue shaded areas represent the selected windows containing the raw linear skeletons: (a) Denoising the raw skeleton; (b) Constructing a skeleton across multiple windows; (\u003cstrong\u003ec\u003c/strong\u003e) Constructing skeletons for segmental repeats.\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/d8d098b2b68e77bbf9d09d66.png"},{"id":96806606,"identity":"3db5fcca-10f0-419f-b8b8-64f81a76e5ea","added_by":"auto","created_at":"2025-11-26 09:19:46","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":56005,"visible":true,"origin":"","legend":"\u003cp\u003eThe accuracy and sensitivity for duplications across distinct reference regions of this detection method under different lengths.\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/1db09ba1833927aa87e5d9a7.png"},{"id":96917518,"identity":"926924f3-0f5d-487d-8187-c09b461feea3","added_by":"auto","created_at":"2025-11-27 14:09:56","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":56133,"visible":true,"origin":"","legend":"\u003cp\u003eThe accuracy and sensitivity for duplication within a single reference region under different 
lengths\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/7063210c7476686fd7426bfb.png"},{"id":96806612,"identity":"ba5281a4-2271-49f9-999a-e68c863eae77","added_by":"auto","created_at":"2025-11-26 09:19:47","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":18066,"visible":true,"origin":"","legend":"\u003cp\u003eComparison of detection sensitivity for segmental duplications across different thresholds (\u003cem\u003en\u003c/em\u003e)\u003c/p\u003e","description":"","filename":"5.png","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/b6fbad84b7b68eb880c4ed7f.png"},{"id":96806610,"identity":"4e0a0bd0-119c-4406-826c-2fdf400dec91","added_by":"auto","created_at":"2025-11-26 09:19:47","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":56632,"visible":true,"origin":"","legend":"\u003cp\u003eThe accuracy and sensitivity of the alignment tools in handling repeats (\u003cem\u003en\u003c/em\u003e= 10)\u003c/p\u003e","description":"","filename":"6.png","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/41a9a09bdc2241704896705f.png"},{"id":96806609,"identity":"a5320a03-d819-44e2-9129-accf5866aa70","added_by":"auto","created_at":"2025-11-26 09:19:46","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":237939,"visible":true,"origin":"","legend":"\u003cp\u003eVisualization of alignment results for segmental repeat regions. (a) True skeleton anchor graph for a segmental tandem repeat of 500 bp, repeated 10 times, showing a total of 10 repeat segments. (b) Alignment results for the same read using four different tools. 
opjMap produced 9 segmental repeats, NGMLR produced 6, while minimap2 and Winnowmap2 yielded only 3 and 0 alignment results, respectively.\u003c/p\u003e","description":"","filename":"7.png","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/4371669b3948ea5f00728c4d.png"},{"id":97135992,"identity":"0e62508f-69f7-4555-ae61-aa6cfd8f4f75","added_by":"auto","created_at":"2025-12-01 09:54:54","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1725840,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/5710e685-9171-404e-945b-cce32a2e23c0.pdf"},{"id":96917199,"identity":"1450f81d-446f-4003-9e78-abeed1c32ae0","added_by":"auto","created_at":"2025-11-27 14:09:21","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":1308431,"visible":true,"origin":"","legend":"","description":"","filename":"supplymentaryfile.docx","url":"https://assets-eu.researchsquare.com/files/rs-7929852/v1/ae5ebad1af50775e026307ea.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"opjMap: A Sensitive Mapper for Repetitive Structural Variations in Long Noisy Reads Based on Orthogonal Projection","fulltext":[{"header":"Background","content":"\u003cp\u003eSequence alignment is a fundamental technique in bioinformatics and serves as the cornerstone for subsequent biological sequence analyses[\u003cspan additionalcitationids=\"CR2 CR3 CR4\" citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]. Biological sequences, typically obtained through sequencing technologies, are continuous chains of nucleotides (such as DNA, RNA) or amino acids (such as proteins)[\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]. 
These sequences encode the complete genetic information of an organism and are crucial for essential life activities, including growth, development, and metabolism. The primary purpose of sequence alignment is to determine the similarity between biological sequences, which in turn facilitates the study of species homology and evolutionary processes[\u003cspan additionalcitationids=\"CR8 CR9 CR10\" citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eThird-generation sequencing technologies are characterized by a high error rate, which makes it challenging to directly and accurately align the reads to a reference genome[\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]. For this reason, most state-of-the-art third-generation sequence alignment algorithms utilize a seed-and-extend approach[\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]. When sequencing errors are present, the overall read may not perfectly match a local region of the reference genome, but the two sequences will share numerous identical short substrings (seeds)[\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]. A key principle is that a reference region containing more shared seeds is more likely to be the correct mapping location for the read. Based on the seeding strategy employed, existing third-generation alignment methods can be broadly categorized into two types: dynamic programming-based alignment and voting-based alignment.\u003c/p\u003e\u003cp\u003eDynamic programming-based alignment methods select regions with a high density of collinear seeds, effectively filtering out irrelevant seeds (noise) to enhance accuracy. These collinear seeds can serve as a basic skeleton for base-to-base alignment, which is why this approach is widely adopted. 
Several notable algorithms exemplify this strategy. GraphMap[\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e] utilizes a hash-based indexing technique and a conservative, stepwise filtering strategy for candidate regions, achieving high sensitivity and speed. Minimap2[\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e] pioneered the use of minimizers\u0026mdash;seeds with the smallest hash values within a given window\u0026mdash;to construct its index. This approach has demonstrated superior performance in aligning long reads with high error rates. NGMLR[\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e] employs a convex gap-scoring model to handle gaps between skeleton segments, enabling it to effectively align sequences with minor insertions, deletions, and large-scale structural variations. kngmap[\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e] focuses on identifying the maximum number of collinear seeds for localization, which allows it to align a greater number of reads and bases. Furthermore, it leverages gap lengths between skeleton seeds to identify structural variation types, demonstrating a capability to align a range of structural variants.\u003c/p\u003e\u003cp\u003eVoting-based alignment methods, in contrast, statistically rank candidate windows by counting the number of shared seeds and selecting the top \u003cem\u003em\u003c/em\u003e windows for further analysis. This approach has lower computational complexity than dynamic programming and provides a more holistic view by considering multiple overlapping regions, but it is more susceptible to including noise. rHAT[\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e] improves alignment speed and quality by using overlapping windows on the reference genome and extracting \u003cem\u003ek\u003c/em\u003e-mers from the read for efficient lookup. 
lordFAST[\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e] enhances this approach by considering not only the number of seeds during localization but also their length. By combining hash indexing with the FM-index[\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e], lordFAST achieves superior performance in both alignment speed and memory utilization.\u003c/p\u003e\u003cp\u003eDespite the development of two main third-generation sequencing alignment approaches\u0026mdash;dynamic programming-based and voting-based\u0026mdash;to effectively handle long reads with high error rates, these methods still face limitations when confronted with duplication, such as interspersed repeats and segmental duplications. Specifically, dynamic programming algorithms struggle to process overlapping variant skeletons effectively, while voting-based methods are susceptible to noise, leading to reduced alignment sensitivity and quality for duplication.\u003c/p\u003e\u003cp\u003eDuplication plays a crucial role in significant genomic structural changes and is fundamental to biological evolution[\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]. However, the complex structure of duplication often compromises the sensitivity of existing third-generation alignment tools in detecting and aligning these variations[\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e]. Consequently, the development of specialized tools that can leverage the unique characteristics of repeat variation has become a pressing issue in the advancement of third-generation sequencing technologies[\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]. To address this, we developed opjMap, a highly sensitive alignment tool based on orthogonal projection. 
opjMap is capable of aligning more bases and reads under high error rate conditions while also identifying a greater number of repetitive variations. The opjMap workflow primarily consists of five steps. First, an index of minimizers is built for the reference genome. Next, minimizers are extracted from the read and used to query the index to obtain matching anchors, and the positions of reverse-oriented anchors are recalculated. Subsequently, orthogonal projection is used to project the alignment skeleton onto a straight line, which is then partitioned into windows for a voting process. Different types of repetitive variations are then selected using windows of varying sizes. Finally, after further processing for each type of repetitive variation, a detailed alignment is performed to produce the complete alignment result. Experimental results demonstrate that opjMap exhibits higher sensitivity when aligning sequences with moderate-to-high error rates in both PacBio and ONT platform simulations of real-life data, while also being capable of aligning a greater number of repetitive variations.\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\u003ch2\u003eOverview\u003c/h2\u003e\u003cp\u003eopjMap employs an orthogonal projection method to align repetitive variations by projecting matched anchor points onto a straight line, followed by a window-based voting approach. The overall process, as illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, consists of five key steps: (a) Reference Genome Indexing: A hash-based index is constructed for the reference genome to enable efficient lookup of seeds. (b) Generation of Minimizer Anchor Graph: Minimizers are extracted from the query read and used to construct an anchor graph. (c) Orthogonal Projection and Voting: The anchors from the graph are orthogonally projected onto a straight line. 
This line is then partitioned into windows, and a voting strategy is applied to identify regions with a high density of projected anchors, which serve as alignment candidates. (d) Localization of Repetitive and Non-Repetitive Regions: opjMap employs a refined localization strategy that uses two distinct window sizes, \u003cem\u003el\u003c/em\u003e and \u003cem\u003epl\u003c/em\u003e, to identify repetitive and non-repetitive regions. (e) Refined Alignment and Result Merging: The candidate regions identified in step (d) are further refined by pruning the anchor skeleton. A detailed alignment is then performed based on this skeleton. Finally, the alignment results for both the seeded and non-seeded regions are merged to produce a complete alignment result.\u003c/p\u003e\u003cp\u003e\u003cb\u003eReference Genome Indexing\u003c/b\u003e\u003c/p\u003e\u003cp\u003eTo facilitate rapid lookup, a hash-based index is constructed for the reference genome[\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e]. This process involves extracting minimizers from the reference and storing each minimizer along with its corresponding position in a hash table[\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e].\u003c/p\u003e\u003c/div\u003e\n\u003ch3\u003eGeneration of Minimizer Anchor Graph\u003c/h3\u003e\n\u003cp\u003eFollowing index construction, minimizers are extracted from the read and are used to query the reference index. 
Each match is recorded as an anchor tuple \u003cem\u003em\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e = (\u003cem\u003ex\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e, \u003cem\u003ey\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e, \u003cem\u003ed\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e), where \u003cem\u003ex\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e is the minimizer's position on the reference, \u003cem\u003ey\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e is its position on the read, and \u003cem\u003ed\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e represents its orientation (0 for forward, 1 for reverse). For reverse-oriented minimizers (as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eb), the position \u003cem\u003ey\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e on the read is recalculated. Given the read \u003cem\u003er\u003c/em\u003e with length \u003cem\u003elen\u003c/em\u003e(\u003cem\u003er\u003c/em\u003e), the new position is computed using Eq. (1), which leaves forward anchors unchanged and maps a reverse anchor to \u003cem\u003elen\u003c/em\u003e(\u003cem\u003er\u003c/em\u003e)\u0026thinsp;\u0026minus;\u0026thinsp;\u003cem\u003ey\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e. 
The recalculated positions for reverse-oriented anchors are illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ec.\u003cdiv id=\"Equ1\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ1\" name=\"EquationSource\"\u003e\n$$\\left\\{ {\\begin{array}{*{20}{l}} {{y_i},}\u0026amp;{{d_i}=0} \\\\ {len(r) - {y_i},}\u0026amp;{{d_i}=1} \\end{array}} \\right.$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e1\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\n\u003ch3\u003eOrthogonal Projection and Voting\u003c/h3\u003e\n\u003cp\u003eAfter recalculating the positions of reverse-oriented anchors, all collinear skeleton anchors are positioned along a 45-degree upward-sloping line on the anchor graph. To facilitate the voting process, anchors are orthogonally projected onto a 45-degree downward-sloping line using (2). The projected anchors are recorded as \u003cem\u003eproj\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e = (\u003cem\u003eprojx\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e, \u003cem\u003eprojy\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e, \u003cem\u003ed\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e). After orthogonal projection, a linear skeleton approximates a single focal point. The difference in anchor count between windows containing a skeleton and those without one is significantly greater after orthogonal projection than it would be without it. This key characteristic allows us to effectively identify the skeleton-containing windows using non-overlapping windows. The projected line is then partitioned into windows of a fixed length \u003cem\u003el\u003c/em\u003e (default 1000 bp). 
The number of anchors falling into each window is counted, as illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ec, which forms the basis for the subsequent voting strategy.\u003cdiv id=\"Equ2\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ2\" name=\"EquationSource\"\u003e\n$$\\left[ {\\begin{array}{*{20}{c}} {proj{x_i}} \\\\ {proj{y_i}} \\end{array}} \\right]=\\left[ {\\begin{array}{*{20}{c}} {\\frac{1}{2}}\u0026amp;{ - \\frac{1}{2}} \\\\ { - \\frac{1}{2}}\u0026amp;{\\frac{1}{2}} \\end{array}} \\right]\\left[ {\\begin{array}{*{20}{c}} {{x_i}} \\\\ {{y_i}} \\end{array}} \\right]$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e2\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\n\u003ch3\u003eLocalization of Repetitive and Non-Repetitive Regions\u003c/h3\u003e\n\u003cp\u003eopjMap employs a refined localization strategy that uses two distinct window sizes, \u003cem\u003el\u003c/em\u003e and \u003cem\u003epl\u003c/em\u003e (where \u003cem\u003ep\u003c/em\u003e is a hyperparameter), to identify different types of anchor regions, as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ed. This process avoids the limitations of a standard sliding window by first ranking all windows on the projected line based on their anchor count. It then selects the top \u003cem\u003em\u003c/em\u003e\u003csub\u003e1\u003c/sub\u003e windows of size \u003cem\u003el\u003c/em\u003e as candidates for repeats across distinct reference regions, which correspond to either non-repetitive alignments or interspersed repeats where the duplicated segment is external to the read's mapping region in the reference. Subsequently, after removing these selected windows, the algorithm re-evaluates the remaining regions using a larger window size of \u003cem\u003epl\u003c/em\u003e. 
The top \u003cem\u003em\u003c/em\u003e\u003csub\u003e2\u003c/sub\u003e windows of this size are then chosen as candidates for repeats within a single reference region, which are characteristic of complex structures like segmental duplications. This two-stage, multi-size window selection process effectively distinguishes unique alignment regions from those associated with duplication structural variations.\u003c/p\u003e\n\u003ch3\u003eRefining the Alignment Skeleton and Detailed Alignment\u003c/h3\u003e\n\u003cp\u003eOrthogonal projection and voting are initially used to identify regions containing linear skeletons. Because these initial skeletons often contain noise and can be incomplete, we employ a dynamic programming algorithm to further process these regions and construct a refined alignment skeleton[\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e]. As shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, this algorithm can effectively: (a) remove irrelevant anchors (noise), (b) merge skeletons that span multiple windows to form a complete skeleton, and (c) construct skeletons for segmental repeats, which facilitates subsequent detailed alignment.\u003c/p\u003e\u003cp\u003eOverlapping skeleton structures can be complex, and with relatively short fragments in the read, the skeleton information is often sparse, which can compromise the alignment quality. Therefore, opjMap extracts shorter minimizers of length 9 from these regions to construct a more informative and complete skeleton for detailed alignment. After a high-quality alignment skeleton is obtained, opjMap extends it at both ends to ensure the completeness of the alignment region. 
Finally, a basic alignment algorithm is used for a detailed alignment of the non-seed regions[\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e], and the results are merged with those of the seed regions to produce a complete alignment (as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ee).\u003c/p\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\n \u003ch2\u003eOverview\u003c/h2\u003e\n \u003cp\u003eTo evaluate the performance of opjMap, we conducted a comparative analysis against three widely used long-read aligners: minimap2[\u003cspan class=\"CitationRef\"\u003e29\u003c/span\u003e], NGMLR[\u003cspan class=\"CitationRef\"\u003e30\u003c/span\u003e] and Winnowmap2[\u003cspan class=\"CitationRef\"\u003e31\u003c/span\u003e]. All alignment methods were tested on both simulated and real single-molecule sequencing datasets. The experiments were performed on a server running the Ubuntu 22.04 operating system, equipped with 189 GB of RAM and two Intel Xeon E5-2686 v4 processors (2.30 GHz, 16 cores, 32 threads each).\u003c/p\u003e\n\u003c/div\u003e\n\u003ch3\u003eSimulated Data Experiments\u003c/h3\u003e\n\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\n \u003ch2\u003eAlignment Evaluation for Non-Structural Variation Reads\u003c/h2\u003e\n \u003cp\u003eTo evaluate the alignment performance of the different tools, we used PBSIM2[\u003cspan class=\"CitationRef\"\u003e32\u003c/span\u003e] to generate simulated reads with known reference positions, which enables a precise comparison of alignment quality. We generated four sets of simulated reads with varying error rates: 10% and 15% to mimic the PacBio platform, and 20% and 30% to represent the ONT platform. The reads were generated from the chromosome 1 sequence of \u003cem\u003eH. sapiens\u003c/em\u003e. 
The commands used for generating these datasets are provided in Supplementary Table \u003cspan class=\"InternalRef\"\u003eS1\u003c/span\u003e.\u003c/p\u003e\n \u003cp\u003eGiven the inherent error rate of sequencing reads, a base is considered correctly aligned if its mapped position on the reference genome differs from its true simulated position by no more than \u003cem\u003ew\u003c/em\u003e bases (where \u003cem\u003ew\u003c/em\u003e\u0026thinsp;=\u0026thinsp;5). A read is considered correctly aligned if more than 90% of its bases are correctly mapped[\u003cspan class=\"CitationRef\"\u003e33\u003c/span\u003e]. Base-level accuracy is defined as the ratio of correctly aligned bases to the total number of aligned bases[\u003cspan class=\"CitationRef\"\u003e34\u003c/span\u003e], while base-level sensitivity is the ratio of correctly aligned bases to the total number of bases in the simulated dataset. Similarly, read-level accuracy is the ratio of correctly aligned reads to the total number of aligned reads, and read-level sensitivity is the ratio of correctly aligned reads to the total number of reads in the simulated dataset. The specific commands used for aligning with each tool are provided in Supplementary Table S2. The resulting alignment data are presented in Table\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e. The values in parentheses in the Accuracy and Sensitivity columns indicate the difference, in percentage points, relative to opjMap. For example, minimap2\u0026apos;s base-level accuracy at a 10% error rate is reported as 95.40(-0.13), meaning that it is 0.13 percentage points lower than that of opjMap.\u003c/p\u003e\n \u003cp\u003eAt a simulated error rate of 10%, opjMap demonstrated higher sensitivity in both base-level and read-level alignments compared to all other tools, with the exception of minimap2, to which it was slightly inferior. 
For error rates of 15%, 20% and 30%, opjMap exhibited higher sensitivity than the other aligners at both base and read levels, with one exception: at a 20% error rate, the base-level sensitivity of minimap2 (89.60%) was marginally higher than that of opjMap (89.58%). Although Winnowmap2 and NGMLR achieved higher base-level accuracy, the read-level accuracy of the other tools was lower than that of opjMap in every case except minimap2 at a 10% error rate (99.70% versus 99.64%). These results collectively suggest that the orthogonal projection-based opjMap offers high sensitivity under moderate to high error rate conditions, enabling it to align a greater number of bases and accurately map more reads.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\n \u003ch2\u003eAlignment Evaluation for Duplications Across Distinct Reference Regions\u003c/h2\u003e\n \u003cp\u003eWe evaluated the tools\u0026apos; ability to detect interspersed repeats located outside the read\u0026apos;s corresponding gene. We generated sequences containing repeats using a custom script, randomly selecting the strand for each fragment. Unlike PBSIM2, Badread[\u003cspan class=\"CitationRef\"\u003e35\u003c/span\u003e] can introduce sequencing errors into a short sequence, simulating its output under various error rates. Using Badread, we added sequencing errors to the fragments (see Supplementary Table S3 for specific commands) and then used a script to select simulated sequences with repetitive variations that met our criteria. Due to the random nature of the simulation, the number of reads in each dataset varied.\u003c/p\u003e\n \u003cp\u003eTo select an appropriate error rate for comparison, we first tested the sensitivity and accuracy of different methods for aligning variations in 1000 bp sequences. The results are shown in Supplementary Fig. \u003cspan class=\"InternalRef\"\u003eS1\u003c/span\u003e. 
opjMap demonstrated a significant lead in both accuracy and sensitivity under high error rates, with this gap only narrowing when the error rate\u003c/p\u003e\n \u003cdiv class=\"gridtable\"\u003e\n \u003ctable id=\"Tab1\" border=\"1\"\u003e\n \u003ccaption\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eResults of different methods on simulated dataset\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"2\" style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eError Rate\u003cbr\u003e\u003c/strong\u003e\u003cstrong\u003e(\u003c/strong\u003e\u003cstrong\u003eNumber of Reads\u003c/strong\u003e\u003cstrong\u003e)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd rowspan=\"2\" style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eAlignment Tool\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"4\" style=\"width: 41%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eBase Level\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 2%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e \u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"4\" style=\"width: 37%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eRead Level\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eNumber of Alignments(M)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eCorrect Alignments(M)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eAccuracy (%)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eSensitivity (%)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd style=\"width: 2%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e \u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eNumber of Alignments\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 8%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eCorrect Alignments\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eAccuracy (%)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eSensitivity (%)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"4\" style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e10%\u003cbr\u003e\u003c/strong\u003e\u003cstrong\u003e(\u003c/strong\u003e\u003cstrong\u003e241144\u003c/strong\u003e\u003cstrong\u003e)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eopjMap\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e2,347\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e2,242\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e95.53\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e89.94\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 2%;\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e217354\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 8%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e216563\u003c/strong\u003e\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e99.64\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e89.81\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003eminimap2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e2,350\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e2,242\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e95.40(-0.13)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e89.96(+0.02)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 2%;\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e217318\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 8%;\"\u003e\n \u003cp\u003e216660\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e99.70(-0.06)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e89.85(+0.04)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003eWinnowmap2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e2,273\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e2,231\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e98.14(+2.61)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e89.50(-0.44)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 2%;\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e215791\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 8%;\"\u003e\n \u003cp\u003e214798\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e99.54(-0.10)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e89.07(-0.74)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003engmlr\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e2,234\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e2,225\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e99.57(+4.04)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e89.25(-0.69)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 2%;\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e216443\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 8%;\"\u003e\n \u003cp\u003e214058\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e98.90(-0.74)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e88.77(-1.04)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"4\" style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e15%\u003cbr\u003e\u003c/strong\u003e\u003cstrong\u003e(\u003c/strong\u003e\u003cstrong\u003e239772\u003c/strong\u003e\u003cstrong\u003e)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eopjMap\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e2,320\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e2,232\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n 
\u003cp\u003e\u003cstrong\u003e96.22\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e89.55\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 2%;\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e215443\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 8%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e213962\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e99.31\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e89.24\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003eminimap2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e2,313\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e2,212\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e95.61(-0.61)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e88.74(-0.81)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 2%;\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e214111\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 8%;\"\u003e\n \u003cp\u003e211273\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e98.67(-0.64)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e88.11(-1.13)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003eWinnowmap2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n 
\u003cp\u003e2,112\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e2,064\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e97.72(+1.50)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e82.81(-6.74)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 2%;\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e198830\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 8%;\"\u003e\n \u003cp\u003e195513\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e98.33(-0.98)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e81.54(-7.70)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003engmlr\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e2,182\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e2,170\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e99.45(+3.23)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e87.07(-2.48)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 2%;\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e212526\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 8%;\"\u003e\n \u003cp\u003e205323\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e96.61(-2.70)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e85.63(-3.61)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"4\" style=\"width: 9%;\"\u003e\n 
\u003cp\u003e\u003cstrong\u003e20%\u003cbr\u003e\u003c/strong\u003e\u003cstrong\u003e(\u003c/strong\u003e\u003cstrong\u003e127587\u003c/strong\u003e\u003cstrong\u003e)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eopjMap\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e2,322\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e2,233\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e96.14\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e89.58\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 2%;\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e114860\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 8%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e114074\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e99.32\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e89.41\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003eminimap2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e2,326\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e2,233\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e96.03(-0.11)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n 
\u003cp\u003e89.60(+0.02)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 2%;\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e114850\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 8%;\"\u003e\n \u003cp\u003e114045\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e99.30(-0.02)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e89.39(-0.02)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003eWinnowmap2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e2,224\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e2,205\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e99.17(+3.03)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e88.48(-1.10)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 2%;\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e113582\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 8%;\"\u003e\n \u003cp\u003e111987\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e98.60(-0.72)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e87.77(-1.64)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003engmlr\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e2,221\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e2,206\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e99.35(+3.21)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e88.52(-1.06)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 
2%;\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e114446\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 8%;\"\u003e\n \u003cp\u003e111828\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e97.71(-1.61)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e87.65(-1.76)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"4\" style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e30%\u003cbr\u003e\u003c/strong\u003e\u003cstrong\u003e(\u003c/strong\u003e\u003cstrong\u003e118334\u003c/strong\u003e\u003cstrong\u003e)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eopjMap\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e2,258\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e2,203\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e97.57\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e88.40\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 2%;\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e105809\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 8%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e104154\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e98.44\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n 
\u003cp\u003e\u003cstrong\u003e88.02\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003eminimap2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e2,216\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e2,142\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e96.64(-0.93)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e85.93(-2.47)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 2%;\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e103884\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 8%;\"\u003e\n \u003cp\u003e98286\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e94.61(-3.83)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e83.06(-4.96)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003eWinnowmap2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e1,386\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e1,357\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e97.94(+0.37)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e54.46(-33.94)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 2%;\"\u003e\u0026nbsp;\u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e69471\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 8%;\"\u003e\n \u003cp\u003e57197\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e82.33(-16.11)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e48.34(-39.68)\u003c/p\u003e\n 
\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003engmlr\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e2,052\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 11%;\"\u003e\n \u003cp\u003e2,036\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e99.25(+1.68)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e81.70(-6.70)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 2%;\"\u003e\n \u003cp\u003e \u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e101117\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 8%;\"\u003e\n \u003cp\u003e90927\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e89.92(-8.52)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 9%;\"\u003e\n \u003cp\u003e76.84(-11.18)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n \u003cp\u003edropped to between 15% and 10%. We chose an error rate of 15% to test the alignment of repeats of different lengths. For this test, we set five different lengths for external repetitive variations: 100 bp, 500 bp, 1000 bp, 2500 bp, and 5000 bp.\u003c/p\u003e\n \u003cp\u003eThe average sequence length was 10,000 bp, with 3,000 sequences in each length group. For more detailed information on these two sets of reads with different error rates and lengths, please refer to Supplementary Tables S4 and S5.\u003c/p\u003e\n \u003cp\u003eDue to the presence of base-level errors in sequencing reads, the position of repeat variations within the reads is affected. The experiment determined whether a repeat variation was detected by verifying if the read\u0026apos;s corresponding variation position on the reference genome was aligned multiple times[\u003cspan class=\"CitationRef\"\u003e36\u003c/span\u003e]. 
Specifically, we defined the position and orientation of an aligned read on the reference genome as \u003cem\u003eG\u003c/em\u003e = (\u003cem\u003eG\u003c/em\u003e\u003csub\u003e\u003cem\u003est\u003c/em\u003e\u003c/sub\u003e, \u003cem\u003eG\u003c/em\u003e\u003csub\u003e\u003cem\u003eed\u003c/em\u003e\u003c/sub\u003e, \u003cem\u003eG\u003c/em\u003e\u003csub\u003e\u003cem\u003ed\u003c/em\u003e\u003c/sub\u003e), and the true position of the variation on the reference as \u003cem\u003eT\u003c/em\u003e = (\u003cem\u003eT\u003c/em\u003e\u003csub\u003e\u003cem\u003est\u003c/em\u003e\u003c/sub\u003e, \u003cem\u003eT\u003c/em\u003e\u003csub\u003e\u003cem\u003eed\u003c/em\u003e\u003c/sub\u003e, \u003cem\u003eT\u003c/em\u003e\u003csub\u003e\u003cem\u003ed\u003c/em\u003e\u003c/sub\u003e). An alignment was considered correct if the number of non-empty intersections between \u003cem\u003eG\u003c/em\u003e and \u003cem\u003eT\u003c/em\u003e was at least (\u003cem\u003en\u003c/em\u003e\u0026thinsp;+\u0026thinsp;1), where \u003cem\u003en\u003c/em\u003e is the number of repeats (here, \u003cem\u003en\u003c/em\u003e\u0026thinsp;=\u0026thinsp;1), and the alignment orientation was identical.\u003c/p\u003e\n \u003cp\u003eFigure\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e3\u003c/span\u003e illustrates the accuracy and sensitivity of this detection at different lengths (for specific commands, see Supplementary Table S6, and for numerical values, see Supplementary Table S7). As the length of the repetitive variation region increases, the accuracy and sensitivity of the alignment tools also increase. Because the simulated repeat sequences were all fragments extracted from the reference genome, and their length was around 10,000 bp, most alignment tools were able to align the entire sequence. As a result, sensitivity and accuracy were identical for many of the results. 
Throughout the length-based experiments, opjMap consistently outperformed other tools, achieving 100% sensitivity and accuracy in detecting repetitive variations when the repeat length was 5000 bp.\u003c/p\u003e\n \u003cp\u003eOverall, opjMap maintained high accuracy and sensitivity in detecting duplications across distinct reference regions, regardless of variations in length or error rate. This indicates that opjMap is capable of identifying a greater number of inter-regional repetitive variations, even under conditions of high error rates and short variation lengths.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e\n \u003ch2\u003eAlignment Evaluation for Duplications within a Single Reference Region\u003c/h2\u003e\n \u003cp\u003eThe experiments also tested the detection of repetitive regions located within the reads, with two distinct types of variations: interspersed repeats with a single duplication event and contiguous segmental duplications with multiple repeats.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e\n \u003ch2\u003eSingle Duplication Event\u003c/h2\u003e\n \u003cp\u003eWe evaluated the tools\u0026apos; detection capabilities by fixing the repeat fragment length at 1000 bp and varying the sequencing error rates. The detailed results are shown in Supplementary Fig. S2. At high sequencing error rates, opjMap maintained a high level of performance. As the error rate decreased, the alignment results of other tools began to approach those of opjMap. We then chose an error rate of 15% for the subsequent experiment, which was designed to test the alignment of repeat fragments of different lengths. For this test, we generated sequences at this error rate containing internal repeats of 100 bp, 500 bp, 1000 bp, 2500 bp, and 5000 bp. 
Detailed information on these two datasets with varying error rates and lengths can be found in Supplementary Tables S8 and S9.\u003c/p\u003e\n \u003cp\u003eFigure\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e4\u003c/span\u003e illustrates the accuracy and sensitivity at different fragment lengths, with specific numerical values available in Supplementary Table S10. opjMap showed higher alignment sensitivity and accuracy when the repeat fragments were short. As the length of the repeat fragments increased, the performance of other tools approached that of opjMap. This indicates that opjMap is suitable for detecting interspersed repeats in a wide range of scenarios.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec15\" class=\"Section2\"\u003e\n \u003ch2\u003eContiguous Segmental Duplication\u003c/h2\u003e\n \u003cp\u003eFor the comparison of segmental duplication detection, a custom script was used to generate sequence fragments of five different lengths (100 bp, 250 bp, 500 bp, 750 bp, and 1000 bp), with each length repeated 10 times. We then introduced sequencing errors at a rate of 15% using the Badread tool. Detailed read information can be found in Supplementary Table S11. Given the high number of repeats, we fixed the fragment length at 1000 bp and initially tested the sensitivity for repeat judgment thresholds (\u003cem\u003en\u003c/em\u003e) of 3, 5, 7, and 10. The results are shown in Fig.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e5\u003c/span\u003e, with specific values available in Supplementary Table S12. These results show that Winnowmap2 is not well suited for aligning segmental repeats. In contrast, opjMap maintained high sensitivity as the threshold \u003cem\u003en\u003c/em\u003e increased.\u003c/p\u003e\n \u003cp\u003eA repeat judgment threshold (\u003cem\u003en\u003c/em\u003e) of 10 was selected to evaluate the performance of alignment tools on repeats. 
The results are shown in Fig.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e6\u003c/span\u003e, with specific numerical values available in Supplementary Table S13. As the figure illustrates, opjMap surpassed the other aligners in both accuracy and sensitivity for detecting segmental duplications. opjMap achieves this by extracting shorter sub-fragments of length 9 from overlapping regions. This demonstrates that constructing the alignment skeleton with shorter fragment information can effectively enhance the detection of repeat region information.\u003c/p\u003e\n \u003cp\u003eFigure\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e7\u003c/span\u003e presents a comparison of opjMap with three other tools, visualized using the IGV alignment visualization tool. Figure\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e7\u003c/span\u003ea shows the true skeleton anchor graph for a segmental tandem repeat of 500 bp, repeated 10 times. From this, it can be seen that 6 of the repeats are in the forward direction and 4 are in the reverse direction. Figure\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e7\u003c/span\u003eb shows the alignment results for this sequence from all four tools. We can observe that opjMap successfully aligned 4 reverse-oriented and 5 forward-oriented segmental repeats. In comparison, NGMLR aligned 2 reverse-oriented and 4 forward-oriented repeats. Winnowmap2 failed to recognize this repeat region, and minimap2 produced only a small number of alignment results. These findings demonstrate that opjMap is capable of identifying a greater number of segmental repeats, yielding more comprehensive alignment results. 
This indicates that opjMap possesses a superior ability to align segmental repeat variations.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec16\" class=\"Section2\"\u003e\n \u003ch2\u003eReal Data Experiments\u003c/h2\u003e\n \u003cdiv id=\"Sec17\" class=\"Section3\"\u003e\n \u003ch2\u003eEvaluation on Datasets Without Segmental Repeats\u003c/h2\u003e\n \u003cp\u003eWe compared alignment performance on real-world sequencing data from two platforms: PacBio and ONT. The PacBio dataset, from \u003cem\u003eA. thaliana\u003c/em\u003e, contained approximately 300,000 sequences, while the ONT dataset, from \u003cem\u003eE. coli\u003c/em\u003e, contained approximately 60,000 sequences. All experiments were run using 64 threads. As shown in Table\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e, opjMap aligns a greater number of bases and reads on both the PacBio and ONT platforms while keeping the consumption of computational resources low. 
minimap2\u0026apos;s performance is close to opjMap\u0026apos;s, whereas NGMLR consumes significantly more resources.\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cdiv class=\"gridtable\"\u003e\n \u003cdiv class=\"colspec\" align=\"char\"\u003e\u0026nbsp;\u003c/div\u003e\n \u003ctable id=\"Tab2\" border=\"1\"\u003e\n \u003ccaption\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eResults of different methods on real dataset.\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eDataSet\u003c/p\u003e\n \u003cp\u003e(Read number)\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAligner\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eMapped bases\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eMapped reads\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eCPU time\u003c/p\u003e\n \u003cp\u003e(seconds)\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eWall time\u003c/p\u003e\n \u003cp\u003e(seconds)\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003ePeak Memory\u003c/p\u003e\n \u003cp\u003e(GB)\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"4\" align=\"left\"\u003e\n \u003cp\u003ePacBio\u003c/p\u003e\n \u003cp\u003e(304718)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eopjMap\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e5492704427\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e292604\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e48032\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e990\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e26.3\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eminimap2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e5456246662\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e290099\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e67325\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e1150\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e25.4\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eWinnowmap2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e5251174215\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e280013\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e80924\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e1353\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e40.1\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNGMLR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e4362237072\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e255632\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e321851\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e5134\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e39.2\u003c/p\u003e\n 
\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd rowspan=\"4\" align=\"left\"\u003e\n \u003cp\u003eONT\u003c/p\u003e\n \u003cp\u003e(62094)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003eopjMap\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e413018134\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e53917\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e691\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e15\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e\u003cstrong\u003e13.8\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eminimap2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e412818972\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e53665\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e495\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e13\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e15.5\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eWinnowmap2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e403409572\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e52908\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e1119\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e26\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e28.4\u003c/p\u003e\n 
\u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNGMLR\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e365331379\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e49943\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e19117\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e395\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e39.0\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n \u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec18\" class=\"Section2\"\u003e\n \u003ch2\u003eEvaluation on Datasets With Segmental Repeats\u003c/h2\u003e\n \u003cp\u003eTo compare the performance of different alignment tools on segmental repeat variations in real-world sequencing data, we used long-read sequencing datasets from the human genomes T2T-CHM13 and HG002[\u003cspan class=\"CitationRef\"\u003e37\u003c/span\u003e]. T2T-CHM13, considered the first complete and gapless human reference genome, serves as an ideal benchmark for evaluating and improving genomic alignment and variant calling algorithms. The HG002 dataset, on the other hand, consists of high-quality sequencing data from a real human sample. As existing structural variation benchmark sets lack sufficient information on segmental repetitive variations, we programmatically inserted 2,300 segmental repeat sequences into the T2T-CHM13 reference genome at regions corresponding to the original reads. The length distribution is shown in Supplementary Fig. 
S3.\u0026nbsp;\u003c/p\u003e\n \u003cdiv class=\"gridtable\"\u003e\n \u003ctable id=\"Tab3\" border=\"1\"\u003e\n \u003ccaption\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eComparison of Mappers for Segmental Repeat Detection on a Reference Genome\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAligner\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eopjMap\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eminimap2\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eNGMLR\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eWinnowmap2\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTotal\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003e2300\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e2297\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1705\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e2140\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCorrect\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003e1893\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1878\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e44\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1450\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAcc (%)\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003e82.3%\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e81.76%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e2.58%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e67.46%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSen (%)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003e82.3%\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e81.65%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1.91%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e63.04%\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n \u003cp\u003eAs shown in Table\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e3\u003c/span\u003e, opjMap achieved an accuracy and a sensitivity of 82.3% each, outperforming all other alignment tools. minimap2 followed closely behind (81.76% accuracy, 81.65% sensitivity), while Winnowmap2 (67.46%, 63.04%) and especially NGMLR (2.58%, 1.91%) performed poorly in aligning segmental repetitive variations. Notably, segmental repeats occurring within the reference genome are more challenging to detect than those in the reads. Due to its orthogonal projection-based approach, opjMap exhibits higher sensitivity when dealing with a reference genome containing segmental repeats, allowing it to identify a greater number of variations.\u003c/p\u003e\n\u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eAlignment of repetitive structural variations in long reads with high error rates presents a significant challenge. 
When aligning such reads to a reference genome, the high error rate often leads to overlapping alignment skeletons, which many existing tools struggle to handle effectively. To overcome this issue, we propose opjMap, an alignment tool based on orthogonal projection. opjMap projects the linear alignment skeleton onto a straight line, enabling highly sensitive localization of the skeleton. This method allows opjMap to identify a greater number of reads on the reference genome. After locating the skeleton, opjMap extracts shorter minimizers from the repetitive regions to gather more detailed alignment information, thereby aligning a greater number of bases and improving overall alignment quality.\u003c/p\u003e\u003cp\u003eopjMap achieves high localization sensitivity while maintaining a low computational complexity. Unlike dynamic programming algorithms, which perform scoring and backtracking on window anchors to select collinear seeds\u0026mdash;with an optimized time complexity approaching \u003cem\u003eO\u003c/em\u003e(\u003cem\u003en\u003c/em\u003elog\u003cem\u003en\u003c/em\u003e), where \u003cem\u003en\u003c/em\u003e is the number of anchors\u0026mdash;opjMap's approach is more efficient. Because the number of windows is significantly smaller than the number of anchors, our method primarily focuses on projecting and counting each anchor, resulting in a time complexity closer to \u003cem\u003eO\u003c/em\u003e(\u003cem\u003en\u003c/em\u003e). After the projection and voting step, opjMap utilizes radix sort to count the anchors within each window, selecting windows with a high vote count as alignment candidates.\u003c/p\u003e\u003cp\u003eHowever, due to sequencing errors, two linear alignment skeletons within a read can become misaligned, which might lead to them being incorrectly projected into separate windows, thereby reducing read alignment sensitivity. 
To mitigate this issue, opjMap's projection process strategically increases the window length to place these misaligned skeletons within a single window. While this approach enhances read detection sensitivity, it can make it challenging to identify the specific structural variation information within the window, thus lowering the sensitivity for detecting internal variations. In future work, we plan to develop targeted processing methods for the alignment skeletons within these voted windows to further improve the sensitivity of structural variation alignment.\u003c/p\u003e"},{"header":"Conclusions","content":"\u003cp\u003eIn this work, we propose a novel orthogonal projection-based voting localization method. This approach effectively avoids introducing excessive noise during the candidate region selection process, thereby satisfying the requirement for selecting collinear seeds. The method significantly reduces computational time complexity, and its use of orthogonal projection effectively filters out noise, which is beneficial for subsequent skeleton construction and detailed alignment. Experimental results demonstrate that our method can align a greater number of reads and bases under moderate-to-high sequencing error rates. 
Furthermore, it is also capable of aligning a higher number of repetitive variations, confirming its robustness and effectiveness.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cdiv class=\"DefinitionList\"\u003e\u003cdiv class=\"DefinitionListEntry\"\u003e\u003cdiv class=\"Term\"\u003eSMS\u003c/div\u003e\u003cdiv class=\"Description\"\u003e\u003cp\u003esingle-molecule sequencing\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv class=\"DefinitionListEntry\"\u003e\u003cdiv class=\"Term\"\u003eSMRT\u003c/div\u003e\u003cdiv class=\"Description\"\u003e\u003cp\u003eSingle Molecule Real-Time\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv class=\"DefinitionListEntry\"\u003e\u003cdiv class=\"Term\"\u003eONT\u003c/div\u003e\u003cdiv class=\"Description\"\u003e\u003cp\u003eOxford Nanopore Technologies\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv class=\"DefinitionListEntry\"\u003e\u003cdiv class=\"Term\"\u003eSVs\u003c/div\u003e\u003cdiv class=\"Description\"\u003e\u003cp\u003eStructural variations\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv class=\"DefinitionListEntry\"\u003e\u003cdiv class=\"Term\"\u003eRHT\u003c/div\u003e\u003cdiv class=\"Description\"\u003e\u003cp\u003eRegional hash table\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv class=\"DefinitionListEntry\"\u003e\u003cdiv class=\"Term\"\u003eFM-index\u003c/div\u003e\u003cdiv class=\"Description\"\u003e\u003cp\u003eFull-text Minute-space index\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003c/div\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for publication\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of data and material\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAll data in this paper is 
available in the supplementary file or from the corresponding author upon reasonable request.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis work was supported in part by the Scientific Research General Project of Wuhan Technology And Business University under Grant A2025044 and by the Special Fund of Advantageous and Characteristic Disciplines (Group) of Hubei Province.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of data and materials\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe datasets used in this study, along with the corresponding reference genomes, are publicly available from the NCBI and EBI repositories.\u003c/p\u003e\n\u003cp\u003eReal Datasets: Raw reads from \u003cem\u003eEscherichia coli\u003c/em\u003e (ONT platform), \u003cem\u003eArabidopsis thaliana\u003c/em\u003e (PacBio platform), and \u003cem\u003eHomo sapiens\u003c/em\u003e (PacBio platform) were obtained from the following sources:\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eE. coli\u003c/em\u003e:\u0026nbsp;https://www.ncbi.nlm.nih.gov/sra/?term=SRR34757056%2F\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eA. thaliana\u003c/em\u003e:\u0026nbsp;https://www.ncbi.nlm.nih.gov/sra/?term=ERR15092965\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eH. sapiens\u003c/em\u003e:\u0026nbsp;https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/PacBio_CCS_15kb_20kb_chemistry2/reads/\u003c/p\u003e\n\u003cp\u003eReference Genomes: The reference genomes for \u003cem\u003eE. coli\u003c/em\u003e, \u003cem\u003eA. thaliana\u003c/em\u003e, and \u003cem\u003eH. sapiens\u003c/em\u003e can be accessed through these links:\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eE. coli\u003c/em\u003e:\u0026nbsp;https://www.ebi.ac.uk/ena/browser/view/ERX987748\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eA. 
thaliana\u003c/em\u003e:\u0026nbsp;https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001735.4/\u003c/p\u003e\n\u003cp\u003e\u003cem\u003eH. sapiens\u003c/em\u003e:\u0026nbsp;https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.40/\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors’ contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eXing-Guo Fan developed the programming of the alignment tool and drafted the manuscript. Xiao-Dan Zhang carried out the revision of the manuscript. Cheng-Song Hu conducted the analysis of the experimental results. Jie-Jie Zeng and Shu-Rui Li executed the testing of the tool. Ze-Gang Wei provided the reference genome, reads, and computational infrastructure. All authors contributed to the conception and design of the study, discussed the results, and read, edited, and approved the final manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eBeran P, et al. KEC: unique sequence search by k-mer exclusion. Bioinf (Oxford England). 2021;37(19):btab196.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eCharalampous T, et al. Nanopore metagenomics enables rapid clinical diagnosis of bacterial lower respiratory infection. Nat Biotechnol. 2019;37(7):783\u0026ndash;92.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWei Z-G, Zhang S-W. DMclust, a density-based modularity method for accurate OTU picking of 16S rRNA sequences. Mol Inf. 2017;36(12):1600059.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWei Z-G, Zhang S-W. MtHc: a motif-based hierarchical method for clustering massive 16S rRNA sequences into OTUs. Mol BioSyst. 2015;11(7):1907\u0026ndash;13.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSmith AD, Xuan Z, Zhang MQ. 
Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics. 2008;9(1):128.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHedges DJ, et al. Evidence of novel fine-scale structural variation at autism spectrum disorder candidate loci. Mol autism. 2012;3:1\u0026ndash;11.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePan B, et al. Similarities and differences between variants called with human reference genome HG19 or HG38. BMC Bioinformatics. 2019;20:17\u0026ndash;29.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFerragina P, Manzini G. Opportunistic data structures with applications. In: Symposium on Foundations of Computer Science; 2000.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLi H, Homer N. A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform. 2010;11(5):473.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhang H, et al. Fast and efficient short read mapping based on a succinct hash index. BMC Bioinformatics. 2018;19(1):92.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKaur H, Chand L. Biological sequence alignment using varied optimization algorithms. International Conference on Inventive Computation Technologies. Berlin: Springer; 2016. pp. 1\u0026ndash;5.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eXu X et al. SLPal: Accelerating long sequence alignment on many-core and multi-core architectures. 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2020: pp. 2242\u0026ndash;2249.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePoerba YS, Martanti D. Genetic variability of Amorphophallus muelleri Blume in Java based on random amplified polymorphic DNA. Biodiversitas J Biol Divers, 2008. 9(4).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSavage DG, et al. 
Clinical features at diagnosis in 430 patients with chronic myeloid leukaemia seen at a referral centre over a 16-year period. Br J Haematol. 1997;96(1):111\u0026ndash;6.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eIvan S, et al. Fast and sensitive mapping of nanopore sequencing reads with GraphMap. Nat Commun. 2016;7:11307.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLi H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34(18):3094\u0026ndash;100.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSedlazeck FJ et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods, 2018. 15(6).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWei ZG, et al. kngMap: sensitive and fast mapping algorithm for noisy long reads based on the K-Mer neighborhood graph. Front Genet. 2022;13:890651.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLiu B, et al. rHAT: fast alignment of noisy long reads with regional hashing. Bioinformatics. 2016;32(11):1625\u0026ndash;31.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHaghshenas E, et al. lordFAST: sensitive and fast alignment search tool for long noisy read sequencing data. Bioinformatics. 2019;35(1):20\u0026ndash;7.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLippert RA. Space-efficient whole genome comparisons with Burrows\u0026ndash;Wheeler transforms. J Comput Biol. 2005;12(4):407\u0026ndash;15.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eTakahashi KK, Innan H. Duplication with structural modification through extrachromosomal circular and lariat DNA in the human genome. Sci Rep. 2020;10(1):7150.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRasko DA, et al. Origins of the E. coli strain causing an outbreak of hemolytic\u0026ndash;uremic syndrome in Germany. N Engl J Med. 
2011;365(8):709\u0026ndash;17.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMurray IA, et al. The methylomes of six bacteria. Nucleic Acids Res. 2012;40(22):11450\u0026ndash;62.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eNing Z, et al. SSAHA: a fast search method for large DNA databases. Genome Res. 2001;11(10):1725\u0026ndash;9.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRoberts M, et al. Reducing storage requirements for biological sequence comparison. Bioinformatics. 2004;20:3363\u0026ndash;9.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLiu B, et al. deBGA: read alignment with de Bruijn graph-based seed and extension. Bioinformatics. 2016;32(21):3224\u0026ndash;32.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWei ZG, et al. invMap: a sensitive mapping tool for long noisy reads with inversion structural variants. Bioinformatics. 2023;39(12):btad726.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLi H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34(18):3094\u0026ndash;100.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSedlazeck FJ et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods, 2018. 15(6).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eJain C, Rhie A, Hansen NF, et al. Long-read mapping to repetitive reference sequences using Winnowmap2. Nat Methods. 2022;19:705\u0026ndash;10.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eOno Y, Asai K, Hamada M. PBSIM2: a simulator for long-read sequencers with a novel generative model of quality scores. Bioinformatics. 2020.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWei Z-G, Zhang S-W. NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model. BMC Bioinformatics. 
2018;19(1):177.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWei Z-G, Zhang S-W, Liu F. smsMap: mapping single molecule sequencing reads by locating the alignment starting positions. BMC Bioinformatics. 2020;21(1):341.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWick RR. Badread: simulation of error-prone long reads. J Open Source Softw. 2019;4(36):1316.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWei ZG, et al. invMap: a sensitive mapping tool for long noisy reads with inversion structural variants. Bioinformatics. 2023;39(12):btad726.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMitchell R, Vollger, et al. Segmental duplications and their variation in a complete human genome. Science. 2022;376:eabj6965.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"bmc-bioinformatics","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"binf","sideBox":"Learn more about [BMC Bioinformatics](http://bmcbioinformatics.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/binf","title":"BMC Bioinformatics","twitterHandle":"@BMC_Bioinformatics","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"High error rate, long-read alignment, orthogonal projection, repetitive variations, segmental duplication","lastPublishedDoi":"10.21203/rs.3.rs-7929852/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7929852/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003e\u003cb\u003eBackground\u003c/b\u003e\u003c/p\u003e\u003cp\u003eThe continuous advancements in single-molecule sequencing (SMS) technologies, including PacBio Single Molecule Real-Time and Oxford Nanopore Technologies (ONT), have led to a significant increase in read lengths. This has unlocked tremendous potential for a wide range of cutting-edge genomic applications. However, these long reads suffer from higher sequencing error rates and contain repetitive segments, making it challenging for most existing alignment tools to effectively map these repetitive regions. 
Given the crucial role that repetitive variations play in biological evolution, we introduce opjMap, an alignment tool based on orthogonal projection localization, which is specifically designed to align long, noisy SMS reads to a reference sequence while also accommodating repetitive structural variations (SVs).\u003c/p\u003e\u003cp\u003e\u003cb\u003eResults\u003c/b\u003e\u003c/p\u003e\u003cp\u003eThrough exhaustive benchmark experiments on both simulated and real SMS datasets, we demonstrate that opjMap exhibits higher sensitivity compared to other mainstream alignment tools like minimap2, NGMLR, and Winnowmap2, enabling it to align more reads and bases to the reference genome. Furthermore, opjMap produces a greater number of alignment results under challenging conditions of high error rates and short repetitive segments.\u003c/p\u003e\u003cp\u003e\u003cb\u003eConclusions\u003c/b\u003e\u003c/p\u003e\u003cp\u003eopjMap provides a robust and highly sensitive solution for mapping noisy long reads containing repetitive structural variations. opjMap supports multi-threaded alignment. 
The source code is publicly available for download at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/FanXingGuo/opjMap\u003c/span\u003e\u003cspan address=\"https://github.com/FanXingGuo/opjMap\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e","manuscriptTitle":"opjMap: A Sensitive Mapper for Repetitive Structural Variations in Long Noisy Reads Based on Orthogonal Projection","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-11-26 09:19:42","doi":"10.21203/rs.3.rs-7929852/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2026-01-05T11:11:36+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-12-27T15:23:14+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-12-20T23:31:38+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"14008569880530467223666365146885708969","date":"2025-11-26T16:35:14+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"4826591226951823462676112431879219944","date":"2025-11-18T11:38:06+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2025-11-17T12:32:54+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2025-11-06T13:05:46+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2025-11-06T11:30:29+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2025-11-06T05:00:00+00:00","index":"","fulltext":""},{"type":"submitted","content":"BMC Bioinformatics","date":"2025-11-06T04:56:44+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"bmc-bioinformatics","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"binf","sideBox":"Learn more about [BMC Bioinformatics](http://bmcbioinformatics.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/binf","title":"BMC Bioinformatics","twitterHandle":"@BMC_Bioinformatics","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"1c74ed1b-8ae5-400b-a6d3-f4618fd291ed","owner":[],"postedDate":"November 26th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"in-revision","subjectAreas":[],"tags":[],"updatedAt":"2026-01-05T11:23:44+00:00","versionOfRecord":[],"versionCreatedAt":"2025-11-26 09:19:42","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7929852","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7929852","identity":"rs-7929852","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}