{"paper_id":"8658db99-392b-4d64-b420-8041f5b8170a","body_text":"Efficient Full-Length RNA Isoform Reconstruction with ISAtools | Research Square Zhuo-Xing Shi, Hu Chen, Qi Dai This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-7019918/v1 This work is licensed under a CC BY 4.0 License Status: Posted Version 1 posted Abstract Accurate identification and quantification of full-length RNA isoforms remain challenging in long-read RNA sequencing due to sequencing errors, complex splicing, and incomplete annotations. We present ISAtools, a sequencing data-driven framework that leverages weakly supervised static references to reconstruct and quantify full-length isoforms, including their splicing structures and transcript boundaries. 
Benchmarking on simulated, SIRV, and biological datasets shows that ISAtools achieves high accuracy across varying sequencing depths, annotation completeness, and transcriptomic complexity, while maintaining fast runtime and low memory usage. These results demonstrate that ISAtools enables efficient and accurate identification and quantification of full-length RNA isoforms from high-throughput long-read RNA sequencing. Introduction Long-read RNA sequencing enables the direct capture of full-length cDNA sequences, offering a powerful platform for transcript isoform analysis across diverse biological contexts 1 – 5 . Compared to short-read sequencing, long-read platforms such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) generate reads that span entire transcripts, facilitating isoform reconstruction with fewer computational assumptions 6 – 8 . Despite these advantages, accurate isoform identification remains challenging due to cDNA fragmentation, sequencing errors and artifacts introduced during library construction 6 , 9 . Benchmark studies, including those from the Long-read RNA-seq Genome Annotation Assessment Project (LRGASP), have shown that PacBio outperforms ONT in isoform reconstruction accuracy, largely due to higher sequencing fidelity 1 , 10 – 15 . The introduction of the PacBio Revio system and Super Accuracy (SPRQ) chemistry further improves throughput and base-call accuracy, making it feasible to generate high-depth long-read transcriptomes at scale 16 . These advances, however, demand correspondingly robust and scalable computational frameworks to fully realize the potential of long-read technologies. Multiple tools have been developed for isoform reconstruction from long-read data, each using distinct algorithmic principles. 
For example, IsoQuant constructs intron graphs from spliced alignments and corrects alignment errors to identify novel isoforms, particularly in high-error ONT datasets 17 . Bambu infers junction chains, models novel discovery rates to build context-specific references, and uses an expectation-maximization framework to quantify isoform expression 18 . Mandalorion clusters reads based on junction chains and applies consensus generation and filtering to reconstruct high-confidence isoforms 19 . Despite these efforts, several challenges remain. Existing tools often depend heavily on static annotations or assume complete references, limiting their robustness across diverse transcriptomes. Moreover, accurate identification of transcription start and end sites (TSS and TES), a critical step for defining truly full-length isoforms, remains underexplored 1 . This issue is especially pressing in high-depth datasets, where scalability and boundary resolution are essential. To overcome these challenges, we developed ISAtools (Isoform Sequencing Analysis tools), a data-driven computational framework for reconstructing and quantifying full-length RNA isoforms from high-throughput long-read RNA-seq data. ISAtools introduces the Splice Site Chain (SSC), a unified text-based representation that jointly encodes transcript annotations and read alignments. This abstraction streamlines data preprocessing and enables efficient error correction targeting high-impact misclassifications, including incomplete splice match (ISM), novel not in catalog (NNC), and novel in catalog (NIC) isoforms. To improve transcript boundary resolution, ISAtools incorporates a coupled transcript boundary selection model, which clusters genomic coordinates of read termini to accurately define TSS and TES 20 . 
This approach enhances isoform completeness by directly leveraging read-end distributions, bypassing the limitations of annotation-dependent models. By integrating SSC structure with TSS and TES localization, ISAtools enables comprehensive and annotation-flexible isoform reconstruction. To benchmark ISAtools, we established a multi-mode evaluation system and performed systematic benchmarking on simulated datasets, SIRV spike-ins, and real sequencing data. We designed three annotation scenarios, Full Annotation Reference (FullAnno), Reduced Reference (ReduceAnno), and No Annotation (NoAnno), to comprehensively assess performance under varying levels of annotation support and sequencing complexity. Evaluation results indicate that ISAtools, powered by its data-driven architecture, demonstrates superior robustness, particularly in novel isoform discovery, boundary detection, and runtime performance. These results establish ISAtools as a robust and scalable solution for full-length transcriptome analysis in long-read RNA sequencing, with broad utility for isoform-level transcriptomics. Results Study design and ISAtools framework To benchmark ISAtools against leading long-read RNA-seq analysis tools, we established a comprehensive evaluation framework encompassing dataset construction, metric definition, and performance profiling across multiple analytical dimensions. We assembled datasets from both simulated and real samples. The simulated dataset included human and mouse transcriptomes across two representative tissues (brain and heart), each with two biological replicates. Subsampling at four sequencing depths (3M, 6M, 12M, and 20M reads) yielded 32 distinct datasets. Real-world evaluation used the publicly available PacBio Universal Human Reference RNA (UHRR) dataset, along with a synthetic SIRV (Spike-In RNA Variant) dataset serving as a complex ground truth reference (Fig. 1 a). Performance evaluation was organized into three layers (Fig. 
1 b): (1) Transcript structure analysis, based on precision, recall, and F1-score for isoform reconstruction, novel transcript detection, and TSS/TES identification; (2) Isoform quantification, assessed using Pearson and Spearman correlations, Concordance Correlation Coefficient (CCC), and normalized root mean square error (NRMSE); and (3) Computational efficiency, evaluated by runtime and memory usage under identical hardware conditions. This multi-dimensional design supports robust and reproducible comparisons across tools. SSC classification based on SQANTI 9 , 21 revealed that the majority of erroneous events in simulated, SIRV, and real datasets were ISM, NIC, or NNC types (Supplementary Fig. 1). Guided by this, we developed ISAtools, a modular pipeline optimized for reconstructing and quantifying full-length isoforms. The framework comprises six core modules (Fig. 1 c, See Methods for details): (1) SSC extraction, (2) SSC preprocessing, (3) SSC polishing, (4) SSC filtering, (5) TSS/TES detection, and (6) isoform quantification. ISAtools supports both annotation-guided and de novo modes, scales to sequencing depths from 1M to > 100M reads, and accepts BAM/SAM inputs. The SSC analysis pipeline follows a cascaded, multi-stage architecture. The extraction module parses CIGAR strings to define splice junctions and donor-acceptor motifs, outputting both read-level and unique SSC-level records. The preprocessing module filters SSCs by read identity, coverage, and support. The polishing module performs gene-level aggregation, refines junctions using consensus-based correction, and removes fragmented or low-confidence structures. The filtering module targets ISM, NIC, and NNC artifacts to preserve high-confidence isoforms. When reference annotations are provided, a dedicated Rescue & Filtering module reintroduces conserved low-abundance SSCs, enhancing isoform completeness. 
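As an illustration of the extraction step described above, the following minimal sketch derives a splice site chain and the alignment coverage from a single CIGAR string. It is not the ISAtools implementation: the function names, the 0-based half-open coordinate convention, and the tuple output are assumptions made for this example, and donor-acceptor motif lookup is omitted because it requires the reference genome.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile('([0-9]+)([MIDNSHP=X])')

def extract_ssc(ref_start, cigar):
    # ref_start: 0-based leftmost reference position of the alignment (assumed convention).
    # Returns the ordered intron boundaries (start, end, start, end, ...) as a tuple.
    pos = ref_start
    sites = []
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == 'N':
            # A reference skip marks an intron: record both splice sites.
            sites.append(pos)
            sites.append(pos + n)
            pos += n
        elif op in 'MD=X':
            # These operations consume reference bases without splitting the read.
            pos += n
        # I, S, H and P do not advance the reference position.
    return tuple(sites)

def coverage(cigar):
    # Coverage = (L - (N_S + N_H)) / L, with L the full read length including clipped bases.
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    clipped = sum(n for n, op in ops if op in 'SH')
    read_len = sum(n for n, op in ops if op in 'MIS=XH')
    return (read_len - clipped) / read_len

example = '10S50M200N40M5I30M1000N60M'
print(extract_ssc(100, example))
print(round(coverage(example), 4))
```

Reads that yield the same tuple share an SSC, so grouping reads by this key gives the unique SSC-level records mentioned above.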
Benchmarking SSC detection under diverse annotation conditions To assess SSC detection, we benchmarked ISAtools against IsoQuant, Bambu, and Mandalorion across 32 simulated datasets and one SIRV dataset (Fig. 1 a). To emulate incomplete annotations, we generated “Reduced Annotation” versions by randomly removing 20% or 40% of transcripts while ensuring one transcript per gene remained, yielding four evaluation scenarios: NoAnno, ReduceAnno20, ReduceAnno40, and FullAnno. A total of 512 simulated and 16 SIRV test cases were analyzed. ISAtools consistently achieved high SSC precision and recall across depths and annotation regimes. Performance gains were most evident under NoAnno conditions, where the absence of reference annotations caused all tools to decline, though ISAtools remained the most stable (Fig. 2 a; Supplementary Fig. 2). Human transcriptomes exhibited greater structural complexity than mouse datasets (Supplementary Tables 1–2), amplifying performance differentials (Fig. 2 a; Supplementary Fig. 2). Notably, Mandalorion showed the largest drop under NoAnno at low depth (3M reads), indicating high sensitivity to coverage. To evaluate read-level support, we examined the fraction of SSCs fully backed by aligned reads. ISAtools and Mandalorion both exceeded 99.5% support across conditions (Fig. 2 b), whereas IsoQuant and Bambu exhibited slightly lower rates (96.0-99.3% and 98.0-99.2%, respectively). We next assessed novel isoform recovery by comparing SSC detection in the ReduceAnno20 and ReduceAnno40 conditions. ISAtools outperformed all tools across depths and annotation incompleteness, particularly in human datasets (Fig. 2 c; Supplementary Fig. 3). F1-scores ranged from 0.650–0.872 for ISAtools, surpassing IsoQuant (0.638–0.840), Mandalorion (0.580–0.869), and Bambu (0.558–0.733). 
In the complex SIRV dataset (Supplementary Table 3), ISAtools matched IsoQuant and Bambu under FullAnno but showed marked superiority under ReduceAnno and NoAnno conditions (Fig. 2 d), highlighting its robustness in unannotated contexts. Benchmarking TSS/TES identification performance Although long-read sequencing has improved splice junction detection, accurate identification of TSS/TES remains less well explored. Recent findings suggest TSS-TES pairing is globally coordinated and cell-type dependent 20 , underscoring the need for flexible, data-driven models over static annotations. To address this, ISAtools employs a DBSCAN-based clustering module to jointly infer 5' and 3' transcript ends from read termini (Fig. 3 a). Using the SSC-anchored reads as input, we benchmarked ISAtools against IsoQuant, Bambu, and Mandalorion, generating 512 simulated and 16 SIRV evaluations. ISAtools consistently delivered high F1-scores across annotation conditions (FullAnno, ReduceAnno, NoAnno), ranging from 0.652 to 0.979 (Fig. 3 b; Supplementary Fig. 4). In mouse datasets, F1-scores ranged from 0.849 to 0.979, while human datasets yielded an average F1-score of 0.914, despite increased transcriptomic complexity. Mandalorion exhibited moderate performance (F1 = 0.563–0.928), with notable declines under low-depth and high-complexity conditions. By comparison, IsoQuant and Bambu showed stronger annotation dependence, with broader performance ranges (F1 = 0.597–0.987 and 0.577–0.995, respectively). We further evaluated the proportion of TSS/TES predictions supported by read termini (± 50 bp). ISAtools and Mandalorion both exceeded 99.5% concordance across conditions (Fig. 3 c), outperforming IsoQuant (92.9–99.1%) and Bambu (90.3–99.1%). To test annotation dependency, we extended reference TSS/TES by ± 100 bp and re-evaluated all tools. ISAtools and Mandalorion maintained stable performance, whereas IsoQuant and Bambu declined (Fig. 
3 d), indicating potential overfitting to reference boundaries. This trend was reinforced in the SIRV dataset, where ISAtools consistently led in boundary accuracy (Fig. 3 e). Benchmarking Isoform Quantification Performance across Diverse Annotation Contexts Using the 512 simulated test cases generated from SSC-identifiable reads, we systematically evaluated transcript quantification accuracy across four tools: ISAtools, IsoQuant, Bambu, and Mandalorion. We employed multiple metrics, including Pearson correlation, Spearman correlation, Concordance Correlation Coefficient (CCC), and Normalized Root Mean Square Error (NRMSE), to assess both the accuracy and robustness of quantification across varying transcriptomic complexities and annotation regimes. As shown in Fig. 4 a-b, ISAtools consistently achieved high quantification accuracy across all conditions. CCC values ranged from 0.986 to 0.994, while NRMSE remained low (0.041–0.085), indicating minimal estimation bias and high consistency. IsoQuant performed comparably under both full- and no-annotation conditions (CCC: 0.985–0.995; NRMSE: 0.045–0.084), but exhibited a slight performance drop under reduced annotations (CCC: 0.985–0.993; NRMSE: 0.053–0.095), although still outperforming Mandalorion and Bambu in most cases. Mandalorion displayed strong trend-tracking ability (Pearson = 0.991–0.996) but slightly lower CCC values (0.979–0.991), suggesting reduced concordance in absolute expression estimates. Bambu showed moderate performance under full and no annotation (CCC: 0.985–0.995; NRMSE: 0.039–0.091), but quantification degraded under reduced annotation (CCC: 0.971–0.984; NRMSE: 0.065–0.153), indicating greater annotation dependence. In correlation analysis, Mandalorion achieved the highest Pearson and Spearman correlations (Pearson: 0.991–0.996; Spearman: 0.979–0.991), followed closely by ISAtools (Pearson: 0.988–0.995; Spearman: 0.970–0.993) and IsoQuant (Pearson: 0.985–0.995; Spearman: 0.977–0.995). 
Bambu showed greater variability (Pearson: 0.971–0.995; Spearman: 0.962–0.993), especially under incomplete annotations. Isoform reconstruction performance on real biological samples To assess ISAtools under realistic conditions, we benchmarked performance on the PacBio UHRR dataset, which includes six technical replicates sequenced using the Kinnex full-length protocol and Revio SPRQ chemistry, producing ~ 80 million high-quality cDNA reads. Given the absence of ground-truth isoforms, we evaluated: (1) concordance with GENCODE annotations, (2) empirical read support, and (3) boundary-level accuracy. First, we performed comparative analysis of isoforms detected by each tool against GENCODE annotation using SQANTI3 21 . Under the NoAnno condition, ISAtools detected a mean of 69,504 isoforms per sample, with 46.8% classified as FSM (full splice match). Mandalorion recovered a similar number (70,359), with a higher FSM proportion (53.3%). IsoQuant detected fewer isoforms (60,200) and a lower FSM rate (37.7%), while Bambu produced the fewest isoforms (47,364) with an FSM rate of 48.7% (Fig. 5 a; Supplementary Fig. 5). Under the FullAnno condition, ISAtools identified 99,537 isoforms, with FSM rising to 67.3%. Mandalorion reported 73,251 isoforms (FSM: 53.8%), and IsoQuant 75,766 (FSM: 51.8%). Bambu, though detecting 88,961 isoforms, showed extreme annotation dependence, with 99.6% classified as FSM. We next evaluated empirical support using two criteria:(1) SSC support, defined as having at least one read perfectly matching the isoform's SSC, and (2) SSC + ends (SSC + E) support, requiring in addition that read alignment start and end sites be within 50 nt of the isoform's transcript boundaries. ISAtools and Mandalorion maintained > 99.8% and > 97.5% SSC + E support, respectively. IsoQuant showed 80.6–84.3% SSC support, but only 28.4–46.6% SSC + E support. 
Bambu's SSC + E support varied widely (28.7–76.0%), with a paradoxical drop upon adding reference annotation, indicating potential boundary overfitting (Fig. 5 b). We also measured SSC consistency between NoAnno and FullAnno conditions. Mandalorion showed the highest cross-annotation consistency (~ 90%). ISAtools and IsoQuant followed (61% and 57%), while Bambu showed only 19% overlap (Fig. 5 c; Supplementary Fig. 6). Further analysis revealed distinct detection patterns (Fig. 5 d; Supplementary Fig. 7). ISAtools showed balanced annotation-specific sensitivity, with 4.6% and 34.3% of unique SSCs exclusive to NoAnno and FullAnno, respectively, reflecting adaptability to both novel and reference-supported isoforms. IsoQuant followed a similar trend, though skewed more toward annotation reliance (12.4% NoAnno-exclusive; 30.6% FullAnno-exclusive). Bambu exhibited the highest annotation sensitivity, with 22.6% and 58% of unique SSCs identified exclusively under NoAnno and FullAnno, respectively. Mandalorion, consistent with its conservative detection strategy, showed the lowest proportion of annotation-exclusive SSCs (2.6% NoAnno-exclusive; 7.1% FullAnno-exclusive), indicating minimal dependence on reference annotations but also reduced sensitivity to novel isoforms. For TSS/TES validation, we compared predicted sites against known sites from the Reference Transcription Starting Sites 22 (refTSS, v3.1) database and the polyASite 23 database (v2.0), using a matching threshold of 50 nucleotides. ISAtools outperformed all tools under NoAnno, with TSS and TES match rates of 82.3% and 80.3%, respectively (Fig. 5 e). Mandalorion followed closely (81.1%, 80.2%), while Bambu (62.5%, 69.7%) and IsoQuant (49.1%, 54.5%) lagged behind. Under FullAnno, ISAtools' accuracy slightly declined (TSS: 76.1%, TES: 78.0%), likely due to rescuing low-abundance FSM transcripts. 
Mandalorion (82.1%, 80.7%) remained stable, while IsoQuant (65.5%, 64.9%) and Bambu (67.1%, 63.3%) retained lower matching rates (Fig. 5 e). Analysis of SSCs with multiple TSS/TES revealed that ISAtools better resolved complex boundaries, outperforming Mandalorion in TSS precision among multi-terminal isoforms (Supplementary Fig. 8). Comparison of Computational Efficiency We compared the runtime and memory usage of ISAtools, IsoQuant, Bambu, and Mandalorion across all simulated datasets. Since Mandalorion operates on unaligned reads, only its PDFQ module was evaluated for fair comparison. As shown in Fig. 6 a-b, ISAtools consistently achieved the fastest runtime and lowest memory usage across all conditions. Bambu generally outperformed IsoQuant, though IsoQuant showed a notable runtime increase under FullAnno due to annotation database construction overhead. Mandalorion incurred the highest computational cost, with the longest runtimes and largest memory footprint, particularly when analyzing human datasets under NoAnno conditions. In real-data analysis of the UHRR dataset (~ 80 million reads), ISAtools maintained superior performance, completing isoform reconstruction in ~ 17 minutes under NoAnno and ~ 21 minutes under FullAnno, with peak memory usage of 5.6 GB and 6.2 GB, respectively (Fig. 6 c-d). We also evaluated disk I/O overhead (Supplementary Fig. 9). Bambu showed the lowest read/write demands, whereas Mandalorion’s disk usage was ~ 30-fold higher than other tools (Supplementary Fig. 10), posing limitations for large-scale applications. Discussion Long-read RNA sequencing has enabled direct observation of full-length transcript isoforms, offering unprecedented resolution in transcriptome analysis 6 , 24 . However, its full potential is limited by sequencing errors, structural misannotations, poor transcript boundary resolution, and computational inefficiencies when processing high-depth or annotation-limited datasets 4 , 9 . 
These challenges particularly hinder accurate isoform reconstruction in complex transcriptomes and non-model systems. To address these limitations, we developed ISAtools, a data-driven and annotation-flexible computational framework designed to reconstruct and quantify full-length RNA isoforms with high accuracy, biological credibility, and computational efficiency (Fig. 7 ). ISAtools introduces a compact and expressive representation of transcript structures through SSC, unifying raw read alignments and reference annotations into a standardized text-based format. This SSC framework supports highly scalable processing and enables annotation-agnostic isoform reconstruction. By incorporating modular correction algorithms tailored to systematic isoform classification errors—namely ISM, NIC, and NNC—ISAtools significantly improves structural precision while retaining sensitivity to novel isoforms. A key innovation of ISAtools is its TSS/TES detection strategy based on density-based clustering of read termini, guided by the theory of coupled transcript boundary selection. This approach provides robust identification of full-length transcript boundaries directly from read data, overcoming the limitations of static annotations and improving alignment with experimentally validated reference datasets. Furthermore, ISAtools employs a truncation-aware quantification model that reassigns read support from partial transcripts to corresponding full-length isoforms, enhancing quantification accuracy across a wide range of sequencing depths. Comprehensive benchmarking on simulated, SIRV spike-in, and real biological datasets demonstrates that ISAtools consistently outperforms state-of-the-art methods in isoform discovery, transcript boundary detection, and computational efficiency. Notably, it maintains high accuracy even under severely reduced annotation conditions, and delivers stable performance across transcriptomic complexities and read depths. 
These results highlight its robustness and broad applicability in transcriptomic studies across model and non-model organisms. ISAtools also enables biologically credible transcript identification. The majority of its predicted isoforms are strongly supported at both the splice junction and transcript boundary levels. When compared to experimentally defined TSS and TES datasets, ISAtools achieves higher concordance than reference-dependent tools, suggesting its utility in discovering sample-specific isoforms or tissue-specific transcriptional regulation. The framework’s modular architecture and lightweight data representation make it readily extensible to large-scale transcriptome studies, including those involving single-cell, multi-tissue, or pan-species comparisons. Despite its strengths, ISAtools has some limitations. Residual errors may persist in regions with extreme splicing complexity, dense repeat elements, or high fragmentation. Further improvements could include enhanced compatibility with Oxford Nanopore datasets, incorporation of additional biological priors such as epigenetic signals or CAGE/polyA-seq peaks, and extension to noisy or sparse single-cell long-read data. Additionally, more sophisticated modeling of exon connectivity and read error profiles could further improve its performance in challenging transcriptome contexts. ISAtools provides an accurate and scalable solution for full-length RNA isoform reconstruction from long-read RNA-seq data. Its consistent performance across varying levels of annotation completeness and transcriptomic complexity underscores its potential as a generalizable framework for isoform-level analysis. By reducing reliance on static reference annotations while maintaining biological fidelity, ISAtools enables high-resolution transcriptome profiling across diverse biological systems, including complex human tissues and non-model organisms. 
Methods Data simulation We used IsoSeqSim to generate simulated PacBio long-read RNA-seq data with a total error rate of 1.6%, comprising 0.4% substitutions, 0.6% deletions, and 0.6% insertions. Simulations were conducted at four sequencing depths: 3M, 6M, 12M, and 20M reads. To model realistic transcript expression distributions, we first applied the quantify.py script from the LRGASP simulation pipeline to estimate transcript-level expression from eight real human and mouse RNA-seq samples. These quantifications were then used to sample transcript abundance for simulation. Based on the sampled profiles, we selected subsets of transcripts from GENCODE v38 (human) and GENCODE vM27 (mouse) to construct simulation-specific reference annotations. These customized annotations were used both to guide read simulation and to serve as ground truth for downstream benchmarking. The resulting datasets reflect biologically informed expression profiles and enable systematic evaluation of tool performance across varying sequencing depths. Isoforms identification evaluation All predicted isoforms with non-zero expression were included for isoform identification evaluation. To ensure consistent comparisons, all predictions and reference annotations were converted into SSC format. Evaluation was conducted at both the SSC level and the transcript boundary level. SSC-level accuracy was quantified using precision, recall, and F1-score, based on exact matches between predicted and ground-truth SSCs. TSS and TES were evaluated by comparing the 5’ and 3’ coordinates of each predicted isoform to the corresponding sites in the reference; predictions were considered correct if both ends were within ± 50 bp of the annotated coordinates. To assess biological support, we further evaluated whether SSCs were directly supported by read alignments. An SSC was defined as supported if at least one full-length read exactly matched its splice junction pattern. 
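The SSC-level metrics and the ± 50 bp boundary criterion described above can be sketched as follows. This is a minimal illustration rather than the evaluation code used in the study: the function names are hypothetical, and each SSC is assumed to be encoded as a hashable key (here a tuple of chromosome, strand, and splice-site coordinates) so that exact matching reduces to set intersection.

```python
def precision_recall_f1(predicted, truth):
    # Exact-match comparison between predicted and ground-truth SSC keys.
    predicted, truth = set(predicted), set(truth)
    true_pos = len(predicted & truth)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def ends_match(pred_start, pred_end, ref_start, ref_end, tol=50):
    # A boundary prediction counts as correct only if BOTH ends fall within tol bp.
    return abs(pred_start - ref_start) <= tol and abs(pred_end - ref_end) <= tol

# Hypothetical SSC keys: (chromosome, strand, splice-site coordinates).
truth = [('chr1', '+', (150, 350, 420, 1420)),
         ('chr1', '+', (150, 350)),
         ('chr2', '-', (900, 1200))]
predicted = [('chr1', '+', (150, 350, 420, 1420)),
             ('chr1', '+', (150, 350)),
             ('chr2', '-', (905, 1200))]  # shifted by 5 bp, so not an exact match

print(precision_recall_f1(predicted, truth))
print(ends_match(1000, 5030, 1020, 5000))
```

Because matching is exact, a single shifted splice site turns a prediction into both a false positive and a false negative, which is why junction polishing has a direct effect on these scores.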
A TSS/TES prediction was considered supported if, in addition to SSC support, its transcript end fell within ± 50 bp of the aligned read boundary. Novel SSC detection was evaluated using 20% and 40% reduced annotation datasets in which a subset of transcripts was randomly removed while retaining at least one isoform per gene. Removed transcripts served as the ground truth for novel isoforms. Predicted isoforms labeled as “novel” were compared against this reference set using the same SSC-level evaluation metrics, enabling systematic assessment of each tool’s sensitivity and specificity for novel transcript detection. For real biological datasets, where global ground truth is unavailable, we used SQANTI3 to classify predicted isoforms and evaluate their concordance with GENCODE annotations. TSS and TES predictions were further validated against experimentally derived sites from the refTSS v3.1 and polyASite v2.0 databases, using a ± 50 bp window as the matching threshold. Isoforms quantification evaluation For the evaluation of quantification performance, only isoforms with non-zero predicted expression and SSC present in the ground truth were considered. For simulated datasets, transcript-level ground truth was defined by the simulation-specific read counts used to generate the data. At the isoform level, we assessed expression concordance between predicted and true values using four standard metrics: Pearson’s correlation coefficient (r), Spearman’s rank correlation coefficient (ρ), concordance correlation coefficient (CCC), and normalized root mean square error (NRMSE). 
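The four metrics named above can be sketched in plain Python as follows; their formal definitions are given in the text that follows. This is an illustrative sketch, not the code used in the study. Function names are hypothetical. Spearman is computed as the Pearson correlation of average ranks, which coincides with the rank-difference formula when there are no ties, and CCC uses population (1/n) variances, which is an assumption since the text does not state the normalization.

```python
import math

def mean(values):
    return sum(values) / len(values)

def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)) * math.sqrt(sum((b - my) ** 2 for b in y))
    return num / den

def average_ranks(values):
    # 1-based ranks; tied values share the average of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

def ccc(x, y):
    # 2 * r * sigma_x * sigma_y equals twice the covariance of x and y.
    mx, my = mean(x), mean(y)
    var_x = mean([(a - mx) ** 2 for a in x])
    var_y = mean([(b - my) ** 2 for b in y])
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return 2 * cov / (var_x + var_y + (mx - my) ** 2)

def nrmse(x, y):
    # RMSE normalized by the range of the ground-truth values y.
    rmse = math.sqrt(mean([(a - b) ** 2 for a, b in zip(x, y)]))
    return rmse / (max(y) - min(y))

predicted = [1.0, 2.0, 3.0]   # hypothetical predicted counts
truth = [2.0, 4.0, 6.0]       # hypothetical ground-truth counts
print(pearson(predicted, truth), spearman(predicted, truth))
print(ccc(predicted, truth), nrmse(predicted, truth))
```

In this toy example the predictions are exactly half the true counts, so both correlations are perfect while CCC and NRMSE still register the systematic underestimation. This is the distinction the text draws between trend-tracking and concordance in absolute expression estimates.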
Let \\(x_i\\) and \\(y_i\\) denote the predicted and ground-truth counts for the \\(i^{th}\\) isoform: $$r=\\frac{\\sum_{i}\\left(x_i-\\bar{x}\\right)\\left(y_i-\\bar{y}\\right)}{\\sqrt{\\sum_{i}\\left(x_i-\\bar{x}\\right)^{2}}\\cdot\\sqrt{\\sum_{i}\\left(y_i-\\bar{y}\\right)^{2}}}$$ Pearson’s r ranges from −1 to 1, with 1 indicating perfect linear correlation. $$\\rho=1-\\frac{6\\sum d_i^{2}}{n(n^{2}-1)}$$ where \\(d_i\\) is the rank difference between \\(x_i\\) and \\(y_i\\) and \\(n\\) is the number of isoforms. Spearman’s \\(\\rho\\) evaluates monotonic relationships and is robust to outliers; \\(\\rho=1\\) indicates perfect rank correlation. $$CCC=\\frac{2r\\sigma_x\\sigma_y}{\\sigma_x^{2}+\\sigma_y^{2}+\\left(\\bar{x}-\\bar{y}\\right)^{2}}$$ where \\(\\sigma_x\\) and \\(\\sigma_y\\) are the standard deviations of the predicted and true values, respectively. CCC combines correlation and agreement, and equals 1 when predictions exactly match the ground truth in both scale and location. $$NRMSE=\\frac{\\sqrt{\\frac{1}{n}\\sum_{i}\\left(x_i-y_i\\right)^{2}}}{\\max(y)-\\min(y)}$$ NRMSE measures normalized estimation error; a value of 0 indicates perfect agreement with the ground truth. ISAtools algorithm ISAtools is a computational framework for full-length isoform reconstruction and quantification from long-read RNA-seq data. It requires a reference genome and splice-aware alignments (BAM/SAM), with an optional gene annotation. 
The pipeline consists of six main modules: (1) extraction of splice site chains (SSC), (2) read preprocessing, (3) SSC grouping and polishing, (4) SSC filtering and correction, (5) TSS/TES identification, and (6) isoform quantification. Below, we describe the key aspects of all six procedures. SSC extraction. For each aligned read, ISAtools parses the CIGAR string to extract spliced segments and represent them as an ordered list of splice sites, defined as the SSC. In parallel, alignment identity and coverage are computed. Identity is calculated as $$\\text{Identity}=1-\\frac{N_{\\text{mismatch}}}{N_{\\text{aligned}}}$$ where \\(N_{\\text{mismatch}}\\) denotes the number of mismatched bases and \\(N_{\\text{aligned}}\\) includes matched and inserted bases. Coverage is computed as $$\\text{Coverage}=\\frac{L-\\left(N_S+N_H\\right)}{L}$$ where \\(L\\) is the full read length and \\(N_S\\), \\(N_H\\) are the soft- and hard-clipped bases, respectively. Reads with identical SSCs are grouped, and their splice donor/acceptor motifs are retrieved from the reference genome. ISAtools outputs two tab-delimited SSC files: one at the read level (with alignment metrics), and one at the unique SSC level (with aggregated frequency and sequence features). Reference annotations, if available, are converted to the same format to enable downstream comparison and integration. SSC Polishing. To reduce splicing noise and improve junction precision, ISAtools performs SSC-level polishing within each transcript group. 
A group is defined as a set of SSCs sharing the same start and end splice sites (i.e., the 5’ splice site of the first exon and the 3’ splice site of the last exon), representing a common transcriptional locus. Within each group, SSCs containing non-canonical splice motifs (i.e., not GT-AG, GC-AG, or AT-AC) are discarded if their read support accounts for < 25% of the group’s total reads, as such junctions are likely spurious. Next, a consensus-based splice site refinement is applied. For each cluster of splice sites lying within ± k bp of one another (default k = 10), ISAtools calculates the support frequency \\(f_i\\) of each site \\(i\\) and retains only sites satisfying: $$\\frac{f_i}{f_{max}}\\ge\\theta$$ where \\(f_{max}\\) is the highest frequency in the cluster and \\(\\theta\\) is a user-defined threshold (default: 0.1). SSCs containing low-support splice sites are filtered out, improving both the accuracy and the consistency of the splicing structure.

NIC/NNC filtering and correction. To reduce false-positive classification of novel isoforms caused by minor junction mismatches, ISAtools constructs a local splice graph within each SSC group, defined here as all SSCs whose start and end splice sites overlap within ± 100 bp. Each splice site is treated as a node, and observed splice junctions are added as directed edges, weighted by relative read support: $$\\omega_{i\\to j}=\\frac{f_{i\\to j}}{\\sum f}$$ where \\(f_{i\\to j}\\) is the number of reads supporting the junction from site \\(i\\) to site \\(j\\), and \\(\\sum f\\) is the total support across the group. Four common error types are corrected by traversing weakly connected subgraphs: (1) Exon shift mismatch: If two SSCs differ only by a small, same-direction shift at one or more splice junctions, and their transcript lengths fall within a preset range, the minor variant is corrected to match the dominant (higher support) path.
(2) Single splice site errors: Splice sites located within 10 bp of each other are clustered; a low-frequency site is removed if $$f_{\\text{low}}\\le\\alpha\\cdot f_{\\text{high}}$$ where \\(\\alpha\\) is a frequency ratio threshold (default 0.01). (3) Skipped microexon correction: If an exon of ≤ 30 bp is absent in one SSC but fully supported in another within the same group, the exon is restored in the truncated path. (4) Small exon mismatch correction: When SSCs differ by ≤ 10 bp in internal exon length, the lower-frequency variant is adjusted to match the high-confidence junction configuration. This graph-based correction improves the precision of novel isoform classification by eliminating artifacts caused by alignment shifts, low-frequency junctions, and annotation inconsistencies in small exons.

ISM filtering. To suppress false positives arising from transcript truncation, ISAtools implements a frequency-based strategy to identify and filter SSCs classified as ISM. Within each transcript group, we examine whether a given SSC is a strict subset of a longer SSC, that is, whether it shares internal splice sites with the longer SSC but lacks its 5’ or 3’ end. If so, the longer SSC is treated as a candidate full-length source transcript, and a relative frequency score is computed as: $$\\Delta f=\\log_{10}(f_{\\text{source}}+\\epsilon)-\\log_{10}(f_{\\text{ISM}}+\\epsilon)$$ where \\(f_{\\text{ISM}}\\) and \\(f_{\\text{source}}\\) denote the read support of the truncated and full-length SSCs, respectively, and \\(\\epsilon\\) is a small pseudo-count that avoids taking the logarithm of zero. A truncated SSC is filtered if: $$\\Delta f>\\tau$$ where \\(\\tau\\) is a user-defined threshold (default: 0). With the default threshold, a truncated SSC is therefore removed whenever its candidate source has higher read support. This filtering removes lowly expressed fragments nested within dominant isoforms, improving the specificity of full-length isoform reconstruction.

TSS/TES identification.
To resolve TSS and TES with high precision, ISAtools performs read-end clustering within each unique SSC. Given that reads supporting the same splice structure may differ in their 5’ or 3’ termini due to biological variability or incomplete capture, we use a density-based clustering algorithm (DBSCAN) to identify consensus transcript boundaries. For each SSC, ISAtools clusters the aligned 5’ start and 3’ end coordinates of all supporting reads. The mode of each cluster is used to define candidate TSS and TES positions. If only a single cluster is detected at each end, the corresponding positions are directly assigned as transcript boundaries. When multiple TSS–TES combinations are observed, ISAtools reassigns read support based on observed pairing frequency. The adjusted support for a given (TSS, TES) pair is computed as: $$f'_{(\\text{TSS},\\text{TES})}=f_{\\text{SSC}}\\cdot\\frac{n_{(\\text{TSS},\\text{TES})}}{\\sum n_{(\\text{TSS},\\text{TES})}}$$ where \\(f_{\\text{SSC}}\\) is the total read support for the SSC, \\(n_{(\\text{TSS},\\text{TES})}\\) is the number of reads exactly matching that boundary pair, and the sum in the denominator runs over all observed pairs. If no observed read directly supports a clustered TSS-TES pair, ISAtools applies a conservative fallback: it selects the most upstream TSS and most downstream TES from the clustered candidates, ensuring that truncated isoforms are not over-inferred due to sparse or noisy boundary signals.

Isoform quantification. To quantify transcript expression, ISAtools assigns raw read counts to each SSC based on the number of full-length reads that exactly match its splice structure and transcript boundaries. To correct for truncated isoforms misclassified as independent transcripts, a truncation-aware reallocation strategy is applied. For each truncated SSC, ISAtools identifies candidate full-length isoforms that contain the truncated structure as a strict subset.
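The candidate search in the last sentence can be sketched as a contiguous sub-chain test. This is a minimal illustration, assuming that each SSC is stored as a tuple of splice-site coordinates and that "strict subset" means a contiguous run of a longer chain; the function names are hypothetical, and ISAtools may define compatibility differently.

```python
def is_truncation_of(short, full):
    # True when `short` is a contiguous run of `full` and strictly shorter.
    n, m = len(short), len(full)
    if n == 0 or n >= m:
        return False  # empty chains and equal-length chains are not truncations
    return any(full[i:i + n] == short for i in range(m - n + 1))


def candidate_sources(truncated, sscs):
    # All chains in `sscs` that could be the full-length source of `truncated`.
    return [s for s in sscs if is_truncation_of(truncated, s)]
```

Requiring contiguity keeps a chain that skips an internal junction (an exon-skipping isoform, for example) from being treated as a truncation of the longer isoform.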
The observed frequency of the truncated SSC is redistributed to these full-length isoforms in proportion to their relative abundance. The adjusted expression level of transcript \\(T\\) is computed as: $$Q_T=\\operatorname{round}\\left(f_T+\\lambda\\cdot\\sum_{s\\in S} f_s\\cdot\\frac{f_T}{\\sum_{T'\\in\\mathcal{T}} f_{T'}}\\right)$$ where \\(f_T\\) is the original read support for transcript \\(T\\), \\(S\\) is the set of truncated SSCs associated with \\(T\\), \\(f_s\\) is the read support for truncated SSC \\(s\\), \\(\\mathcal{T}\\) is the set of full-length isoforms compatible with \\(s\\), \\(\\lambda\\) is a weighting parameter (default: 1), and \\(\\operatorname{round}(\\cdot)\\) ensures integer-valued expression estimates.

Reference-guided rescue and filtering. When gene annotations are available, ISAtools optionally applies a rescue and filtering module to refine SSC-based isoform predictions by integrating static reference information. This step improves the retention of low-abundance isoforms that align with known annotations, while filtering unsupported or truncated transcripts. First, all predicted and reference isoforms are merged and clustered by shared genomic coordinates (chromosome, strand, and SSC structure).
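Returning briefly to quantification, the reallocation formula for \\(Q_T\\) above can be illustrated with a small sketch. It rests on several assumptions that are not taken from ISAtools: SSCs are tuples of splice-site coordinates, compatibility means that the truncated chain is a contiguous run of the full-length chain, and the function names are hypothetical. Python’s built-in `round` rounds exact halves to the nearest even integer, which may differ from the rounding rule used by the tool.

```python
def contains(full, short):
    # True when `short` is a contiguous, strictly shorter run of `full`.
    n, m = len(short), len(full)
    if n == 0 or n >= m:
        return False
    return any(full[i:i + n] == short for i in range(m - n + 1))


def reallocate(full_counts, truncated_counts, lam=1.0):
    # full_counts / truncated_counts map an SSC tuple to its read support.
    extra = {t: 0.0 for t in full_counts}
    for s, f_s in truncated_counts.items():
        # Full-length isoforms compatible with the truncated chain s.
        compatible = [t for t in full_counts if contains(t, s)]
        total = sum(full_counts[t] for t in compatible)
        if total == 0:
            continue  # no supported source: nothing to redistribute
        for t in compatible:
            # Share of f_s proportional to the isoform's own support.
            extra[t] += f_s * full_counts[t] / total
    return {t: round(full_counts[t] + lam * extra[t]) for t in full_counts}
```

For example, 8 reads of a fragment shared by two isoforms with 30 and 10 reads are split 6 to 2, giving adjusted counts of 36 and 12 when λ = 1.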
For each cluster \\(G\\), the total expression level is calculated as: $$S_{\\text{group}}=\\sum_{i\\in G} f_i$$ and a binary indicator marks whether any isoform in the cluster matches a reference transcript: $$R_{\\text{group}}=\\mathbf{1}\\{\\exists\\, i\\in G: i\\in\\text{Reference}\\}$$ Clusters whose total expression falls below the 25th percentile of \\(S_{\\text{group}}\\) across all clusters are considered lowly expressed and are discarded if they are not supported by the reference annotation: $$S_{\\text{group}}<\\text{Percentile}_{25}(S)\\ \\text{and}\\ R_{\\text{group}}=0$$ For ISM-classified isoforms, ISAtools further computes their relative abundance within the cluster: $$p_i=\\frac{f_i}{\\sum_{j\\in G} f_j}$$ and removes those below a predefined threshold \\(\\tau\\) (default: 0.01), reducing the contribution of truncated fragments. For novel isoforms (NIC/NNC), each splice site \\(s_q\\) in the SSC \\(S_q\\) is aligned to the nearest annotated splice site in \\(S_r\\) via binary search: $$\\widehat{s}_q=\\arg\\min_{s_r\\in S_r}\\left|s_q-s_r\\right|,\\quad\\forall s_q\\in S_q$$ Based on this mapping, ISAtools applies structural corrections for small shifts, skipped microexons, and spurious exon insertions, constrained by user-defined deviation thresholds.
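The nearest-annotated-site lookup described above can be sketched with Python’s standard `bisect` module. This is a minimal illustration under stated assumptions rather than the ISAtools code: the function names, the tie rule (ties go to the smaller coordinate) and the `max_shift` value of 10 bp are all hypothetical stand-ins for the user-defined deviation thresholds mentioned in the text.

```python
from bisect import bisect_left


def nearest_annotated(site, annotated):
    # `annotated` must be a non-empty list of coordinates sorted ascending.
    i = bisect_left(annotated, site)
    if i == 0:
        return annotated[0]
    if i == len(annotated):
        return annotated[-1]
    before, after = annotated[i - 1], annotated[i]
    # Ties go to the smaller coordinate (an assumed convention).
    return before if site - before <= after - site else after


def snap_chain(chain, annotated, max_shift=10):
    # Move each site to its nearest annotated site only if within max_shift bp.
    snapped = []
    for s in chain:
        r = nearest_annotated(s, annotated)
        snapped.append(r if abs(r - s) <= max_shift else s)
    return tuple(snapped)
```

Binary search keeps each lookup logarithmic in the number of annotated sites, which matters when every splice site of every novel SSC is compared against a genome-wide annotation.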
Read preprocessing

Minimap2 (version 2.28-r1209)
minimap2 -G 400k -ax splice:hq -uf -t 20 genome.fasta reads.fasta > aligned.sam
samtools view -b -o aligned.bam aligned.sam
samtools sort -o aligned.sorted.bam aligned.bam
samtools index aligned.sorted.bam

ISAtools
With annotation: python isatools.py -r genome.fasta --bam aligned.sorted.bam -g annotation.gtf -o output -t 20
Without annotation: python isatools.py -r genome.fasta --bam aligned.sorted.bam -o output -t 20

IsoQuant (version 3.6.3)
With annotation: isoquant.py --reference genome.fasta --bam aligned.sorted.bam --data_type pacbio_ccs -o output --genedb annotation.gtf -t 20 --complete_genedb
Without annotation: isoquant.py --reference genome.fasta --bam aligned.sorted.bam --data_type pacbio_ccs -o output -t 20 --complete_genedb

Bambu (version 3.8.0)
library(bambu)
With annotation:
bambuAnnotations <- prepareAnnotations('annotation.gtf')
se <- bambu(reads = 'aligned.sorted.bam', annotations = bambuAnnotations, genome = 'genome.fasta', ncore = 20)
Without annotation:
se <- bambu(reads = 'aligned.sorted.bam', annotations = NULL, genome = 'genome.fasta', ncore = 20, NDR = 1)

Mandalorion (version 5.2.2)
With annotation: python Mando.py -p output -G genome.fasta -f reads.fasta -g annotation.gtf -t 20
Without annotation: python Mando.py -p output -G genome.fasta -f reads.fasta -t 20

SQANTI analysis (version 4.6.0)
sqanti3_qc.py gencode.annotation.gtf reference_genome.fa --cpus 20 --force_id_ignore --skipORF --dir sqanti3_out -o sqanti3 --report skip

Declarations

Data availability

All datasets used in this study are publicly available. PacBio long-read sequencing data from Mus musculus cortex and heart tissues, used for IsoSeqSim simulations, were obtained from the ENCODE database under file IDs: ENCFF565RLW, ENCFF325BXV, ENCFF584WWA, and ENCFF860CBL.
Human prefrontal cortex and left/right ventricular heart tissue data were also obtained from ENCODE under file IDs: ENCFF708BOP, ENCFF827DUW, ENCFF537NCV, and ENCFF615FIC. The SIRV dataset was derived from the SIRV subset of the PacBio Iso-Seq UHRR collection, available at https://downloads.pacbcloud.com/public/dataset/UHR_IsoSeq/. The high-depth PacBio UHRR dataset (Revio platform, SPRQ chemistry) is available at https://downloads.pacbcloud.com/public/dataset/Kinnex-full-length-RNA/DATA-RevioSPRQ-UHRR2024/. Source data for all figures are provided as Supplementary Data.

Code availability

ISAtools is available at: https://github.com/Chenhu7/ISAtools.

Acknowledgements

We acknowledge financial support from the National Natural Science Foundation of China (42107148, 62172369) and the Special Support Plan for High Level Talents in Zhejiang Province (2021R52019). Figure 1A includes elements adapted from Servier Medical Art (https://smart.servier.com/), provided by Servier and licensed under a Creative Commons Attribution 4.0 International License.

Author Contributions Statement

Z.X.S. and Q.D. conceived and designed the project. H.C. and Z.X.S. developed the ISAtools software, performed the informatics analysis, coordinated data release and assisted with executing the pipeline. H.C. and Z.X.S. wrote the manuscript and created the figures. All authors have read and approved the final version of this manuscript.

Competing Interests Statement

The authors declare no competing interests.

References

1. Pardo-Palacios FJ et al (2024) Systematic assessment of long-read RNA-seq methods for transcript identification and quantification. Nat Methods 21:1349–1363. https://doi.org/10.1038/s41592-024-02298-3
2. Sharon D, Tilgner H, Grubert F, Snyder M (2013) A single-molecule long-read survey of the human transcriptome. Nat Biotechnol 31:1009–1014. https://doi.org/10.1038/nbt.2705
3. Tilgner H, Grubert F, Sharon D, Snyder MP (2014) Defining a personal, allele-specific, and single-molecule long-read transcriptome. Proc Natl Acad Sci U S A 111:9869–9874. https://doi.org/10.1073/pnas.1400447111
4. Kovaka S et al (2019) Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol 20:278. https://doi.org/10.1186/s13059-019-1910-1
5. Tang AD et al (2020) Full-length transcript characterization of SF3B1 mutation in chronic lymphocytic leukemia reveals downregulation of retained introns. Nat Commun 11:1438. https://doi.org/10.1038/s41467-020-15171-6
6. Amarasinghe SL et al (2020) Opportunities and challenges in long-read sequencing data analysis. Genome Biol 21. https://doi.org/10.1186/s13059-020-1935-5
7. Burgess DJ (2018) Genomics: Next regeneration sequencing for reference genomes. Nat Rev Genet 19:125. https://doi.org/10.1038/nrg.2018.5
8. Pollard MO, Gurdasani D, Mentzer AJ, Porter T, Sandhu MS (2018) Long reads: their purpose and place. Hum Mol Genet 27:R234–R241. https://doi.org/10.1093/hmg/ddy177
9. Tardaguila M et al (2018) SQANTI: extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification. Genome Res 28:396–411. https://doi.org/10.1101/gr.222976.117
10. Reese MG et al (2000) Genome annotation assessment in Drosophila melanogaster. Genome Res 10:483–501. https://doi.org/10.1101/gr.10.4.483
11. Guigo R et al (2006) EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol 7(1):1–31. https://doi.org/10.1186/gb-2006-7-s1-s2
12. Engstrom PG et al (2013) Systematic evaluation of spliced alignment programs for RNA-seq data. Nat Methods 10:1185–1191. https://doi.org/10.1038/nmeth.2722
13. Steijger T et al (2013) Assessment of transcript reconstruction methods for RNA-seq. Nat Methods 10:1177–1184. https://doi.org/10.1038/nmeth.2714
14. Weirather JL et al (2017) Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis. F1000Res 6:100. https://doi.org/10.12688/f1000research.10571.2
15. Soneson C et al (2019) A comprehensive examination of Nanopore native RNA sequencing for characterization of complex transcriptomes. Nat Commun 10:3359. https://doi.org/10.1038/s41467-019-11272-z
16. Manuel JG et al (2023) High Coverage Highly Accurate Long-Read Sequencing of a Mouse Neuronal Cell Line Using the PacBio Revio Sequencer. bioRxiv. https://doi.org/10.1101/2023.06.06.543940
17. Prjibelski AD et al (2023) Accurate isoform discovery with IsoQuant using long reads. Nat Biotechnol 41:915–918. https://doi.org/10.1038/s41587-022-01565-y
18. Chen Y et al (2023) Context-aware transcript quantification from long-read RNA-seq data with Bambu. Nat Methods 20:1187–1195. https://doi.org/10.1038/s41592-023-01908-w
19. Volden R et al (2023) Identifying and quantifying isoforms from accurate full-length transcriptome sequencing reads with Mandalorion. Genome Biol 24:167. https://doi.org/10.1186/s13059-023-02999-6
20. Alfonso-Gonzalez C et al (2023) Sites of transcription initiation drive mRNA isoform selection. Cell 186:2438–2455 e2422. https://doi.org/10.1016/j.cell.2023.04.012
21. Pardo-Palacios FJ et al (2024) SQANTI3: curation of long-read transcriptomes for accurate identification of known and novel isoforms. Nat Methods 21:793–797. https://doi.org/10.1038/s41592-024-02229-2
22. Abugessaisa I et al (2019) refTSS: A Reference Data Set for Human and Mouse Transcription Start Sites. J Mol Biol 431:2407–2422. https://doi.org/10.1016/j.jmb.2019.04.045
23. Herrmann CJ et al (2020) PolyASite 2.0: a consolidated atlas of polyadenylation sites from 3' end sequencing. Nucleic Acids Res 48:D174–D179. https://doi.org/10.1093/nar/gkz918
24. Monzo C, Liu T, Conesa A (2025) Transcriptomics in the era of long-read sequencing. Nat Rev Genet. https://doi.org/10.1038/s41576-025-00828-z

Additional Declarations

There is NO Competing Interest.
Supplementary Files SupplementaryData.xlsx Supplementary Data SupplementaryInformation.pdf Supplementary Information Cite Share Download PDF Status: Posted Version 1 posted You are reading this latest preprint version Research Square lets you share your work early, gain feedback from the community, and start making changes to your manuscript prior to peer review in a journal. As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {\"props\":{\"pageProps\":{\"initialData\":{\"identity\":\"rs-7019918\",\"acceptedTermsAndConditions\":true,\"allowDirectSubmit\":true,\"archivedVersions\":[],\"articleType\":\"Article\",\"associatedPublications\":[],\"authors\":[{\"id\":490250450,\"identity\":\"d55cee77-0843-4e09-a5bd-758b35dc7391\",\"order_by\":0,\"name\":\"Zhuo-Xing Shi\",\"email\":\"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA5ElEQVRIiWNgGAWjYDACCRBhwMDAxt588MEHAxs74rXw8RxLNpxRkJZMpBYgkJPIMZPm+XCIsYGQDvnZPYaPeQru2LVJpCUb2xgcYGZgP3x0Az4tjHPOGBvOMHiW3Mbz+ODjHIM7fAw8aWk38GlhBrpH4oPB4WQ2dqAtOQbPmBkkeMzwamEDaUkAaWEA+sXC4DBjAyEtPFBb7Ng4gFoYiNEiIZFWDPTL4QQ2UCD3GKQlsxHyi/yM5I2Pef4ctpdvB0bljz82dvzsh4/h1QIDiQ1w3xGjHATsiVU4CkbBKBgFIxAAACwHRuNwntgLAAAAAElFTkSuQmCC\",\"orcid\":\"\",\"institution\":\"Sun Yat-sen 
University\",\"correspondingAuthor\":true,\"prefix\":\"\",\"firstName\":\"Zhuo-Xing\",\"middleName\":\"\",\"lastName\":\"Shi\",\"suffix\":\"\"},{\"id\":490250451,\"identity\":\"fba44365-df4c-448a-9e08-c776cf8a8c60\",\"order_by\":1,\"name\":\"Hu Chen\",\"email\":\"\",\"orcid\":\"\",\"institution\":\"Zhejiang Sci-Tech University\",\"correspondingAuthor\":false,\"prefix\":\"\",\"firstName\":\"Hu\",\"middleName\":\"\",\"lastName\":\"Chen\",\"suffix\":\"\"},{\"id\":490250452,\"identity\":\"b4c08a86-bd1d-4298-b673-fc7acf2f7903\",\"order_by\":2,\"name\":\"Qi Dai\",\"email\":\"\",\"orcid\":\"\",\"institution\":\"Zhejiang Sci-Tech University\",\"correspondingAuthor\":false,\"prefix\":\"\",\"firstName\":\"Qi\",\"middleName\":\"\",\"lastName\":\"Dai\",\"suffix\":\"\"}],\"badges\":[],\"createdAt\":\"2025-07-01 11:45:40\",\"currentVersionCode\":1,\"declarations\":\"\",\"doi\":\"10.21203/rs.3.rs-7019918/v1\",\"doiUrl\":\"https://doi.org/10.21203/rs.3.rs-7019918/v1\",\"draftVersion\":[],\"editorialEvents\":[],\"editorialNote\":\"\",\"failedWorkflow\":false,\"files\":[{\"id\":88508133,\"identity\":\"4a30bf09-5001-4991-9ee1-c22d3629dacc\",\"added_by\":\"auto\",\"created_at\":\"2025-08-07 07:40:49\",\"extension\":\"png\",\"order_by\":1,\"title\":\"Figure 1\",\"display\":\"\",\"copyAsset\":false,\"role\":\"figure\",\"size\":380426,\"visible\":true,\"origin\":\"\",\"legend\":\"\\u003cp\\u003e\\u003cstrong\\u003eOverview of the ISAtools framework and benchmarking design. a\\u003c/strong\\u003e, Dataset overview. A total of 512 simulated datasets were generated using IsoSeqSim based on PacBio error profiles, spanning multiple tissues, species, and sequencing depths, each with corresponding ground-truth annotations. Experimental data included publicly available UHRR long-read RNA-seq samples (Revio platform) and SIRV spike-in controls. \\u003cstrong\\u003eb\\u003c/strong\\u003e, Multi-scale evaluation strategy. 
Benchmarking covered isoform-level reconstruction accuracy on simulated and SIRV datasets, annotation concordance in real data, read-level support, computational efficiency (runtime and memory usage), and overall robustness summarized using radar plots. \\u003cstrong\\u003ec\\u003c/strong\\u003e, Schematic of the ISAtools pipeline. The workflow includes SSC extraction, read preprocessing, grouping and polishing of SSCs, filtering and correction of artifacts, TSS/TES identification via read-end clustering, and isoform quantification with truncation-aware adjustment. See Methods for details.\\u003c/p\\u003e\",\"description\":\"\",\"filename\":\"floatimage1.png\",\"url\":\"https://assets-eu.researchsquare.com/files/rs-7019918/v1/b1d100d2de1b57be9d999df3.png\"},{\"id\":88508148,\"identity\":\"d60208a8-1d72-4aae-a201-5088f226a727\",\"added_by\":\"auto\",\"created_at\":\"2025-08-07 07:40:54\",\"extension\":\"png\",\"order_by\":2,\"title\":\"Figure 2\",\"display\":\"\",\"copyAsset\":false,\"role\":\"figure\",\"size\":386558,\"visible\":true,\"origin\":\"\",\"legend\":\"\\u003cp\\u003e\\u003cstrong\\u003eSSC detection accuracy across annotation contexts and transcriptomic complexity. a\\u003c/strong\\u003e, F1-scores for SSC identification across ISAtools, IsoQuant, Bambu, and Mandalorion under FullAnno, ReduceAnno, and NoAnno conditions. \\u003cstrong\\u003eb\\u003c/strong\\u003e, Read-level support for predicted SSCs, measured as the proportion supported by at least one full-length read. \\u003cstrong\\u003ec\\u003c/strong\\u003e, F1-scores for novel SSC detection using 20% and 40% reduced annotations, with ground truth defined by transcripts removed from the reference. 
\\u003cstrong\\u003ed\\u003c/strong\\u003e, SSC-level detection accuracy on the SIRV spike-in dataset across varying annotation completeness.\\u003c/p\\u003e\",\"description\":\"\",\"filename\":\"floatimage2.png\",\"url\":\"https://assets-eu.researchsquare.com/files/rs-7019918/v1/a57c158863e870d1fc5c5252.png\"},{\"id\":88508145,\"identity\":\"f48681cd-1110-4aab-aca9-84e538ba53cf\",\"added_by\":\"auto\",\"created_at\":\"2025-08-07 07:40:53\",\"extension\":\"png\",\"order_by\":3,\"title\":\"Figure 3\",\"display\":\"\",\"copyAsset\":false,\"role\":\"figure\",\"size\":559957,\"visible\":true,\"origin\":\"\",\"legend\":\"\\u003cp\\u003e\\u003cstrong\\u003eEvaluation of TSS and TES detection across annotation contexts and transcriptomic complexity. a, \\u003c/strong\\u003eSchematic of the ISAtools TSS/TES detection strategy. Read 5’ and 3’ ends are clustered using a density-based algorithm (DBSCAN) to identify candidate boundaries, followed by read-level reassignment to representative TSS-TES pairs. \\u003cstrong\\u003eb,\\u003c/strong\\u003e F1-scores for TSS/TES prediction across ISAtools, IsoQuant, Bambu, and Mandalorion under FullAnno, ReduceAnno, and NoAnno conditions in simulated datasets. \\u003cstrong\\u003ec\\u003c/strong\\u003e, Fraction of predicted TSS/TES supported by read termini (within ±50 bp), indicating alignment-based evidence. \\u003cstrong\\u003ed\\u003c/strong\\u003e, Sensitivity of each tool to boundary annotation, evaluated by comparing prediction accuracy before and after extending reference TSS/TES positions by ±100 bp. 
\\u003cstrong\\u003ee\\u003c/strong\\u003e, Performance on the SIRV spike-in dataset, reflecting boundary detection robustness in synthetic high-complexity transcriptomes.\\u003c/p\\u003e\",\"description\":\"\",\"filename\":\"floatimage3.png\",\"url\":\"https://assets-eu.researchsquare.com/files/rs-7019918/v1/0cff26f1732bcc8a14d459ff.png\"},{\"id\":88508043,\"identity\":\"b26db21d-8092-4410-aa49-eff13700301d\",\"added_by\":\"auto\",\"created_at\":\"2025-08-07 07:40:39\",\"extension\":\"png\",\"order_by\":4,\"title\":\"Figure 4\",\"display\":\"\",\"copyAsset\":false,\"role\":\"figure\",\"size\":974250,\"visible\":true,\"origin\":\"\",\"legend\":\"\\u003cp\\u003e\\u003cstrong\\u003eIsoform quantification performance on simulated datasets. \\u003c/strong\\u003eQuantification accuracy was evaluated using four metrics-CCC, NRMSE, Pearson, and Spearman correlations-across simulated human (\\u003cstrong\\u003ea\\u003c/strong\\u003e) and mouse (\\u003cstrong\\u003eb\\u003c/strong\\u003e) datasets (n = 32 per species). Gray dots represent individual datasets, and gray lines connect results from the same simulated sample under different annotation conditions (FullAnno, ReduceAnno, NoAnno). 
The central line indicates the median; boxes represent the interquartile range.\\u003c/p\\u003e\",\"description\":\"\",\"filename\":\"floatimage4.png\",\"url\":\"https://assets-eu.researchsquare.com/files/rs-7019918/v1/24b7a1b44c93586307e6b559.png\"},{\"id\":88508150,\"identity\":\"c8007a47-97e3-471e-b687-e286b0d540bf\",\"added_by\":\"auto\",\"created_at\":\"2025-08-07 07:40:55\",\"extension\":\"png\",\"order_by\":5,\"title\":\"Figure 5\",\"display\":\"\",\"copyAsset\":false,\"role\":\"figure\",\"size\":295621,\"visible\":true,\"origin\":\"\",\"legend\":\"\\u003cp\\u003e\\u003cstrong\\u003ePerformance of isoform reconstruction and boundary detection on real long-read RNA-seq datasets.\\u003c/strong\\u003e \\u003cstrong\\u003ea\\u003c/strong\\u003e, Isoform classification results from SQANTI3 using GENCODE v38 as reference, under FullAnno and NoAnno conditions. Bar plots show the proportion of each SQANTI category; overlaid lines indicate the total number of predicted isoforms. \\u003cstrong\\u003eb\\u003c/strong\\u003e, Read-level support for predicted isoforms, assessed at two levels: SSC support, defined as at least one full-length read matching the splice site chain; and SSC+ends support (TSS/TES level), which additionally requires read termini to fall within ±50 bp of the predicted transcript start and end sites. \\u003cstrong\\u003ec\\u003c/strong\\u003e, Overlap of unique SSCs predicted under FullAnno (FA) and NoAnno (NA) conditions. \\u003cstrong\\u003ed\\u003c/strong\\u003e, Distribution of unique SSCs detected exclusively under FullAnno or NoAnno conditions. 
\\u003cstrong\\u003ee\\u003c/strong\\u003e, Accuracy of TSS and TES predictions, benchmarked against experimentally defined sites from refTSS v3.1 and polyASite v2.0.\\u003c/p\\u003e\",\"description\":\"\",\"filename\":\"floatimage5.png\",\"url\":\"https://assets-eu.researchsquare.com/files/rs-7019918/v1/9d3bb6d3d895d425e77a1837.png\"},{\"id\":88508049,\"identity\":\"e4d8bc28-2352-4769-b4a6-4686364bf32f\",\"added_by\":\"auto\",\"created_at\":\"2025-08-07 07:40:41\",\"extension\":\"png\",\"order_by\":6,\"title\":\"Figure 6\",\"display\":\"\",\"copyAsset\":false,\"role\":\"figure\",\"size\":384118,\"visible\":true,\"origin\":\"\",\"legend\":\"\\u003cp\\u003e\\u003cstrong\\u003eComputational efficiency of isoform reconstruction tools.\\u003c/strong\\u003e \\u003cstrong\\u003ea-b\\u003c/strong\\u003e, Runtime and peak memory usage across 512 simulated datasets under different annotation conditions. Mandalorion was evaluated using only its PDFQ module to ensure comparability. \\u003cstrong\\u003ec-d\\u003c/strong\\u003e, Runtime and memory usage for multi-sample analysis of the UHRR real dataset (~80 million cDNA reads). ISAtools completes analysis in under 21 minutes with peak memory usage below 6.5 GB.\\u003c/p\\u003e\",\"description\":\"\",\"filename\":\"floatimage6.png\",\"url\":\"https://assets-eu.researchsquare.com/files/rs-7019918/v1/787e2b8ff1f69f5a38409878.png\"},{\"id\":88509413,\"identity\":\"f67068db-9fc2-457f-b4ab-5cb6f64f2326\",\"added_by\":\"auto\",\"created_at\":\"2025-08-07 07:48:40\",\"extension\":\"png\",\"order_by\":7,\"title\":\"Figure 7\",\"display\":\"\",\"copyAsset\":false,\"role\":\"figure\",\"size\":159986,\"visible\":true,\"origin\":\"\",\"legend\":\"\\u003cp\\u003e\\u003cstrong\\u003eComprehensive assessment across evaluation dimensions. 
\\u003c/strong\\u003eOverall performance was evaluated across six key metrics under both FullAnno and NoAnno conditions: F1-score for SSC and TSS/TES detection, read-level support for SSCs and transcript boundaries, computational speed, and memory efficiency. F1-scores were averaged across simulated datasets per annotation condition; support scores were computed as the mean of simulated and real datasets. Speed and memory metrics were averaged across datasets and normalized across tools. ISAtools shows balanced and robust performance across all dimensions.\\u003c/p\\u003e\",\"description\":\"\",\"filename\":\"floatimage7.png\",\"url\":\"https://assets-eu.researchsquare.com/files/rs-7019918/v1/9a79ec8ead22deddb9176c41.png\"},{\"id\":89171136,\"identity\":\"879ea9f4-d585-45cd-acb9-fd8b54ca5557\",\"added_by\":\"auto\",\"created_at\":\"2025-08-15 19:23:27\",\"extension\":\"pdf\",\"order_by\":0,\"title\":\"\",\"display\":\"\",\"copyAsset\":false,\"role\":\"manuscript-pdf\",\"size\":4055269,\"visible\":true,\"origin\":\"\",\"legend\":\"\",\"description\":\"\",\"filename\":\"manuscript.pdf\",\"url\":\"https://assets-eu.researchsquare.com/files/rs-7019918/v1/c1afabc6-6a84-4127-b59d-7fc0b9797cbb.pdf\"},{\"id\":88508048,\"identity\":\"66d5040e-0e08-486b-902d-4f14c1323730\",\"added_by\":\"auto\",\"created_at\":\"2025-08-07 07:40:41\",\"extension\":\"xlsx\",\"order_by\":1,\"title\":\"\",\"display\":\"\",\"copyAsset\":false,\"role\":\"supplement\",\"size\":306273,\"visible\":true,\"origin\":\"\",\"legend\":\"Supplementary Data\",\"description\":\"\",\"filename\":\"SupplementaryData.xlsx\",\"url\":\"https://assets-eu.researchsquare.com/files/rs-7019918/v1/d18f961c17544100e82e1b2d.xlsx\"},{\"id\":88508222,\"identity\":\"0a05ae03-967c-4846-9ffd-2f687808bde5\",\"added_by\":\"auto\",\"created_at\":\"2025-08-07 
07:41:00\",\"extension\":\"pdf\",\"order_by\":2,\"title\":\"\",\"display\":\"\",\"copyAsset\":false,\"role\":\"supplement\",\"size\":1667822,\"visible\":true,\"origin\":\"\",\"legend\":\"Supplementary Information\",\"description\":\"\",\"filename\":\"SupplementaryInformation.pdf\",\"url\":\"https://assets-eu.researchsquare.com/files/rs-7019918/v1/a24b3cfd8aeba3f085f9f5f6.pdf\"}],\"financialInterests\":\"There is \\u003cb\\u003eNO\\u003c/b\\u003e Competing Interest.\",\"formattedTitle\":\"Efficient Full-Length RNA Isoform Reconstruction with ISAtools\",\"fulltext\":[{\"header\":\"Introduction\",\"content\":\"\\u003cp\\u003eLong-read RNA sequencing enables the direct capture of full-length cDNA sequences, offering a powerful platform for transcript isoform analysis across diverse biological contexts\\u003csup\\u003e\\u003cspan additionalcitationids=\\\"CR2 CR3 CR4\\\" citationid=\\\"CR1\\\" class=\\\"CitationRef\\\"\\u003e1\\u003c/span\\u003e–\\u003cspan citationid=\\\"CR5\\\" class=\\\"CitationRef\\\"\\u003e5\\u003c/span\\u003e\\u003c/sup\\u003e. Compared to short-read sequencing, long-read platforms such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) generate reads that span entire transcripts, facilitating isoform reconstruction with fewer computational assumptions\\u003csup\\u003e\\u003cspan additionalcitationids=\\\"CR7\\\" citationid=\\\"CR6\\\" class=\\\"CitationRef\\\"\\u003e6\\u003c/span\\u003e–\\u003cspan citationid=\\\"CR8\\\" class=\\\"CitationRef\\\"\\u003e8\\u003c/span\\u003e\\u003c/sup\\u003e. 
Despite these advantages, accurate isoform identification remains challenging due to cDNA fragmentation, sequencing errors and artifacts introduced during library construction\\u003csup\\u003e\\u003cspan citationid=\\\"CR6\\\" class=\\\"CitationRef\\\"\\u003e6\\u003c/span\\u003e,\\u003cspan citationid=\\\"CR9\\\" class=\\\"CitationRef\\\"\\u003e9\\u003c/span\\u003e\\u003c/sup\\u003e.\\u003c/p\\u003e\\u003cp\\u003eBenchmark studies, including those from the Long-read RNA-seq Genome Annotation Assessment Project (LRGASP), have shown that PacBio outperforms ONT in isoform reconstruction accuracy, largely due to higher sequencing fidelity\\u003csup\\u003e\\u003cspan citationid=\\\"CR1\\\" class=\\\"CitationRef\\\"\\u003e1\\u003c/span\\u003e,\\u003cspan additionalcitationids=\\\"CR11 CR12 CR13 CR14\\\" citationid=\\\"CR10\\\" class=\\\"CitationRef\\\"\\u003e10\\u003c/span\\u003e–\\u003cspan citationid=\\\"CR15\\\" class=\\\"CitationRef\\\"\\u003e15\\u003c/span\\u003e\\u003c/sup\\u003e. The introduction of the PacBio Revio system and Super Accuracy (SPRQ) chemistry further improves throughput and base-call accuracy, making it feasible to generate high-depth long-read transcriptomes at scale\\u003csup\\u003e\\u003cspan citationid=\\\"CR16\\\" class=\\\"CitationRef\\\"\\u003e16\\u003c/span\\u003e\\u003c/sup\\u003e. These advances, however, demand correspondingly robust and scalable computational frameworks to fully realize the potential of long-read technologies.\\u003c/p\\u003e\\u003cp\\u003eMultiple tools have been developed for isoform reconstruction from long-read data, each using distinct algorithmic principles. For example, IsoQuant constructs intron graphs from spliced alignments and corrects alignment errors to identify novel isoforms, particularly in high-error ONT datasets\\u003csup\\u003e\\u003cspan citationid=\\\"CR17\\\" class=\\\"CitationRef\\\"\\u003e17\\u003c/span\\u003e\\u003c/sup\\u003e. 
Bambu infers junction chains, models novel discovery rates to build context-specific references, and uses an expectation-maximization framework to quantify isoform expression\\u003csup\\u003e\\u003cspan citationid=\\\"CR18\\\" class=\\\"CitationRef\\\"\\u003e18\\u003c/span\\u003e\\u003c/sup\\u003e. Mandalorion clusters reads based on junction chains and applies consensus generation and filtering to reconstruct high-confidence isoforms\\u003csup\\u003e\\u003cspan citationid=\\\"CR19\\\" class=\\\"CitationRef\\\"\\u003e19\\u003c/span\\u003e\\u003c/sup\\u003e.\\u003c/p\\u003e\\u003cp\\u003eDespite these efforts, several challenges remain. Existing tools often depend heavily on static annotations or assume complete references, limiting their robustness across diverse transcriptomes. Moreover, accurately identifying transcription start and end sites (TSS and TES), a critical step for defining truly full-length isoforms, remains underexplored\\u003csup\\u003e\\u003cspan citationid=\\\"CR1\\\" class=\\\"CitationRef\\\"\\u003e1\\u003c/span\\u003e\\u003c/sup\\u003e. This issue is especially pressing in high-depth datasets, where scalability and boundary resolution are essential.\\u003c/p\\u003e\\u003cp\\u003eTo overcome these challenges, we developed ISAtools (Isoform Sequencing Analysis tools), a data-driven computational framework for reconstructing and quantifying full-length RNA isoforms from high-throughput long-read RNA-seq data. ISAtools introduces the Splice Site Chain (SSC), a unified text-based representation that jointly encodes transcript annotations and read alignments. 
This abstraction streamlines data preprocessing and enables efficient error correction targeting high-impact misclassifications, including incomplete splice matches (ISM), novel not in catalog (NNC), and novel in catalog (NIC) isoforms.\\u003c/p\\u003e\\u003cp\\u003eTo improve transcript boundary resolution, ISAtools incorporates a coupled transcript boundary selection model, which clusters genomic coordinates of read termini to accurately define TSS and TES\\u003csup\\u003e\\u003cspan citationid=\\\"CR20\\\" class=\\\"CitationRef\\\"\\u003e20\\u003c/span\\u003e\\u003c/sup\\u003e. This approach enhances isoform completeness by directly leveraging read-end distributions, bypassing the limitations of annotation-dependent models. By integrating SSC structure with TSS and TES localization, ISAtools enables comprehensive and annotation-flexible isoform reconstruction.\\u003c/p\\u003e\\u003cp\\u003eTo benchmark ISAtools, we established a multi-mode evaluation system and evaluated it systematically on simulated datasets, SIRV spike-ins, and real-sample sequencing data. We designed three annotation scenarios–Full Annotation Reference (FullAnno), Reduced Reference (ReduceAnno), and No Annotation (NoAnno)–to comprehensively assess performance under varying levels of annotation support and sequencing complexity. 
Evaluation results indicate that ISAtools, powered by its data-driven architecture, demonstrates superior robustness, particularly in novel isoform discovery, boundary detection, and runtime performance.\\u003c/p\\u003e\\u003cp\\u003eThese results establish ISAtools as a robust and scalable solution for full-length transcriptome analysis in long-read RNA sequencing, with broad utility for isoform-level transcriptomics.\\u003c/p\\u003e\"},{\"header\":\"Result\",\"content\":\"\\u003cp\\u003e\\u003cb\\u003eStudy design and ISAtools framework\\u003c/b\\u003e\\u003c/p\\u003e\\u003cp\\u003eTo benchmark ISAtools against leading long-read RNA-seq analysis tools, we established a comprehensive evaluation framework encompassing dataset construction, metric definition, and performance profiling across multiple analytical dimensions.\\u003c/p\\u003e\\u003cp\\u003eWe assembled datasets from both simulated and real samples. The simulated dataset included human and mouse transcriptomes across two representative tissues (brain and heart), each with two biological replicates. Subsampling at four sequencing depths (3M, 6M, 12M, and 20M reads) yielded 32 distinct datasets. 
Real-world evaluation used the publicly available PacBio Universal Human Reference RNA (UHRR) dataset, along with a synthetic SIRV (Spike-In RNA Variant) dataset serving as a complex ground truth reference (Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig1\\\" class=\\\"InternalRef\\\"\\u003e1\\u003c/span\\u003ea).\\u003c/p\\u003e\\u003cp\\u003e\\u003c/p\\u003e\\u003cp\\u003ePerformance evaluation was organized into three layers (Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig1\\\" class=\\\"InternalRef\\\"\\u003e1\\u003c/span\\u003eb): (1) transcript structure analysis, based on precision, recall, and F1-score for isoform reconstruction, novel transcript detection, and TSS/TES identification; (2) isoform quantification, assessed using Pearson and Spearman correlations, concordance correlation coefficient (CCC), and normalized root mean square error (NRMSE); and (3) computational efficiency, evaluated by runtime and memory usage under identical hardware conditions. This multi-dimensional design supports robust and reproducible comparisons across tools.\\u003c/p\\u003e\\u003cp\\u003eSSC classification based on SQANTI\\u003csup\\u003e\\u003cspan citationid=\\\"CR9\\\" class=\\\"CitationRef\\\"\\u003e9\\u003c/span\\u003e,\\u003cspan citationid=\\\"CR21\\\" class=\\\"CitationRef\\\"\\u003e21\\u003c/span\\u003e\\u003c/sup\\u003e revealed that the majority of erroneous events in simulated, SIRV, and real datasets were ISM, NIC, or NNC types (Supplementary Fig.\\u0026nbsp;1). Guided by this, we developed ISAtools, a modular pipeline optimized for reconstructing and quantifying full-length isoforms. The framework comprises six core modules (Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig1\\\" class=\\\"InternalRef\\\"\\u003e1\\u003c/span\\u003ec; see Methods for details): (1) SSC extraction, (2) SSC preprocessing, (3) SSC polishing, (4) SSC filtering, (5) TSS/TES detection, and (6) isoform quantification. 
ISAtools supports both annotation-guided and de novo modes, scales to sequencing depths from 1M to \\u0026gt; 100M reads, and accepts BAM/SAM inputs.\\u003c/p\\u003e\\u003cp\\u003eThe SSC analysis pipeline follows a cascaded, multi-stage architecture. The extraction module parses CIGAR strings to define splice junctions and donor-acceptor motifs, outputting both read-level and unique SSC-level records. The preprocessing module filters SSCs by read identity, coverage, and support. The polishing module performs gene-level aggregation, refines junctions using consensus-based correction, and removes fragmented or low-confidence structures. The filtering module targets ISM, NIC, and NNC artifacts to preserve high-confidence isoforms. When reference annotations are provided, a dedicated Rescue \\u0026amp; Filtering module reintroduces conserved low-abundance SSCs, enhancing isoform completeness.\\u003c/p\\u003e\\u003cp\\u003e\\u003cb\\u003eBenchmarking SSC detection under diverse annotation conditions\\u003c/b\\u003e\\u003c/p\\u003e\\u003cp\\u003eTo assess SSC detection, we benchmarked ISAtools against IsoQuant, Bambu, and Mandalorion across 32 simulated datasets and one SIRV dataset (Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig1\\\" class=\\\"InternalRef\\\"\\u003e1\\u003c/span\\u003ea). To emulate incomplete annotations, we generated “Reduced Annotation” versions by randomly removing 20% or 40% of transcripts while ensuring one transcript per gene remained, yielding four evaluation scenarios: NoAnno, ReduceAnno20, ReduceAnno40, and FullAnno. A total of 512 simulated and 16 SIRV test cases were analyzed.\\u003c/p\\u003e\\u003cp\\u003eISAtools consistently achieved high SSC precision and recall across depths and annotation regimes. 
Performance gains were most evident under NoAnno conditions, where the absence of reference annotations caused all tools to decline, though ISAtools remained the most stable (Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig2\\\" class=\\\"InternalRef\\\"\\u003e2\\u003c/span\\u003ea; Supplementary Fig.\\u0026nbsp;2). Human transcriptomes exhibited greater structural complexity than mouse datasets (Supplementary Tables\\u0026nbsp;1–2), amplifying performance differentials (Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig2\\\" class=\\\"InternalRef\\\"\\u003e2\\u003c/span\\u003ea; Supplementary Fig.\\u0026nbsp;2). Notably, Mandalorion showed the largest drop under NoAnno at low depth (3M reads), indicating high sensitivity to coverage.\\u003c/p\\u003e\\u003cp\\u003e\\u003c/p\\u003e\\u003cp\\u003eTo evaluate read-level support, we examined the fraction of SSCs fully backed by aligned reads. ISAtools and Mandalorion both exceeded 99.5% support across conditions (Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig2\\\" class=\\\"InternalRef\\\"\\u003e2\\u003c/span\\u003eb), whereas IsoQuant and Bambu exhibited slightly lower rates (96.0-99.3% and 98.0-99.2%, respectively).\\u003c/p\\u003e\\u003cp\\u003eWe next assessed novel isoform recovery by comparing SSC detection in the ReduceAnno20 and ReduceAnno40 conditions. ISAtools outperformed all tools across depths and annotation incompleteness, particularly in human datasets (Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig2\\\" class=\\\"InternalRef\\\"\\u003e2\\u003c/span\\u003ec; Supplementary Fig.\\u0026nbsp;3). 
F1-scores ranged from 0.650 to 0.872 for ISAtools, surpassing IsoQuant (0.638–0.840), Mandalorion (0.580–0.869), and Bambu (0.558–0.733).\\u003c/p\\u003e\\u003cp\\u003eIn the complex SIRV dataset (Supplementary Table\\u0026nbsp;3), ISAtools matched IsoQuant and Bambu under FullAnno but showed marked superiority under ReduceAnno and NoAnno conditions (Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig2\\\" class=\\\"InternalRef\\\"\\u003e2\\u003c/span\\u003ed), highlighting its robustness in unannotated contexts.\\u003c/p\\u003e\\u003cp\\u003e\\u003cb\\u003eBenchmarking TSS/TES identification performance\\u003c/b\\u003e\\u003c/p\\u003e\\u003cp\\u003eAlthough long-read sequencing has improved splice junction detection, accurate identification of TSS/TES remains less well explored. Recent findings suggest TSS-TES pairing is globally coordinated and cell-type dependent\\u003csup\\u003e\\u003cspan citationid=\\\"CR20\\\" class=\\\"CitationRef\\\"\\u003e20\\u003c/span\\u003e\\u003c/sup\\u003e, underscoring the need for flexible, data-driven models over static annotations.\\u003c/p\\u003e\\u003cp\\u003eTo address this, ISAtools employs a DBSCAN-based clustering module to jointly infer 5' and 3' transcript ends from read termini (Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig3\\\" class=\\\"InternalRef\\\"\\u003e3\\u003c/span\\u003ea). Using the SSC-anchored reads as input, we benchmarked ISAtools against IsoQuant, Bambu, and Mandalorion, generating 512 simulated and 16 SIRV evaluations.\\u003c/p\\u003e\\u003cp\\u003e\\u003c/p\\u003e\\u003cp\\u003eISAtools consistently delivered high F1-scores across annotation conditions (FullAnno, ReduceAnno, NoAnno), ranging from 0.652 to 0.979 (Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig3\\\" class=\\\"InternalRef\\\"\\u003e3\\u003c/span\\u003eb; Supplementary Fig.\\u0026nbsp;4). In mouse datasets, F1-scores ranged from 0.849 to 0.979, while human datasets yielded an average F1-score of 0.914, despite increased transcriptomic complexity. 
Mandalorion exhibited moderate performance (F1 = 0.563–0.928), with notable declines under low-depth and high-complexity conditions. By comparison, IsoQuant and Bambu showed stronger annotation dependence, with broader performance ranges (F1 = 0.597–0.987 and 0.577–0.995, respectively).\\u003c/p\\u003e\\u003cp\\u003eWe further evaluated the proportion of TSS/TES predictions supported by read termini (± 50 bp). ISAtools and Mandalorion both exceeded 99.5% concordance across conditions (Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig3\\\" class=\\\"InternalRef\\\"\\u003e3\\u003c/span\\u003ec), outperforming IsoQuant (92.9–99.1%) and Bambu (90.3–99.1%).\\u003c/p\\u003e\\u003cp\\u003eTo test annotation dependency, we extended reference TSS/TES by ± 100 bp and re-evaluated all tools. ISAtools and Mandalorion maintained stable performance, whereas IsoQuant and Bambu declined (Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig3\\\" class=\\\"InternalRef\\\"\\u003e3\\u003c/span\\u003ed), indicating potential overfitting to reference boundaries. This trend was reinforced in the SIRV dataset, where ISAtools consistently led in boundary accuracy (Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig3\\\" class=\\\"InternalRef\\\"\\u003e3\\u003c/span\\u003ee).\\u003c/p\\u003e\\u003cp\\u003e\\u003cb\\u003eBenchmarking Isoform Quantification Performance across Diverse Annotation Contexts\\u003c/b\\u003e\\u003c/p\\u003e\\u003cp\\u003eUsing the 512 simulated datasets generated from SSC-identifiable reads, we systematically evaluated transcript quantification accuracy across four tools: ISAtools, IsoQuant, Bambu, and Mandalorion. 
We employed multiple metrics, including Pearson correlation, Spearman correlation, Concordance Correlation Coefficient (CCC), and Normalized Root Mean Square Error (NRMSE), to assess both the accuracy and robustness of quantification across varying transcriptomic complexities and annotation regimes.\\u003c/p\\u003e\\u003cp\\u003eAs shown in Figs.\\u0026nbsp;\\u003cspan refid=\\\"Fig4\\\" class=\\\"InternalRef\\\"\\u003e4\\u003c/span\\u003ea-b, ISAtools consistently achieved high quantification accuracy across all conditions. CCC values ranged from 0.986 to 0.994, while NRMSE remained low (0.041–0.085), indicating minimal estimation bias and high consistency. IsoQuant performed comparably under both full- and no-annotation conditions (CCC: 0.985–0.995; NRMSE: 0.045–0.084), but exhibited a slight performance drop under reduced annotations (CCC: 0.985–0.993; NRMSE: 0.053–0.095), although still outperforming Mandalorion and Bambu in most cases.\\u003c/p\\u003e\\u003cp\\u003e\\u003c/p\\u003e\\u003cp\\u003eMandalorion displayed strong trend-tracking ability (Pearson = 0.991–0.996) but slightly lower CCC values (0.979–0.991), suggesting reduced concordance in absolute expression estimates. Bambu showed moderate performance under full and no annotation (CCC: 0.985–0.995; NRMSE: 0.039–0.091), but quantification degraded under reduced annotation (CCC: 0.971–0.984; NRMSE: 0.065–0.153), indicating greater annotation dependence.\\u003c/p\\u003e\\u003cp\\u003eIn correlation analysis, Mandalorion achieved the highest Pearson and Spearman correlations (Pearson: 0.991–0.996; Spearman: 0.979–0.991), followed closely by ISAtools (Pearson: 0.988–0.995; Spearman: 0.970–0.993) and IsoQuant (Pearson: 0.985–0.995; Spearman: 0.977–0.995). 
Bambu showed greater variability (Pearson: 0.971–0.995; Spearman: 0.962–0.993), especially under incomplete annotations.\\u003c/p\\u003e\\u003cp\\u003e\\u003cb\\u003eIsoform reconstruction performance on real biological samples\\u003c/b\\u003e\\u003c/p\\u003e\\u003cp\\u003eTo assess ISAtools under realistic conditions, we benchmarked performance on the PacBio UHRR dataset, which includes six technical replicates sequenced using the Kinnex full-length protocol and Revio SPRQ chemistry, producing ~ 80\\u0026nbsp;million high-quality cDNA reads. Given the absence of ground-truth isoforms, we evaluated: (1) concordance with GENCODE annotations, (2) empirical read support, and (3) boundary-level accuracy.\\u003c/p\\u003e\\u003cp\\u003eFirst, we performed comparative analysis of isoforms detected by each tool against GENCODE annotation using SQANTI3\\u003csup\\u003e21\\u003c/sup\\u003e. Under the NoAnno condition, ISAtools detected a mean of 69,504 isoforms per sample, with 46.8% classified as FSM (full splice match). Mandalorion recovered a similar number (70,359), with a higher FSM proportion (53.3%). IsoQuant detected fewer isoforms (60,200) and a lower FSM rate (37.7%), while Bambu produced the fewest isoforms (47,364) with an FSM rate of 48.7% (Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig5\\\" class=\\\"InternalRef\\\"\\u003e5\\u003c/span\\u003ea; Supplementary Fig.\\u0026nbsp;5).\\u003c/p\\u003e\\u003cp\\u003e\\u003c/p\\u003e\\u003cp\\u003eUnder the FullAnno condition, ISAtools identified 99,537 isoforms, with FSM rising to 67.3%. Mandalorion reported 73,251 isoforms (FSM: 53.8%), and IsoQuant 75,766 (FSM: 51.8%). 
Bambu, though detecting 88,961 isoforms, showed extreme annotation dependence, with 99.6% classified as FSM.\\u003c/p\\u003e\\u003cp\\u003eWe next evaluated empirical support using two criteria: (1) SSC support, defined as having at least one read perfectly matching the isoform's SSC, and (2) SSC + ends (SSC + E) support, requiring in addition that read alignment start and end sites be within 50 nt of the isoform's transcript boundaries. ISAtools and Mandalorion maintained \\u0026gt; 99.8% and \\u0026gt; 97.5% SSC + E support, respectively. IsoQuant showed 80.6–84.3% SSC support, but only 28.4–46.6% SSC + E support. Bambu's SSC + E support varied widely (28.7–76.0%), with a paradoxical drop upon adding reference annotation, indicating potential boundary overfitting (Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig5\\\" class=\\\"InternalRef\\\"\\u003e5\\u003c/span\\u003eb).\\u003c/p\\u003e\\u003cp\\u003eWe also measured SSC consistency between NoAnno and FullAnno conditions. Mandalorion showed the highest cross-annotation consistency (~90%). ISAtools and IsoQuant followed (61% and 57%), while Bambu showed only 19% overlap (Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig5\\\" class=\\\"InternalRef\\\"\\u003e5\\u003c/span\\u003ec; Supplementary Fig.\\u0026nbsp;6).\\u003c/p\\u003e\\u003cp\\u003eFurther analysis revealed distinct detection patterns (Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig5\\\" class=\\\"InternalRef\\\"\\u003e5\\u003c/span\\u003ed; Supplementary Fig.\\u0026nbsp;7). ISAtools showed balanced annotation-specific sensitivity, with 4.6% and 34.3% of unique SSCs exclusive to NoAnno and FullAnno, respectively, reflecting adaptability to both novel and reference-supported isoforms. IsoQuant followed a similar trend, though skewed more toward annotation reliance (12.4% NoAnno-exclusive; 30.6% FullAnno-exclusive). Bambu exhibited the highest annotation sensitivity, with 22.6% and 58% of unique SSCs identified exclusively under NoAnno and FullAnno, respectively. 
Mandalorion, consistent with its conservative detection strategy, showed the lowest proportion of annotation-exclusive SSCs (2.6% NoAnno-exclusive; 7.1% FullAnno-exclusive), indicating minimal dependence on reference annotations but also reduced sensitivity to novel isoforms.\\u003c/p\\u003e\\u003cp\\u003eFor TSS/TES validation, we compared predicted sites against known sites from the Reference Transcription Starting Sites\\u003csup\\u003e\\u003cspan citationid=\\\"CR22\\\" class=\\\"CitationRef\\\"\\u003e22\\u003c/span\\u003e\\u003c/sup\\u003e (refTSS, v3.1) database and the polyASite\\u003csup\\u003e\\u003cspan citationid=\\\"CR23\\\" class=\\\"CitationRef\\\"\\u003e23\\u003c/span\\u003e\\u003c/sup\\u003e database (v2.0), using a matching threshold of 50 nucleotides. ISAtools outperformed all other tools under NoAnno, with TSS and TES match rates of 82.3% and 80.3%, respectively (Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig5\\\" class=\\\"InternalRef\\\"\\u003e5\\u003c/span\\u003ee). Mandalorion followed closely (81.1%, 80.2%), while Bambu (62.5%, 69.7%) and IsoQuant (49.1%, 54.5%) lagged behind. Under FullAnno, ISAtools' accuracy slightly declined (TSS: 76.1%, TES: 78.0%), likely due to rescuing low-abundance FSM transcripts. Mandalorion (82.1%, 80.7%) remained stable, while IsoQuant (65.5%, 64.9%) and Bambu (67.1%, 63.3%) retained lower matching rates (Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig5\\\" class=\\\"InternalRef\\\"\\u003e5\\u003c/span\\u003ee). Analysis of SSCs with multiple TSS/TES revealed that ISAtools better resolved complex boundaries, outperforming Mandalorion in TSS precision among multi-terminal isoforms (Supplementary Fig.\\u0026nbsp;8).\\u003c/p\\u003e\\u003cp\\u003e\\u003cb\\u003eComparison of Computational Efficiency\\u003c/b\\u003e\\u003c/p\\u003e\\u003cp\\u003eWe compared the runtime and memory usage of ISAtools, IsoQuant, Bambu, and Mandalorion across all simulated datasets. 
Since Mandalorion operates on unaligned reads, only its PDFQ module was evaluated for fair comparison.\\u003c/p\\u003e\\u003cp\\u003eAs shown in Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig6\\\" class=\\\"InternalRef\\\"\\u003e6\\u003c/span\\u003ea-b, ISAtools consistently achieved the fastest runtime and lowest memory usage across all conditions. Bambu generally outperformed IsoQuant, though IsoQuant showed a notable runtime increase under FullAnno due to annotation database construction overhead. Mandalorion incurred the highest computational cost, with the longest runtimes and largest memory footprint, particularly when analyzing human datasets under NoAnno conditions.\\u003c/p\\u003e\\u003cp\\u003e\\u003c/p\\u003e\\u003cp\\u003eIn real-data analysis of the UHRR dataset (~ 80\\u0026nbsp;million reads), ISAtools maintained superior performance, completing isoform reconstruction in ~ 17 minutes under NoAnno and ~ 21 minutes under FullAnno, with peak memory usage of 5.6 GB and 6.2 GB, respectively (Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig6\\\" class=\\\"InternalRef\\\"\\u003e6\\u003c/span\\u003ec-d).\\u003c/p\\u003e\\u003cp\\u003eWe also evaluated disk I/O overhead (Supplementary Fig.\\u0026nbsp;9). Bambu showed the lowest read/write demands, whereas Mandalorion’s disk usage was ~ 30-fold higher than other tools (Supplementary Fig.\\u0026nbsp;10), posing limitations for large-scale applications.\\u003c/p\\u003e\"},{\"header\":\"Discussion\",\"content\":\"\\u003cp\\u003eLong-read RNA sequencing has enabled direct observation of full-length transcript isoforms, offering unprecedented resolution in transcriptome analysis\\u003csup\\u003e\\u003cspan citationid=\\\"CR6\\\" class=\\\"CitationRef\\\"\\u003e6\\u003c/span\\u003e,\\u003cspan citationid=\\\"CR24\\\" class=\\\"CitationRef\\\"\\u003e24\\u003c/span\\u003e\\u003c/sup\\u003e. 
However, its full potential is limited by sequencing errors, structural misannotations, poor transcript boundary resolution, and computational inefficiencies when processing high-depth or annotation-limited datasets\\u003csup\\u003e\\u003cspan citationid=\\\"CR4\\\" class=\\\"CitationRef\\\"\\u003e4\\u003c/span\\u003e,\\u003cspan citationid=\\\"CR9\\\" class=\\\"CitationRef\\\"\\u003e9\\u003c/span\\u003e\\u003c/sup\\u003e. These challenges particularly hinder accurate isoform reconstruction in complex transcriptomes and non-model systems. To address these limitations, we developed ISAtools, a data-driven and annotation-flexible computational framework designed to reconstruct and quantify full-length RNA isoforms with high accuracy, biological credibility, and computational efficiency (Fig.\\u0026nbsp;\\u003cspan refid=\\\"Fig7\\\" class=\\\"InternalRef\\\"\\u003e7\\u003c/span\\u003e).\\u003c/p\\u003e\\u003cp\\u003e\\u003c/p\\u003e\\u003cp\\u003eISAtools introduces a compact and expressive representation of transcript structures through SSC, unifying raw read alignments and reference annotations into a standardized text-based format. This SSC framework supports highly scalable processing and enables annotation-agnostic isoform reconstruction. By incorporating modular correction algorithms tailored to systematic isoform classification errors—namely ISM, NIC, and NNC—ISAtools significantly improves structural precision while retaining sensitivity to novel isoforms.\\u003c/p\\u003e\\u003cp\\u003eA key innovation of ISAtools is its TSS/TES detection strategy based on density-based clustering of read termini, guided by the theory of coupled transcript boundary selection. This approach provides robust identification of full-length transcript boundaries directly from read data, overcoming the limitations of static annotations and improving alignment with experimentally validated reference datasets. 
Furthermore, ISAtools employs a truncation-aware quantification model that reassigns read support from partial transcripts to corresponding full-length isoforms, enhancing quantification accuracy across a wide range of sequencing depths.\\u003c/p\\u003e\\u003cp\\u003eComprehensive benchmarking on simulated, SIRV spike-in, and real biological datasets demonstrates that ISAtools consistently outperforms state-of-the-art methods in isoform discovery, transcript boundary detection, and computational efficiency. Notably, it maintains high accuracy even under severely reduced annotation conditions, and delivers stable performance across transcriptomic complexities and read depths. These results highlight its robustness and broad applicability in transcriptomic studies across model and non-model organisms.\\u003c/p\\u003e\\u003cp\\u003eISAtools also enables biologically credible transcript identification. The majority of its predicted isoforms are strongly supported at both the splice junction and transcript boundary levels. When compared to experimentally defined TSS and TES datasets, ISAtools achieves higher concordance than reference-dependent tools, suggesting its utility in discovering sample-specific isoforms or tissue-specific transcriptional regulation. The framework’s modular architecture and lightweight data representation make it readily extensible to large-scale transcriptome studies, including those involving single-cell, multi-tissue, or pan-species comparisons.\\u003c/p\\u003e\\u003cp\\u003eDespite its strengths, ISAtools has some limitations. Residual errors may persist in regions with extreme splicing complexity, dense repeat elements, or high fragmentation. Further improvements could include enhanced compatibility with Oxford Nanopore datasets, incorporation of additional biological priors such as epigenetic signals or CAGE/polyA-seq peaks, and extension to noisy or sparse single-cell long-read data. 
Additionally, more sophisticated modeling of exon connectivity and read error profiles could further improve its performance in challenging transcriptome contexts.\\u003c/p\\u003e\\u003cp\\u003eISAtools provides an accurate and scalable solution for full-length RNA isoform reconstruction from long-read RNA-seq data. Its consistent performance across varying levels of annotation completeness and transcriptomic complexity underscores its potential as a generalizable framework for isoform-level analysis. By reducing reliance on static reference annotations while maintaining biological fidelity, ISAtools enables high-resolution transcriptome profiling across diverse biological systems, including complex human tissues and non-model organisms.\\u003c/p\\u003e\"},{\"header\":\"Methods\",\"content\":\"\\u003cp\\u003e\\u003cstrong\\u003eData simulation\\u003c/strong\\u003e\\u003c/p\\u003e\\n\\u003cp\\u003eWe used IsoSeqSim to generate simulated PacBio long-read RNA-seq data with a total error rate of 1.6%, comprising 0.4% substitutions, 0.6% deletions, and 0.6% insertions. Simulations were conducted at four sequencing depths: 3M, 6M, 12M, and 20M reads.\\u003c/p\\u003e\\n\\u003cp\\u003eTo model realistic transcript expression distributions, we first applied the quantify.py script from the LRGASP simulation pipeline to estimate transcript-level expression from eight real human and mouse RNA-seq samples. These quantifications were then used to sample transcript abundance for simulation.\\u003c/p\\u003e\\n\\u003cp\\u003eBased on the sampled profiles, we selected subsets of transcripts from GENCODE v38 (human) and GENCODE vM27 (mouse) to construct simulation-specific reference annotations. These customized annotations were used both to guide read simulation and to serve as ground truth for downstream benchmarking. 
The resulting datasets reflect biologically informed expression profiles and enable systematic evaluation of tool performance across varying sequencing depths.\\u003c/p\\u003e\\n\\u003cp\\u003e\\u003cstrong\\u003eIsoform identification evaluation\\u003c/strong\\u003e\\u003c/p\\u003e\\n\\u003cp\\u003eAll predicted isoforms with non-zero expression were included in the isoform identification evaluation. To ensure consistent comparisons, all predictions and reference annotations were converted into SSC format. Evaluation was conducted at both the SSC level and the transcript boundary level. SSC-level accuracy was quantified using precision, recall, and F1-score, based on exact matches between predicted and ground-truth SSCs. TSS and TES were evaluated by comparing the 5\\u0026rsquo; and 3\\u0026rsquo; coordinates of each predicted isoform to the corresponding sites in the reference; predictions were considered correct if both ends were within \\u0026plusmn;\\u0026thinsp;50 bp of the annotated coordinates.\\u003c/p\\u003e\\n\\u003cp\\u003eTo assess biological support, we further evaluated whether SSCs were directly supported by read alignments. An SSC was defined as supported if at least one full-length read exactly matched its splice junction pattern. A TSS/TES prediction was considered supported if, in addition to SSC support, its transcript end fell within \\u0026plusmn;\\u0026thinsp;50 bp of the aligned read boundary.\\u003c/p\\u003e\\n\\u003cp\\u003eNovel SSC detection was evaluated using 20% and 40% reduced annotation datasets in which a subset of transcripts was randomly removed while retaining at least one isoform per gene. Removed transcripts served as the ground truth for novel isoforms. 
Predicted isoforms labeled as \\u0026ldquo;novel\\u0026rdquo; were compared against this reference set using the same SSC-level evaluation metrics, enabling systematic assessment of each tool\\u0026rsquo;s sensitivity and precision for novel transcript detection.\\u003c/p\\u003e\\n\\u003cp\\u003eFor real biological datasets, where global ground truth is unavailable, we used SQANTI3 to classify predicted isoforms and evaluate their concordance with GENCODE annotations. TSS and TES predictions were further validated against experimentally derived sites from the refTSS v3.1 and polyASite v2.0 databases, using a \\u0026plusmn;\\u0026thinsp;50 bp window as the matching threshold.\\u003c/p\\u003e\\n\\u003cp\\u003e\\u003cstrong\\u003eIsoform quantification evaluation\\u003c/strong\\u003e\\u003c/p\\u003e\\n\\u003cp\\u003eFor the evaluation of quantification performance, only isoforms with non-zero predicted expression and an SSC present in the ground truth were considered. For simulated datasets, transcript-level ground truth was defined by the simulation-specific read counts used to generate the data.\\u003c/p\\u003e\\n\\u003cp\\u003eAt the isoform level, we assessed expression concordance between predicted and true values using four standard metrics: Pearson\\u0026rsquo;s correlation coefficient (r), Spearman\\u0026rsquo;s rank correlation coefficient (\\u0026rho;), concordance correlation coefficient (CCC), and normalized root mean square error (NRMSE). 
Let \\\\(x_{i}\\\\) and \\\\(y_{i}\\\\) denote the predicted and ground-truth counts for the \\\\(i^{\\\\text{th}}\\\\) isoform:\\u003c/p\\u003e\\n\\u003cdiv id=\\\"Equa\\\"\\u003e\\n \\u003cdiv id=\\\"FileID_Equa\\\" name=\\\"EquationSource\\\"\\u003e$$r=\\\\frac{\\\\sum_{i}\\\\left(x_{i}-\\\\bar{x}\\\\right)\\\\left(y_{i}-\\\\bar{y}\\\\right)}{\\\\sqrt{\\\\sum_{i}\\\\left(x_{i}-\\\\bar{x}\\\\right)^{2}}\\\\cdot\\\\sqrt{\\\\sum_{i}\\\\left(y_{i}-\\\\bar{y}\\\\right)^{2}}}$$\\u003c/div\\u003e\\n\\u003c/div\\u003e\\n\\u003cp\\u003ePearson\\u0026rsquo;s r ranges from \\u0026minus;\\u0026thinsp;1 to 1, with 1 indicating perfect linear correlation.\\u003c/p\\u003e\\n\\u003cdiv id=\\\"Equb\\\"\\u003e\\n \\u003cdiv id=\\\"FileID_Equb\\\" name=\\\"EquationSource\\\"\\u003e$$\\\\rho=1-\\\\frac{6\\\\sum_{i}d_{i}^{2}}{n\\\\left(n^{2}-1\\\\right)}$$\\u003c/div\\u003e\\n\\u003c/div\\u003e\\n\\u003cp\\u003ewhere \\\\(d_{i}\\\\) is the rank difference between \\\\(x_{i}\\\\) and \\\\(y_{i}\\\\) and \\\\(n\\\\) is the number of isoforms. Spearman\\u0026rsquo;s \\\\(\\\\rho\\\\) evaluates monotonic relationships and is robust to outliers; \\\\(\\\\rho=1\\\\) indicates perfect rank correlation.\\u003c/p\\u003e\\n\\u003cdiv id=\\\"Equc\\\"\\u003e\\n \\u003cdiv id=\\\"FileID_Equc\\\" name=\\\"EquationSource\\\"\\u003e$$\\\\text{CCC}=\\\\frac{2r\\\\sigma_{x}\\\\sigma_{y}}{\\\\sigma_{x}^{2}+\\\\sigma_{y}^{2}+\\\\left(\\\\bar{x}-\\\\bar{y}\\\\right)^{2}}$$\\u003c/div\\u003e\\n\\u003c/div\\u003e\\n\\u003cp\\u003ewhere \\\\(\\\\sigma_{x}\\\\) and \\\\(\\\\sigma_{y}\\\\) are the standard deviations of the predicted and true values, respectively. 
CCC combines correlation and agreement, and equals 1 when predictions exactly match the ground truth in both scale and location.\\u003c/p\\u003e\\n\\u003cdiv id=\\\"Equd\\\"\\u003e\\n \\u003cdiv id=\\\"FileID_Equd\\\" name=\\\"EquationSource\\\"\\u003e$$\\\\text{NRMSE}=\\\\frac{\\\\sqrt{\\\\frac{1}{n}\\\\sum_{i}\\\\left(x_{i}-y_{i}\\\\right)^{2}}}{\\\\max\\\\left(y\\\\right)-\\\\min\\\\left(y\\\\right)}$$\\u003c/div\\u003e\\n\\u003c/div\\u003e\\n\\u003cp\\u003eNRMSE measures normalized estimation error; a value of 0 indicates perfect agreement with the ground truth.\\u003c/p\\u003e\\n\\u003cp\\u003e\\u003cstrong\\u003eISAtools algorithm\\u003c/strong\\u003e\\u003c/p\\u003e\\n\\u003cp\\u003eISAtools is a computational framework for full-length isoform reconstruction and quantification from long-read RNA-seq data. It requires a reference genome and splice-aware alignments (BAM/SAM), with an optional gene annotation. The pipeline consists of six main modules: (1) extraction of splice site chains (SSC), (2) read preprocessing, (3) SSC grouping and polishing, (4) SSC filtering and correction, (5) TSS/TES identification, and (6) isoform quantification. Below, we describe the key aspects of all six procedures.\\u003c/p\\u003e\\n\\u003cp\\u003e\\u003cstrong\\u003eSSC extraction.\\u003c/strong\\u003e For each aligned read, ISAtools parses the CIGAR string to extract spliced segments and represent them as an ordered list of splice sites\\u0026mdash;defined as SSC. 
In parallel, alignment identity and coverage are computed: Identity is calculated as\\u003c/p\\u003e\\n\\u003cdiv id=\\\"Eque\\\"\\u003e\\n \\u003cdiv id=\\\"FileID_Eque\\\" name=\\\"EquationSource\\\"\\u003e$$\\\\text{Identity}=1-\\\\frac{N_{\\\\text{mismatch}}}{N_{\\\\text{aligned}}}$$\\u003c/div\\u003e\\n\\u003c/div\\u003e\\n\\u003cp\\u003ewhere \\\\(N_{\\\\text{mismatch}}\\\\) denotes the number of mismatched bases and \\\\(N_{\\\\text{aligned}}\\\\) includes matched and inserted bases. Coverage is computed as\\u003c/p\\u003e\\n\\u003cdiv id=\\\"Equf\\\"\\u003e\\n \\u003cdiv id=\\\"FileID_Equf\\\" name=\\\"EquationSource\\\"\\u003e$$\\\\text{Coverage}=\\\\frac{L-\\\\left(N_{S}+N_{H}\\\\right)}{L}$$\\u003c/div\\u003e\\n\\u003c/div\\u003e\\n\\u003cp\\u003ewhere \\\\(L\\\\) is the full read length and \\\\(N_{S}\\\\), \\\\(N_{H}\\\\) are the soft- and hard-clipped bases, respectively.\\u003c/p\\u003e\\n\\u003cp\\u003eReads with identical SSCs are grouped, and their splice donor/acceptor motifs are retrieved from the reference genome. ISAtools outputs two tab-delimited SSC files: one at the read level (with alignment metrics), and one at the unique SSC level (with aggregated frequency and sequence features). 
Reference annotations, if available, are converted to the same format to enable downstream comparison and integration.\\u003c/p\\u003e\\n\\u003cp\\u003e\\u003cstrong\\u003eSSC Polishing.\\u003c/strong\\u003e To reduce splicing noise and improve junction precision, ISAtools performs SSC-level polishing within each transcript group. A group is defined as a set of SSCs sharing the same start and end splice sites (i.e., the 5\\u0026rsquo; splice site of the first exon and the 3\\u0026rsquo; splice site of the last exon), representing a common transcriptional locus.\\u003c/p\\u003e\\n\\u003cp\\u003eWithin each group, SSCs containing non-canonical splice motifs (i.e., not GT-AG, GC-AG, or AT-AC) are discarded if their read support accounts for \\u0026lt;\\u0026thinsp;25% of the group\\u0026rsquo;s total reads, as such junctions are likely spurious.\\u003c/p\\u003e\\n\\u003cp\\u003eNext, a consensus-based splice site refinement is applied. For each splice site cluster within \\u0026plusmn;\\u0026thinsp;k bp (default k\\u0026thinsp;=\\u0026thinsp;10), ISAtools calculates the support frequency \\\\(\\\\:{\\\\text{f}}_{\\\\text{i}}\\\\) for each site \\\\(\\\\:\\\\text{i}\\\\), and retains only sites satisfying:\\u003c/p\\u003e\\n\\u003cdiv id=\\\"Equg\\\"\\u003e\\n \\u003cdiv id=\\\"FileID_Equg\\\" name=\\\"EquationSource\\\"\\u003e$$\\\\:\\\\frac{{f}_{i}}{{f}_{max}}\\\\ge\\\\:\\\\theta\\\\:$$\\u003c/div\\u003e\\n\\u003c/div\\u003e\\n\\u003cp\\u003ewhere \\\\(\\\\:{f}_{max}\\\\) is the highest frequency in the cluster and \\\\(\\\\:\\\\theta\\\\:\\\\) is a user-defined threshold (default: 0.1). 
SSCs containing low-support splice sites are filtered, improving both accuracy and consistency in splicing structure.\\u003c/p\\u003e\\n\\u003cp\\u003e\\u003cstrong\\u003eNIC/NNC Filtering and Correction.\\u003c/strong\\u003e To reduce false-positive classification of novel isoforms caused by minor junction mismatches, ISAtools constructs a local splice graph within each SSC group, defined as all SSCs sharing overlapping start and end splice sites (\\u0026plusmn;\\u0026thinsp;100 bp). Each splice site is treated as a node, and observed splice junctions are added as directed edges, weighted by relative read support:\\u003c/p\\u003e\\n\\u003cdiv id=\\\"Equh\\\"\\u003e\\n \\u003cdiv id=\\\"FileID_Equh\\\" name=\\\"EquationSource\\\"\\u003e$$\\\\:{\\\\omega\\\\:}_{i\\\\to\\\\:j}=\\\\frac{{f}_{i\\\\to\\\\:j}}{\\\\sum\\\\:f}$$\\u003c/div\\u003e\\n\\u003c/div\\u003e\\n\\u003cp\\u003ewhere \\\\(\\\\:{f}_{i\\\\to\\\\:j}\\\\) is the number of reads supporting the junction from site \\\\(\\\\:i\\\\) to site \\\\(\\\\:j\\\\), and \\\\(\\\\:\\\\sum\\\\:f\\\\) is the total support across the group.\\u003c/p\\u003e\\n\\u003cp\\u003eFour common error types are corrected by traversing weakly connected subgraphs: (1) Exon shift mismatch: If two SSCs differ only by a small, same-direction shift at one or more splice junctions, and their transcript lengths fall within a preset range, the minor variant is corrected to match the dominant (higher support) path. (2) Single splice site errors: Splice sites located within 10 bp are clustered; a low-frequency site is removed if\\u003c/p\\u003e\\n\\u003cdiv id=\\\"Equi\\\"\\u003e\\n \\u003cdiv id=\\\"FileID_Equi\\\" name=\\\"EquationSource\\\"\\u003e$$\\\\:{f}_{low}\\\\le\\\\:\\\\alpha\\\\:\\\\bullet\\\\:{f}_{high}$$\\u003c/div\\u003e\\n\\u003c/div\\u003e\\n\\u003cp\\u003ewhere \\\\(\\\\:\\\\alpha\\\\:\\\\) is a frequency ratio threshold (default: 0.01). 
(3) Skipped microexon correction: If an exon of \\u0026le;\\u0026thinsp;30 bp is absent in one SSC but fully supported in another within the same group, the exon is restored in the truncated path. (4) Small exon mismatch correction: When SSCs differ by \\u0026le;\\u0026thinsp;10 bp in internal exon length, the lower-frequency variant is adjusted to match the high-confidence junction configuration.\\u003c/p\\u003e\\n\\u003cp\\u003eThis graph-based correction improves the precision of novel isoform classification by removing artifacts caused by alignment shifts, low-frequency junctions, and annotation inconsistencies in small exons.\\u003c/p\\u003e\\n\\u003cp\\u003e\\u003cstrong\\u003eISM Filtering.\\u003c/strong\\u003e To suppress false positives arising from transcript truncation, ISAtools implements a frequency-based strategy to identify and filter SSCs classified as ISM. Within each transcript group, we examine whether a given SSC is a strict subset of a longer SSC, i.e., it shares internal splice sites but lacks the 5\\u0026rsquo; or 3\\u0026rsquo; end. If so, the longer SSC is treated as a candidate full-length source transcript, and a relative frequency score is computed as:\\u003c/p\\u003e\\n\\u003cdiv id=\\\"Equj\\\"\\u003e\\n \\u003cdiv id=\\\"FileID_Equj\\\" name=\\\"EquationSource\\\"\\u003e$$\\\\:\\\\varDelta\\\\:f={\\\\text{log}}_{10}\\\\left({f}_{source}+\\\\epsilon\\\\:\\\\right)-{\\\\text{log}}_{10}\\\\left({f}_{ISM}+\\\\epsilon\\\\:\\\\right)$$\\u003c/div\\u003e\\n\\u003c/div\\u003e\\n\\u003cp\\u003ewhere \\\\(\\\\:{f}_{ISM}\\\\) and \\\\(\\\\:{f}_{source}\\\\) denote the read support of the truncated and full-length SSCs, respectively, and \\\\(\\\\:{\\\\epsilon\\\\:}\\\\) is a small pseudo-count that avoids taking the logarithm of zero. 
A truncated SSC is filtered if:\\u003c/p\\u003e\\n\\u003cdiv id=\\\"Equk\\\"\\u003e\\n \\u003cdiv id=\\\"FileID_Equk\\\" name=\\\"EquationSource\\\"\\u003e$$\\\\:\\\\varDelta\\\\:f\\u0026gt;\\\\tau\\\\:$$\\u003c/div\\u003e\\n\\u003c/div\\u003e\\n\\u003cp\\u003ewhere \\\\(\\\\:{\\\\tau\\\\:}\\\\) is a user-defined threshold (default: 0, in which case a truncated SSC is removed whenever its source SSC has more supporting reads). This conservative filtering approach removes lowly expressed fragments nested within dominant isoforms, improving the specificity of full-length isoform reconstruction.\\u003c/p\\u003e\\n\\u003cp\\u003e\\u003cstrong\\u003eTSS/TES identification.\\u003c/strong\\u003e To resolve TSS and TES with high precision, ISAtools performs read-end clustering within each unique SSC. Given that reads supporting the same splice structure may differ in their 5\\u0026rsquo; or 3\\u0026rsquo; termini due to biological variability or incomplete capture, we use a density-based clustering algorithm (DBSCAN) to identify consensus transcript boundaries.\\u003c/p\\u003e\\n\\u003cp\\u003eFor each SSC, ISAtools clusters the aligned 5\\u0026rsquo; start and 3\\u0026rsquo; end coordinates of all supporting reads. The mode of each cluster is used to define candidate TSS and TES positions. If only a single cluster is detected at each end, the corresponding positions are directly assigned as transcript boundaries.\\u003c/p\\u003e\\n\\u003cp\\u003eWhen multiple TSS\\u0026ndash;TES combinations are observed, ISAtools reassigns read support based on observed pairing frequency. 
The adjusted support for a given (TSS, TES) pair is computed as:\\u003c/p\\u003e\\n\\u003cdiv id=\\\"Equl\\\"\\u003e\\n \\u003cdiv id=\\\"FileID_Equl\\\" name=\\\"EquationSource\\\"\\u003e$$\\\\:{{f}^{{\\\\prime\\\\:}}}_{\\\\left(TSS,TES\\\\right)}={f}_{SSC}\\\\bullet\\\\:\\\\frac{{n}_{\\\\left(TSS,TES\\\\right)}}{\\\\sum\\\\:{n}_{\\\\left(TSS,TES\\\\right)}}$$\\u003c/div\\u003e\\n\\u003c/div\\u003e\\n\\u003cp\\u003ewhere \\\\(\\\\:{f}_{SSC}\\\\) is the total read support for the SSC, and \\\\(\\\\:{n}_{\\\\left(TSS,TES\\\\right)}\\\\) is the number of reads exactly matching that boundary pair.\\u003c/p\\u003e\\n\\u003cp\\u003eIf no observed read directly supports a clustered TSS-TES pair, ISAtools applies a conservative fallback: it selects the most upstream TSS and most downstream TES from the clustered candidates, ensuring that truncated isoforms are not over-inferred due to sparse or noisy boundary signals.\\u003c/p\\u003e\\n\\u003cp\\u003e\\u003cstrong\\u003eIsoform quantification.\\u003c/strong\\u003e To quantify transcript expression, ISAtools assigns raw read counts to each SSC based on the number of full-length reads that exactly match its splice structure and transcript boundaries. To correct for truncated isoforms misclassified as independent transcripts, a truncation-aware reallocation strategy is applied.\\u003c/p\\u003e\\n\\u003cp\\u003eFor each truncated SSC, ISAtools identifies candidate full-length isoforms that contain the truncated structure as a strict subset. The observed frequency of the truncated SSC is redistributed to these full-length isoforms in proportion to their relative abundance. 
The adjusted expression level of transcript \\\\(\\\\:{\\\\rm\\\\:T}\\\\) is computed as:\\u003c/p\\u003e\\n\\u003cdiv id=\\\"Equm\\\"\\u003e\\n \\u003cdiv id=\\\"FileID_Equm\\\" name=\\\"EquationSource\\\"\\u003e$$\\\\:{Q}_{T}=round\\\\left({f}_{T}+\\\\lambda\\\\:\\\\bullet\\\\:\\\\sum\\\\:_{s\\\\in\\\\:S}{f}_{s}\\\\bullet\\\\:\\\\frac{{f}_{T}}{{\\\\sum\\\\:}_{{T}^{{\\\\prime\\\\:}}\\\\in\\\\:\\\\mathcal{T}}{f}_{{T}^{{\\\\prime\\\\:}}}}\\\\right)$$\\u003c/div\\u003e\\n\\u003c/div\\u003e\\n\\u003cp\\u003ewhere \\\\(\\\\:{f}_{T}\\\\) is the original read support for transcript \\\\(\\\\:T\\\\), \\\\(\\\\:S\\\\) is the set of truncated SSCs associated with \\\\(\\\\:T\\\\), \\\\(\\\\:{f}_{s}\\\\) is the read support for truncated SSC \\\\(\\\\:s\\\\),\\\\(\\\\:\\\\:\\\\mathcal{T}\\\\) is the set of full-length isoforms compatible with \\\\(\\\\:s\\\\), \\\\(\\\\:\\\\lambda\\\\:\\\\) is a weighting parameter (default: 1), and \\\\(\\\\:round(\\\\cdot\\\\:)\\\\) ensures integer-valued expression estimates.\\u003c/p\\u003e\\n\\u003cp\\u003e\\u003cstrong\\u003eReference-guided rescue and filtering.\\u003c/strong\\u003e When gene annotations are available, ISAtools optionally applies a rescue and filtering module to refine SSC-based isoform predictions by integrating static reference information. This step improves the retention of low-abundance isoforms that align with known annotations, while filtering unsupported or truncated transcripts.\\u003c/p\\u003e\\n\\u003cp\\u003eFirst, all predicted and reference isoforms are merged and clustered by shared genomic coordinates (chromosome, strand, and SSC structure). 
For each cluster \\\\(\\\\:G\\\\), the total expression level is calculated as:\\u003c/p\\u003e\\n\\u003cdiv id=\\\"Equn\\\"\\u003e\\n \\u003cdiv id=\\\"FileID_Equn\\\" name=\\\"EquationSource\\\"\\u003e$$\\\\:{S}_{\\\\text{group}}=\\\\sum\\\\:_{i\\\\in\\\\:G}{f}_{i}$$\\u003c/div\\u003e\\n\\u003c/div\\u003e\\n\\u003cp\\u003eand a binary indicator is assigned to mark whether any isoform in the cluster matches a reference:\\u003c/p\\u003e\\n\\u003cdiv id=\\\"Equo\\\"\\u003e\\n \\u003cdiv id=\\\"FileID_Equo\\\" name=\\\"EquationSource\\\"\\u003e$$\\\\:{R}_{\\\\text{group}}=1\\\\left\\\\{\\\\exists\\\\:i\\\\in\\\\:G:i\\\\in\\\\:\\\\text{Reference}\\\\right\\\\}$$\\u003c/div\\u003e\\n\\u003c/div\\u003e\\n\\u003cp\\u003eClusters with total expression below the 25th percentile are considered lowly expressed and are removed if unsupported by reference annotations, i.e., if:\\u003c/p\\u003e\\n\\u003cdiv id=\\\"Equp\\\"\\u003e\\n \\u003cdiv id=\\\"FileID_Equp\\\" name=\\\"EquationSource\\\"\\u003e$$\\\\:{S}_{\\\\text{group}}\\u0026lt;{\\\\text{Percentile}}_{25}\\\\left(S\\\\right)\\\\:\\\\text{ and }\\\\:{R}_{\\\\text{group}}=0$$\\u003c/div\\u003e\\n\\u003c/div\\u003e\\n\\u003cp\\u003eFor ISM-classified isoforms, ISAtools further computes their relative abundance within the cluster:\\u003c/p\\u003e\\n\\u003cdiv id=\\\"Equq\\\"\\u003e\\n 
\\u003cdiv id=\\\"FileID_Equq\\\" name=\\\"EquationSource\\\"\\u003e$$\\\\:{p}_{i}=\\\\frac{{f}_{i}}{{\\\\sum\\\\:}_{j\\\\in\\\\:G}{f}_{j}}$$\\u003c/div\\u003e\\n\\u003c/div\\u003e\\n\\u003cp\\u003eand removes those below a predefined threshold \\\\(\\\\:\\\\tau\\\\:\\\\) (default: 0.01), reducing the contribution of truncated fragments.\\u003c/p\\u003e\\n\\u003cp\\u003eFor novel isoforms (NIC/NNC), each splice site \\\\(\\\\:{s}_{q}\\\\) in the SSC \\\\(\\\\:{S}_{q}\\\\) is aligned to the nearest annotated splice site \\\\(\\\\:{s}_{r}\\\\) in \\\\(\\\\:{S}_{r}\\\\) via binary search:\\u003c/p\\u003e\\n\\u003cdiv id=\\\"Equr\\\"\\u003e\\n \\u003cdiv id=\\\"FileID_Equr\\\" name=\\\"EquationSource\\\"\\u003e$$\\\\:{\\\\widehat{s}}_{q}=arg\\\\underset{{s}_{r}\\\\in\\\\:{S}_{r}}{{min}}\\\\left|{s}_{q}-{s}_{r}\\\\right|,\\\\:\\\\forall\\\\:{s}_{q}\\\\in\\\\:{S}_{q}$$\\u003c/div\\u003e\\n\\u003c/div\\u003e\\n\\u003cp\\u003eBased on this mapping, ISAtools applies structural corrections for small shifts, skipped microexons, and spurious exon insertions, constrained by user-defined deviation thresholds.\\u003c/p\\u003e\\n\\u003cp\\u003e\\u003cstrong\\u003eRead preprocessing\\u003c/strong\\u003e\\u003c/p\\u003e\\n\\u003cp\\u003e\\u003cstrong\\u003eMinimap2 (version 2.28-r1209)\\u003c/strong\\u003e\\u003c/p\\u003e\\n\\u003cp\\u003eminimap2 -G 400k -ax splice:hq -uf -t 20 genome.fasta reads.fasta \\u0026gt; aligned.sam\\u003c/p\\u003e\\n\\u003cp\\u003esamtools view -b -o aligned.sorted.bam aligned.sam\\u003c/p\\u003e\\n\\u003cp\\u003esamtools sort -o aligned.sorted.bam aligned.sorted.bam\\u003c/p\\u003e\\n\\u003cp\\u003esamtools index aligned.sorted.bam\\u003c/p\\u003e\\n\\u003cp\\u003e\\u003cstrong\\u003eISAtools\\u003c/strong\\u003e\\u003c/p\\u003e\\n\\u003cp\\u003eWith annotation: python isatools.py -r genome.fasta --bam aligned.sorted.bam -g annotation.gtf -o output -t 20\\u003c/p\\u003e\\n\\u003cp\\u003eWithout annotation: python isatools.py -r genome.fasta --bam aligned.sorted.bam -o output -t 
20\\u003c/p\\u003e\\n\\u003cp\\u003e\\u003cstrong\\u003eIsoQuant (version 3.6.3)\\u003c/strong\\u003e\\u003c/p\\u003e\\n\\u003cp\\u003eWith annotation: isoquant.py --reference genome.fasta --bam aligned.sorted.bam --data_type pacbio_ccs -o output --genedb annotation.gtf -t 20 --complete_genedb\\u003c/p\\u003e\\n\\u003cp\\u003eWithout annotation: isoquant.py --reference genome.fasta --bam aligned.sorted.bam --data_type pacbio_ccs -o output -t 20 --complete_genedb\\u003c/p\\u003e\\n\\u003cp\\u003e\\u003cstrong\\u003eBambu (version 3.8.0)\\u003c/strong\\u003e\\u003c/p\\u003e\\n\\u003cp\\u003elibrary(bambu)\\u003c/p\\u003e\\n\\u003cp\\u003eWith annotation: bambuAnnotations \\u0026lt;- prepareAnnotations(\\u0026quot;annotation.gtf\\u0026quot;)\\u003c/p\\u003e\\n\\u003cp\\u003ese \\u0026lt;- bambu(reads = \\u0026quot;aligned.sorted.bam\\u0026quot;, annotations = bambuAnnotations, genome = \\u0026quot;genome.fasta\\u0026quot;, ncore = 20)\\u003c/p\\u003e\\n\\u003cp\\u003eWithout annotation:\\u003c/p\\u003e\\n\\u003cp\\u003ese \\u0026lt;- bambu(reads = \\u0026quot;aligned.sorted.bam\\u0026quot;, annotations = NULL, genome = \\u0026quot;genome.fasta\\u0026quot;, ncore = 20, NDR = 1)\\u003c/p\\u003e\\n\\u003cp\\u003e\\u003cstrong\\u003eMandalorion (version 5.2.2)\\u003c/strong\\u003e\\u003c/p\\u003e\\n\\u003cp\\u003eWith annotation: python Mando.py -p output -G genome.fasta -f reads.fasta -g annotation.gtf -t 20\\u003c/p\\u003e\\n\\u003cp\\u003eWithout annotation: python Mando.py -p output -G genome.fasta -f reads.fasta -t 20\\u003c/p\\u003e\\n\\u003cp\\u003e\\u003cstrong\\u003eSQANTI analysis (version 4.6.0)\\u003c/strong\\u003e\\u003c/p\\u003e\\n\\u003cp\\u003esqanti3_qc.py gencode.annotation.gtf reference_genome.fa --cpus 20 --force_id_ignore --skipORF --dir sqanti3_out -o sqanti3 --report skip\\u003c/p\\u003e\"},{\"header\":\"Declarations\",\"content\":\"\\u003cp\\u003e\\u003cstrong\\u003eData availability\\u003c/strong\\u003e\\u003c/p\\u003e\\n\\u003cp\\u003eAll datasets used in this study are publicly available. 
PacBio long-read sequencing data from Mus musculus cortex and heart tissues, used for IsoSeqSim simulations, were obtained from the ENCODE database under file IDs ENCFF565RLW, ENCFF325BXV, ENCFF584WWA, and ENCFF860CBL. Human prefrontal cortex and left/right ventricular heart tissue data were also obtained from ENCODE under file IDs ENCFF708BOP, ENCFF827DUW, ENCFF537NCV, and ENCFF615FIC. The SIRV dataset was derived from the SIRV subset of the PacBio Iso-Seq UHRR collection, available at https://downloads.pacbcloud.com/public/dataset/UHR_IsoSeq/. The high-depth PacBio UHRR dataset (Revio platform, SPRQ chemistry) is available at https://downloads.pacbcloud.com/public/dataset/Kinnex-full-length-RNA/DATA-RevioSPRQ-UHRR2024/. Source data for all figures are provided as Supplementary Data.\\u003c/p\\u003e\\n\\u003cp\\u003e\\u003cstrong\\u003eCode availability\\u003c/strong\\u003e\\u003c/p\\u003e\\n\\u003cp\\u003eISAtools is available at: https://github.com/Chenhu7/ISAtools.\\u003c/p\\u003e\\n\\u003cp\\u003e\\u003cstrong\\u003eAcknowledgements\\u003c/strong\\u003e\\u003c/p\\u003e\\n\\u003cp\\u003eWe acknowledge financial support from the National Natural Science Foundation of China (42107148, 62172369) and the Special Support Plan for High Level Talents in Zhejiang Province (2021R52019). Figure 1A includes elements adapted from Servier Medical Art (https://smart.servier.com/), provided by Servier and licensed under a Creative Commons Attribution 4.0 International License.\\u003c/p\\u003e\\n\\u003cp\\u003e\\u003cstrong\\u003eAuthor Contributions Statement\\u003c/strong\\u003e\\u003c/p\\u003e\\n\\u003cp\\u003eZ.X.S. and Q.D. conceived and designed the project. H.C. and Z.X.S. developed the ISAtools software, performed the informatics analysis, coordinated data release, and assisted with executing the pipeline. H.C. and Z.X.S. wrote the manuscript and created the figures. 
All authors have read and approved the final version of this manuscript.\\u003c/p\\u003e\\n\\u003cp\\u003e\\u003cstrong\\u003eCompeting Interests Statement\\u003c/strong\\u003e\\u003c/p\\u003e\\n\\u003cp\\u003eThe authors declare no competing interests.\\u003c/p\\u003e\"},{\"header\":\"References\",\"content\":\"\\u003col\\u003e\\u003cli\\u003e\\u003cspan\\u003ePardo-Palacios FJ et al (2024) Systematic assessment of long-read RNA-seq methods for transcript identification and quantification. Nat Methods 21:1349\\u0026ndash;1363. \\u003cspan class=\\\"ExternalRef\\\"\\u003e\\u003cspan class=\\\"RefSource\\\"\\u003e10.1038/s41592-024-02298-3\\u003c/span\\u003e\\u003cspan address=\\\"10.1038/s41592-024-02298-3\\\" targettype=\\\"DOI\\\" class=\\\"RefTarget\\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/li\\u003e\\u003cli\\u003e\\u003cspan\\u003eSharon D, Tilgner H, Grubert F, Snyder M (2013) A single-molecule long-read survey of the human transcriptome. Nat Biotechnol 31:1009\\u0026ndash;1014. \\u003cspan class=\\\"ExternalRef\\\"\\u003e\\u003cspan class=\\\"RefSource\\\"\\u003e10.1038/nbt.2705\\u003c/span\\u003e\\u003cspan address=\\\"10.1038/nbt.2705\\\" targettype=\\\"DOI\\\" class=\\\"RefTarget\\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/li\\u003e\\u003cli\\u003e\\u003cspan\\u003eTilgner H, Grubert F, Sharon D, Snyder MP (2014) Defining a personal, allele-specific, and single-molecule long-read transcriptome. Proc Natl Acad Sci U S A 111:9869\\u0026ndash;9874. \\u003cspan class=\\\"ExternalRef\\\"\\u003e\\u003cspan class=\\\"RefSource\\\"\\u003e10.1073/pnas.1400447111\\u003c/span\\u003e\\u003cspan address=\\\"10.1073/pnas.1400447111\\\" targettype=\\\"DOI\\\" class=\\\"RefTarget\\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/li\\u003e\\u003cli\\u003e\\u003cspan\\u003eKovaka S et al (2019) Transcriptome assembly from long-read RNA-seq alignments with StringTie2. 
Genome Biol 20:278. \\u003cspan class=\\\"ExternalRef\\\"\\u003e\\u003cspan class=\\\"RefSource\\\"\\u003e10.1186/s13059-019-1910-1\\u003c/span\\u003e\\u003cspan address=\\\"10.1186/s13059-019-1910-1\\\" targettype=\\\"DOI\\\" class=\\\"RefTarget\\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/li\\u003e\\u003cli\\u003e\\u003cspan\\u003eTang AD et al (2020) Full-length transcript characterization of SF3B1 mutation in chronic lymphocytic leukemia reveals downregulation of retained introns. Nat Commun 11:1438. \\u003cspan class=\\\"ExternalRef\\\"\\u003e\\u003cspan class=\\\"RefSource\\\"\\u003e10.1038/s41467-020-15171-6\\u003c/span\\u003e\\u003cspan address=\\\"10.1038/s41467-020-15171-6\\\" targettype=\\\"DOI\\\" class=\\\"RefTarget\\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/li\\u003e\\u003cli\\u003e\\u003cspan\\u003eAmarasinghe SL et al (2020) Opportunities and challenges in long-read sequencing data analysis. Genome Biol 21. \\u003cspan class=\\\"ExternalRef\\\"\\u003e\\u003cspan class=\\\"RefSource\\\"\\u003e10.1186/s13059-020-1935-5\\u003c/span\\u003e\\u003cspan address=\\\"10.1186/s13059-020-1935-5\\\" targettype=\\\"DOI\\\" class=\\\"RefTarget\\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/li\\u003e\\u003cli\\u003e\\u003cspan\\u003eBurgess DJ, Genomics (2018) Next regeneration sequencing for reference genomes. Nat Rev Genet 19:125. \\u003cspan class=\\\"ExternalRef\\\"\\u003e\\u003cspan class=\\\"RefSource\\\"\\u003e10.1038/nrg.2018.5\\u003c/span\\u003e\\u003cspan address=\\\"10.1038/nrg.2018.5\\\" targettype=\\\"DOI\\\" class=\\\"RefTarget\\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/li\\u003e\\u003cli\\u003e\\u003cspan\\u003ePollard MO, Gurdasani D, Mentzer AJ, Porter T, Sandhu MS (2018) Long reads: their purpose and place. Hum Mol Genet 27:R234\\u0026ndash;R241. 
\\u003cspan class=\\\"ExternalRef\\\"\\u003e\\u003cspan class=\\\"RefSource\\\"\\u003e10.1093/hmg/ddy177\\u003c/span\\u003e\\u003cspan address=\\\"10.1093/hmg/ddy177\\\" targettype=\\\"DOI\\\" class=\\\"RefTarget\\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/li\\u003e\\u003cli\\u003e\\u003cspan\\u003eTardaguila M et al (2018) SQANTI: extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification. Genome Res 28:396\\u0026ndash;411. \\u003cspan class=\\\"ExternalRef\\\"\\u003e\\u003cspan class=\\\"RefSource\\\"\\u003e10.1101/gr.222976.117\\u003c/span\\u003e\\u003cspan address=\\\"10.1101/gr.222976.117\\\" targettype=\\\"DOI\\\" class=\\\"RefTarget\\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/li\\u003e\\u003cli\\u003e\\u003cspan\\u003eReese MG et al (2000) Genome annotation assessment in Drosophila melanogaster. Genome Res 10:483\\u0026ndash;501. \\u003cspan class=\\\"ExternalRef\\\"\\u003e\\u003cspan class=\\\"RefSource\\\"\\u003e10.1101/gr.10.4.483\\u003c/span\\u003e\\u003cspan address=\\\"10.1101/gr.10.4.483\\\" targettype=\\\"DOI\\\" class=\\\"RefTarget\\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/li\\u003e\\u003cli\\u003e\\u003cspan\\u003eGuigo R et al (2006) EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol 7(1):1\\u0026ndash;31. \\u003cspan class=\\\"ExternalRef\\\"\\u003e\\u003cspan class=\\\"RefSource\\\"\\u003e10.1186/gb-2006-7-s1-s2\\u003c/span\\u003e\\u003cspan address=\\\"10.1186/gb-2006-7-s1-s2\\\" targettype=\\\"DOI\\\" class=\\\"RefTarget\\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/li\\u003e\\u003cli\\u003e\\u003cspan\\u003eEngstrom PG et al (2013) Systematic evaluation of spliced alignment programs for RNA-seq data. Nat Methods 10:1185\\u0026ndash;1191. 
\\u003cspan class=\\\"ExternalRef\\\"\\u003e\\u003cspan class=\\\"RefSource\\\"\\u003e10.1038/nmeth.2722\\u003c/span\\u003e\\u003cspan address=\\\"10.1038/nmeth.2722\\\" targettype=\\\"DOI\\\" class=\\\"RefTarget\\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/li\\u003e\\u003cli\\u003e\\u003cspan\\u003eSteijger T et al (2013) Assessment of transcript reconstruction methods for RNA-seq. Nat Methods 10:1177\\u0026ndash;1184. \\u003cspan class=\\\"ExternalRef\\\"\\u003e\\u003cspan class=\\\"RefSource\\\"\\u003e10.1038/nmeth.2714\\u003c/span\\u003e\\u003cspan address=\\\"10.1038/nmeth.2714\\\" targettype=\\\"DOI\\\" class=\\\"RefTarget\\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/li\\u003e\\u003cli\\u003e\\u003cspan\\u003eWeirather JL et al (2017) Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis. \\u003cem\\u003eF1000Res\\u003c/em\\u003e 6, 100. \\u003cspan class=\\\"ExternalRef\\\"\\u003e\\u003cspan class=\\\"RefSource\\\"\\u003e10.12688/f1000research.10571.2\\u003c/span\\u003e\\u003cspan address=\\\"10.12688/f1000research.10571.2\\\" targettype=\\\"DOI\\\" class=\\\"RefTarget\\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/li\\u003e\\u003cli\\u003e\\u003cspan\\u003eSoneson C et al (2019) A comprehensive examination of Nanopore native RNA sequencing for characterization of complex transcriptomes. Nat Commun 10:3359. \\u003cspan class=\\\"ExternalRef\\\"\\u003e\\u003cspan class=\\\"RefSource\\\"\\u003e10.1038/s41467-019-11272-z\\u003c/span\\u003e\\u003cspan address=\\\"10.1038/s41467-019-11272-z\\\" targettype=\\\"DOI\\\" class=\\\"RefTarget\\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/li\\u003e\\u003cli\\u003e\\u003cspan\\u003eManuel JG et al (2023) High Coverage Highly Accurate Long-Read Sequencing of a Mouse Neuronal Cell Line Using the PacBio Revio Sequencer. bioRxiv. 
\\u003cspan class=\\\"ExternalRef\\\"\\u003e\\u003cspan class=\\\"RefSource\\\"\\u003e10.1101/2023.06.06.543940\\u003c/span\\u003e\\u003cspan address=\\\"10.1101/2023.06.06.543940\\\" targettype=\\\"DOI\\\" class=\\\"RefTarget\\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/li\\u003e\\u003cli\\u003e\\u003cspan\\u003ePrjibelski AD et al (2023) Accurate isoform discovery with IsoQuant using long reads. Nat Biotechnol 41:915\\u0026ndash;918. \\u003cspan class=\\\"ExternalRef\\\"\\u003e\\u003cspan class=\\\"RefSource\\\"\\u003e10.1038/s41587-022-01565-y\\u003c/span\\u003e\\u003cspan address=\\\"10.1038/s41587-022-01565-y\\\" targettype=\\\"DOI\\\" class=\\\"RefTarget\\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/li\\u003e\\u003cli\\u003e\\u003cspan\\u003eChen Y et al (2023) Context-aware transcript quantification from long-read RNA-seq data with Bambu. Nat Methods 20:1187\\u0026ndash;1195. \\u003cspan class=\\\"ExternalRef\\\"\\u003e\\u003cspan class=\\\"RefSource\\\"\\u003e10.1038/s41592-023-01908-w\\u003c/span\\u003e\\u003cspan address=\\\"10.1038/s41592-023-01908-w\\\" targettype=\\\"DOI\\\" class=\\\"RefTarget\\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/li\\u003e\\u003cli\\u003e\\u003cspan\\u003eVolden R et al (2023) Identifying and quantifying isoforms from accurate full-length transcriptome sequencing reads with Mandalorion. Genome Biol 24:167. \\u003cspan class=\\\"ExternalRef\\\"\\u003e\\u003cspan class=\\\"RefSource\\\"\\u003e10.1186/s13059-023-02999-6\\u003c/span\\u003e\\u003cspan address=\\\"10.1186/s13059-023-02999-6\\\" targettype=\\\"DOI\\\" class=\\\"RefTarget\\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/li\\u003e\\u003cli\\u003e\\u003cspan\\u003eAlfonso-Gonzalez C et al (2023) Sites of transcription initiation drive mRNA isoform selection. \\u003cem\\u003eCell\\u003c/em\\u003e 186, 2438\\u0026ndash;2455 e2422. 
\\u003cspan class=\\\"ExternalRef\\\"\\u003e\\u003cspan class=\\\"RefSource\\\"\\u003e10.1016/j.cell.2023.04.012\\u003c/span\\u003e\\u003cspan address=\\\"10.1016/j.cell.2023.04.012\\\" targettype=\\\"DOI\\\" class=\\\"RefTarget\\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/li\\u003e\\u003cli\\u003e\\u003cspan\\u003ePardo-Palacios FJ et al (2024) SQANTI3: curation of long-read transcriptomes for accurate identification of known and novel isoforms. Nat Methods 21:793\\u0026ndash;797. \\u003cspan class=\\\"ExternalRef\\\"\\u003e\\u003cspan class=\\\"RefSource\\\"\\u003e10.1038/s41592-024-02229-2\\u003c/span\\u003e\\u003cspan address=\\\"10.1038/s41592-024-02229-2\\\" targettype=\\\"DOI\\\" class=\\\"RefTarget\\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/li\\u003e\\u003cli\\u003e\\u003cspan\\u003eAbugessaisa I et al (2019) refTSS: A Reference Data Set for Human and Mouse Transcription Start Sites. J Mol Biol 431:2407\\u0026ndash;2422. \\u003cspan class=\\\"ExternalRef\\\"\\u003e\\u003cspan class=\\\"RefSource\\\"\\u003e10.1016/j.jmb.2019.04.045\\u003c/span\\u003e\\u003cspan address=\\\"10.1016/j.jmb.2019.04.045\\\" targettype=\\\"DOI\\\" class=\\\"RefTarget\\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/li\\u003e\\u003cli\\u003e\\u003cspan\\u003eHerrmann CJ et al (2020) PolyASite 2.0: a consolidated atlas of polyadenylation sites from 3' end sequencing. Nucleic Acids Res 48:D174\\u0026ndash;D179. \\u003cspan class=\\\"ExternalRef\\\"\\u003e\\u003cspan class=\\\"RefSource\\\"\\u003e10.1093/nar/gkz918\\u003c/span\\u003e\\u003cspan address=\\\"10.1093/nar/gkz918\\\" targettype=\\\"DOI\\\" class=\\\"RefTarget\\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/li\\u003e\\u003cli\\u003e\\u003cspan\\u003eMonzo C, Liu T, Conesa A (2025) Transcriptomics in the era of long-read sequencing. Nat Rev Genet. 
\\u003cspan class=\\\"ExternalRef\\\"\\u003e\\u003cspan class=\\\"RefSource\\\"\\u003e10.1038/s41576-025-00828-z\\u003c/span\\u003e\\u003cspan address=\\\"10.1038/s41576-025-00828-z\\\" targettype=\\\"DOI\\\" class=\\\"RefTarget\\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\u003c/li\\u003e\\u003c/ol\\u003e\"}],\"fulltextSource\":\"\",\"fullText\":\"\",\"funders\":[],\"hasAdminPriorityOnWorkflow\":false,\"hasManuscriptDocX\":true,\"hasOptedInToPreprint\":true,\"hasPassedJournalQc\":\"\",\"hasAnyPriority\":true,\"hideJournal\":true,\"highlight\":\"\",\"institution\":\"\",\"isAcceptedByJournal\":false,\"isAuthorSuppliedPdf\":false,\"isDeskRejected\":\"\",\"isHiddenFromSearch\":false,\"isInQc\":false,\"isInWorkflow\":false,\"isPdf\":false,\"isPdfUpToDate\":true,\"isWithdrawnOrRetracted\":false,\"journal\":{\"display\":true,\"email\":\"info@researchsquare.com\",\"identity\":\"researchsquare\",\"isNatureJournal\":false,\"hasQc\":true,\"allowDirectSubmit\":true,\"externalIdentity\":\"\",\"sideBox\":\"\",\"snPcode\":\"\",\"submissionUrl\":\"/submission\",\"title\":\"Research Square\",\"twitterHandle\":\"researchsquare\",\"acdcEnabled\":true,\"dfaEnabled\":false,\"editorialSystem\":\"\",\"reportingPortfolio\":\"\",\"inReviewEnabled\":false,\"inReviewRevisionsEnabled\":true},\"keywords\":\"\",\"lastPublishedDoi\":\"10.21203/rs.3.rs-7019918/v1\",\"lastPublishedDoiUrl\":\"https://doi.org/10.21203/rs.3.rs-7019918/v1\",\"license\":{\"name\":\"CC BY 4.0\",\"url\":\"https://creativecommons.org/licenses/by/4.0/\"},\"manuscriptAbstract\":\"\\u003cp\\u003eAccurate identification and quantification of full-length RNA isoforms remain challenging in long-read RNA sequencing due to sequencing errors, complex splicing, and incomplete annotations. We present ISAtools, a sequencing data-driven framework that leverages weakly supervised static references to reconstruct and quantify full-length isoforms, including their splicing structures and transcript boundaries. 
Benchmarking on simulated, SIRV, and biological datasets shows that ISAtools achieves high accuracy across varying sequencing depths, annotation completeness, and transcriptomic complexity, while maintaining fast runtime and low memory usage. These results demonstrate that ISAtools enables efficient and accurate identification and quantification of full-length RNA isoforms from high-throughput long-read RNA sequencing.\\u003c/p\\u003e\",\"manuscriptTitle\":\"Efficient Full-Length RNA Isoform Reconstruction with ISAtools\",\"msid\":\"\",\"msnumber\":\"\",\"nonDraftVersions\":[{\"code\":1,\"date\":\"2025-08-07 07:26:35\",\"doi\":\"10.21203/rs.3.rs-7019918/v1\",\"editorialEvents\":[{\"type\":\"communityComments\",\"content\":0}],\"status\":\"published\",\"journal\":{\"display\":true,\"email\":\"info@researchsquare.com\",\"identity\":\"researchsquare\",\"isNatureJournal\":false,\"hasQc\":true,\"allowDirectSubmit\":true,\"externalIdentity\":\"\",\"sideBox\":\"\",\"snPcode\":\"\",\"submissionUrl\":\"/submission\",\"title\":\"Research Square\",\"twitterHandle\":\"researchsquare\",\"acdcEnabled\":true,\"dfaEnabled\":false,\"editorialSystem\":\"\",\"reportingPortfolio\":\"\",\"inReviewEnabled\":false,\"inReviewRevisionsEnabled\":true}}],\"origin\":\"\",\"ownerIdentity\":\"c3ab1ef6-f42a-4efa-84ba-2145e13d52e3\",\"owner\":[],\"postedDate\":\"August 7th, 2025\",\"published\":true,\"recentEditorialEvents\":[],\"rejectedJournal\":[],\"revision\":\"\",\"amendment\":\"\",\"status\":\"posted\",\"subjectAreas\":[{\"id\":52054251,\"name\":\"Biological sciences/Computational biology and bioinformatics/Software\"},{\"id\":52054252,\"name\":\"Biological sciences/Genetics/Genomics/Transcriptomics\"}],\"tags\":[],\"updatedAt\":\"2025-08-15T19:15:17+00:00\",\"versionOfRecord\":[],\"versionCreatedAt\":\"2025-08-07 
07:26:35\",\"video\":\"\",\"vorDoi\":\"\",\"vorDoiUrl\":\"\",\"workflowStages\":[]},\"version\":\"v1\",\"identity\":\"rs-7019918\",\"journalConfig\":\"researchsquare\"},\"__N_SSP\":true},\"page\":\"/article/[identity]/[[...version]]\",\"query\":{\"redirect\":\"/article/rs-7019918\",\"identity\":\"rs-7019918\",\"version\":[\"v1\"]},\"buildId\":\"XKTyCvWXoU3ODBz1xrDgd\",\"isFallback\":false,\"isExperimentalCompile\":false,\"dynamicIds\":[84888],\"gssp\":true,\"scriptLoader\":[]}","source_license":"CC-BY-4.0","license_restricted":false}