Benchmarking Genomic Foundation Models for Gene Fusion Detection from DNA Sequences

doi:10.21203/rs.3.rs-8360344/v1

2025 · doi:10.21203/rs.3.rs-8360344/v1

preprint OA: closed

Research Article

Radim Krupička, Mariana Komárková, Bohuslav Dvorský, Kateřina Kollinová, and 1 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-8360344/v1

This work is licensed under a CC BY 4.0 License.

Status: Published. Journal publication published 06 Apr, 2026; the published version is available in BioData Mining. Version 1 posted 15. You are reading the latest preprint version.

Abstract

Background

Gene fusions are critical drivers of oncogenesis and diagnostic biomarkers in various cancers. However, their detection from RNA or DNA sequencing, when performed using traditional analytical methods, encounters challenges related to sample quality, computational complexity, and noise. Although deep learning is more robust, it usually requires large labeled datasets and substantial training resources.
Genomic foundation models (GFMs), which are pre-trained on pangenome-scale data, offer a promising solution to these issues.

Methods

This study presents the first comprehensive benchmark of four transformer-based GFMs (Nucleotide Transformer, Evo2, HyenaDNA, and DNABERT2) for gene fusion detection. Using the curated FusionAI dataset of ~52,000 sequences, we extracted embeddings from 10-kilobase-pair (kbp) DNA sequences surrounding fusion breakpoints. We evaluated the quality of these representations qualitatively using t-SNE visualization and quantitatively by training lightweight classifiers (Support Vector Machines and simple Neural Networks) on the fixed embeddings.

Results

The Nucleotide Transformer achieved the best performance with an accuracy of 0.967 and an F1 score of 0.967. This result outperformed the dedicated deep learning baseline (FusionAI, with an accuracy of 0.894). Evo2 was the second-best performer (accuracy: 0.920), demonstrating robustness derived from evolutionary pretraining. Conversely, DNABERT2 failed to compete (accuracy 0.677–0.723). Furthermore, sample efficiency analysis revealed that the Nucleotide Transformer required only ~2,600 samples to reach 95% of its peak performance, whereas the baseline required over 14,000 samples.

Conclusions

These findings demonstrate that advanced GFMs, particularly the NT and Evo2 models, generate highly discriminative 'out-of-the-box' embeddings. These embeddings significantly outperform dedicated deep learning baselines while requiring a fraction of the training data and computational time. This suggests that GFMs could be a scalable, data-efficient way of developing precise genomic diagnostic tools, particularly for rare diseases.
Keywords: genomic foundation models, gene fusion detection, DNA sequence analysis, transformer models, Nucleotide Transformer, Evo2, bioinformatics, benchmarking

Introduction

Gene fusions are genetic alterations that typically arise from large-scale structural changes in the DNA, leading to the joining of two previously non-adjacent genes [1, 2]. These fusions can result in the production of abnormal proteins or cause dysregulation of gene expression. Aberrant gene fusions are implicated in the pathogenesis of various cancers, including hematologic malignancies such as leukemia, as well as numerous solid tumors. They serve as important diagnostic, prognostic, and therapeutic biomarkers.

Fusion events are most commonly detected through RNA sequencing because it captures expressed fusion transcripts. DNA sequencing can also be used, albeit less frequently because of its higher cost and lower sensitivity to expressed fusions [3]. Both approaches face significant technical challenges, including degraded RNA samples, variability in library preparation, high computational demands, and data noise, all of which can contribute to both false positives and false negatives [4, 5].

Standard bioinformatic tools, such as Arriba [6] and STAR-Fusion [7], have been widely adopted to address these analytical challenges. While these tools offer high specificity, they often lack robustness and generalizability when dealing with variable or low-quality data [7, 8]. Additionally, these tools require expert-driven parameter tuning to perform optimally [9], which hinders their scalability for large research cohorts and clinical applications [10]. These limitations have driven the current trend toward machine learning (ML) and deep learning (DL) models, which aim to reliably extract fusion signals from complex, noisy data with minimal manual intervention [11–13].
However, developing DL models traditionally presents significant challenges, including the need for large labeled training datasets and substantial domain expertise. These are common bottlenecks in genomics [14, 15]. The recent emergence of genomic foundation models (GFMs), which are based on large language model architectures, offers a potential solution [16]. GFMs are pre-trained on vast, pangenome-scale datasets and can be fine-tuned for specific downstream tasks. They reportedly achieve high accuracy even with limited data.

A pioneering example is DNABERT [17], which adapts the BERT (Bidirectional Encoder Representations from Transformers) architecture to genomic sequences. DNABERT has demonstrated strong performance in predicting promoter regions, splice sites, and transcription factor binding sites. Building on this foundation, subsequent studies have fine-tuned or extended the model for various tasks. These include msBERT-Promoter [18], a model for DNA promoter identification and strength estimation; PLANNER [19], a model for predicting origins of replication sites; BERT-TFBS [20]; and MutBERT [21].

A more recent and versatile model family is the Nucleotide Transformer [22], designed to predict molecular phenotypes directly from DNA sequences. Trained on 3,202 diverse human genomes and 850 non-human genomes, it leverages multi-task learning and transfer learning to overcome limitations posed by scarce annotated data. Its applications include functional element detection, chromatin accessibility analysis, and variant prioritization.

Emerging models also target specific biological domains. For example, the Genetic Transformer [23] identifies causal variants in rare diseases, while HyenaDNA [24] captures long-range dependencies and processes DNA sequences up to one million nucleotides long. The Evo2 model [25] extends HyenaDNA's capabilities by incorporating evolutionary conservation data into its architecture.
This enables better prediction of the functional impact of genetic variants by improving the modeling of sequence conservation and long-range dependencies.

While these models show great promise across various genomic tasks, their application to gene fusion detection remains unexplored. Because pre-trained GFMs encode rich biological features into vector representations, we hypothesize that these embeddings can be used as inputs for a simple classifier that distinguishes between fusion-positive and fusion-negative sequences. This approach may offer improved accuracy and training efficiency while requiring significantly less labeled data. To test this hypothesis, this article compares the performance of current foundation models. We used the dataset generated in the FusionAI study [11] (previously used for deep learning) to explore whether GFMs could improve performance on this existing benchmark.

Methods

For model evaluation, we used the dataset introduced and described in the FusionAI study [11], comprising approximately 26K fusion-positive and 26K fusion-negative sequences. From the total of ~52K sequences, we used ~36K (~18K positive and ~18K negative) for training, identical to the training set in the original study. The remaining ~16K sequences were evenly divided into validation (~8K) and test (~8K) sets, while preserving the same ratio of positive and negative samples. We kept all data partitions consistent across experiments to ensure a fair comparison between models.

Each data sample consisted of a 10 kbp sequence surrounding the fusion breakpoint in gene 1 (sequence 1) and gene 2 (sequence 2). Each sequence was encoded using four widely used genomic foundation models based on the transformer architecture: Nucleotide Transformer [22], HyenaDNA [24], Evo2 [25], and DNABERT2 [17].
For each model, we tokenized the sequences using the corresponding tokenizer and extracted embeddings from the hidden states of an internal layer recommended by the model authors, as detailed below.

Nucleotide Transformer (NT)

For embedding generation, the Nucleotide Transformer 500M_multi_species_v2 model was used. The Nucleotide Transformer was configured with a maximum sequence length of 1,671 tokens, and 1024-dimensional embeddings were extracted from the 20th transformer layer (out of 24 total layers). Given the model's 6-mer tokenization scheme, this creates a receptive field of approximately 10,026 bp, which fully encompasses the 10 kbp input sequences centered on the gene fusion breakpoints. This configuration allowed the model to process the entire region of interest in a single pass without the need for truncation or sliding window strategies.

Evo2 (Evo)

For the Evo2 model, sequences were encoded using the evo2_7b checkpoint. Unlike k-mer-based approaches, Evo2 uses a byte-level tokenizer: each character in the DNA string is converted to its corresponding UTF-8 integer value, so each nucleotide maps directly to a single token and single-base resolution is preserved. Embeddings were extracted from the blocks.28.mlp.l3 internal layer, where the embedding dimension was 4096. Leveraging the model's StripedHyena architecture designed for long-range genomic modeling, we processed the input sequences at their full 10 kbp length (10,000 tokens) without truncation or downsampling.

HyenaDNA (Hyena)

The "hyenadna-large-1m-seqlen-hf" model, which supports sequences of up to 1 million tokens, was used. It employs a character-level tokenizer in which each nucleotide corresponds to a single token. HyenaDNA's sub-quadratic operator allowed for efficient processing of the full input at single-nucleotide resolution without the need for downsampling or token aggregation.
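
As an illustration of the tokenization arithmetic described for the three models above, the following Python sketch checks that 1,671 six-mer tokens span the 10 kbp window and shows what a byte-level encoding of a short DNA string looks like. The helper names are hypothetical and are not part of any model library; real tokenizers also add special tokens and treat ambiguous bases separately, which this sketch ignores.

```python
# Illustrative sketch only: helper names are hypothetical, not library APIs.

def kmer_span_bp(num_tokens: int, k: int = 6) -> int:
    """Bases covered by num_tokens non-overlapping k-mer tokens."""
    return num_tokens * k


def byte_level_ids(sequence: str) -> list[int]:
    """Map each character to its UTF-8 byte value (one token per base)."""
    return list(sequence.encode("utf-8"))


WINDOW_BP = 10_000  # 10 kbp window around each breakpoint

# 6-mer setting reported for the Nucleotide Transformer.
nt_span = kmer_span_bp(1_671, k=6)

# Byte-level / character-level setting (Evo2- and HyenaDNA-style).
toy_ids = byte_level_ids("ACGT")
single_base_tokens = len(byte_level_ids("A" * WINDOW_BP))

print(nt_span, nt_span >= WINDOW_BP)  # 10026 True
print(toy_ids)                        # [65, 67, 71, 84]
print(single_base_tokens)             # 10000
```

The contrast is the point of the sketch: a k-mer tokenizer shortens the token sequence by a factor of k, whereas single-base tokenizers need one token per nucleotide and therefore rely on architectures that handle long contexts.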
HyenaDNA embeddings were extracted from the model's final hidden layer, where the embedding dimension was 256.

DNABERT2 (BERT)

For the BERT-based approach, the zhihan1996/DNABERT-2-117M model was used. DNABERT-2 employs Byte Pair Encoding (BPE), which results in variable token counts for fixed-length DNA sequences. To enable consistent batch processing and preserve spatial alignment, we standardized all input tensors to a fixed length of 2,143 tokens. This threshold was empirically determined to accommodate the maximum tokenized length of any 10 kbp sequence in our dataset. We implemented a symmetric padding strategy, adding special tokens to both ends of shorter sequences. Sequences were tokenized with the model's own tokenizer. The final embeddings were then extracted from the hidden states of the last transformer layer, where the embedding dimension was 768.

From each sequence, only the middle embedding was used, which corresponds to the fusion breakpoint. Due to the contextual nature of the embeddings, this central embedding also encodes information from the surrounding sequence. We concatenated the middle embeddings from both sequences and used the resulting vector for classification.

We visualized embedding quality using t-distributed Stochastic Neighbor Embedding (t-SNE) with perplexity = 30 and 1,000 iterations on a random subset of 1,000 samples. Class separability was qualitatively assessed by visual inspection of the 2D projections.

Classification

For classification, we used two classifiers following the FusionAI article [11]: (i) a support vector machine (SVM) with an RBF kernel, C = 1.0, and γ computed automatically as 1/(n_features × variance), where hyperparameters were not tuned via cross-validation and fixed values were used across all experiments; and (ii) a fully connected neural network (NN) with an architecture adopted from the FusionAI article [11].
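
The feature construction and the SVM kernel width described above can be sketched as follows. This is a minimal illustration in plain Python with toy numbers, not the authors' code: the function names are hypothetical, taking index len // 2 as the "middle" token is an assumption, and the γ helper simply restates 1/(n_features × variance) with the variance taken over all entries of the feature matrix (the convention used by scikit-learn's gamma="scale" setting).

```python
from statistics import pvariance


def middle_embedding(token_embeddings):
    """Return the embedding of the central token (assumed index: len // 2)."""
    if not token_embeddings:
        raise ValueError("sequence has no token embeddings")
    return list(token_embeddings[len(token_embeddings) // 2])


def fusion_feature(seq1_embeddings, seq2_embeddings):
    """Concatenate the middle embeddings of the two breakpoint windows."""
    return middle_embedding(seq1_embeddings) + middle_embedding(seq2_embeddings)


def scale_gamma(feature_matrix):
    """RBF gamma = 1 / (n_features * variance of all matrix entries)."""
    n_features = len(feature_matrix[0])
    values = [v for row in feature_matrix for v in row]
    return 1.0 / (n_features * pvariance(values))


# Toy data: 3 tokens per window, 2-dimensional embeddings.
gene1 = [[0.0, 0.1], [1.0, 2.0], [0.3, 0.4]]
gene2 = [[0.5, 0.6], [3.0, 4.0], [0.7, 0.8]]
feature = fusion_feature(gene1, gene2)

# Toy feature matrix: 2 samples x 2 features, entries 0, 1, 2, 3.
toy_matrix = [[0.0, 1.0], [2.0, 3.0]]
gamma = scale_gamma(toy_matrix)

print(feature)  # [1.0, 2.0, 3.0, 4.0]
print(gamma)    # 0.4
```

With real embeddings the concatenated vector would have twice the model's embedding dimension, for example 2 × 1,024 = 2,048 values per sample for the Nucleotide Transformer.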
To allow for a direct comparison, we reimplemented the entire CNN architecture as described in FusionAI [11] (see Fig. 1). The classifier architecture consists of a feedforward neural network with one hidden layer containing 32 neurons with ReLU activation, followed by dropout (p = 0.4) and a softmax output layer for binary classification. The model was compiled using the Adadelta optimizer with categorical cross-entropy loss. Training was performed with a batch size of 256. The neural networks were trained for up to 1,000 epochs to ensure training stability and convergence.

Evaluation Metrics

We evaluated model performance on the test set using accuracy (percentage of correct predictions), precision (weighted average across classes), recall (weighted average across classes), F1 score (weighted average), and area under the ROC curve (AUC-ROC, macro-averaged for multi-class). For neural networks, we report the final-epoch performance; for SVMs, we report the results of a single training run.

Sample efficiency

To assess sample efficiency, we trained models on 19 stratified training subsets ranging from 200 to 36,302 samples. We fit logarithmic functions (y = a·ln(x) + b) to the resulting learning curves and computed: (1) the sample size required to reach 95% of final accuracy (Samples@95%) by inverting the fitted curve, x₉₅ = exp((0.95·y_final − b) / a), and (2) a logarithmic efficiency score (Efficiency@95%), defined as y_final / ln(x₉₅), which quantifies accuracy achieved per natural-logarithm unit of training data. Higher efficiency scores indicate models that reach high performance with logarithmically fewer samples. Accuracy values were normalized to 0–100% of each model's maximum performance for cross-model visualization.

Implementation details

All models were implemented in Python 3.11 using Keras 3.0 with the PyTorch backend. Neural network training used the Adadelta optimizer with default Keras parameters (learning_rate = 1.0, rho = 0.95, epsilon = 1e-07).
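
The Samples@95% and Efficiency@95% definitions given in the sample-efficiency paragraph above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the fit uses ordinary least squares on ln(x) (the source does not say which fitting routine was used), and the learning curve is synthetic, generated from a = 5 and b = 40 so that the fit can be checked.

```python
import math


def fit_log_curve(sizes, accuracies):
    """Ordinary least-squares fit of y = a * ln(x) + b; returns (a, b)."""
    logs = [math.log(x) for x in sizes]
    n = len(logs)
    mean_u = sum(logs) / n
    mean_y = sum(accuracies) / n
    sxx = sum((u - mean_u) ** 2 for u in logs)
    sxy = sum((u - mean_u) * (y - mean_y) for u, y in zip(logs, accuracies))
    a = sxy / sxx
    return a, mean_y - a * mean_u


def samples_at_95(a, b, y_final):
    """Invert the fitted curve at 95% of the final accuracy."""
    return math.exp((0.95 * y_final - b) / a)


def efficiency_at_95(y_final, x95):
    """Accuracy per natural-log unit of training samples."""
    return y_final / math.log(x95)


# Synthetic learning curve (accuracy in percent), NOT data from the study.
sizes = [200, 1_000, 5_000, 36_302]
accuracies = [5.0 * math.log(x) + 40.0 for x in sizes]

a, b = fit_log_curve(sizes, accuracies)
y_final = accuracies[-1]
x95 = samples_at_95(a, b, y_final)
score = efficiency_at_95(y_final, x95)

print(round(a, 6), round(b, 6))
print(round(x95), round(score, 2))
```

Plugging the FusionAI values reported in this study (final accuracy 0.894 expressed as 89.4%, and 14,200 samples) into efficiency_at_95 gives about 9.35, which agrees with the score listed in Table 2; this suggests, but does not confirm, that accuracy enters the score in percent.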
SVM models were trained using scikit-learn 1.4 with default parameters unless otherwise specified. Random sampling and data splitting used a fixed seed (42) to ensure reproducibility across different training set sizes. Training was performed on NVIDIA H100 GPUs with 96 GB memory. All source code and notebooks with results used for comparison are available at https://github.com/kbi-fbmi/articles--2026fusionEmbBenchmark/ . The data and results files are available on Zenodo [26].

Results

Visual Assessment of Embedding Quality

We visualized the embeddings of fusion-positive and fusion-negative sequences using t-SNE to qualitatively assess whether genomic foundation models (GFMs) capture features relevant to gene fusion detection. The resulting projections revealed differences in class separability across the evaluated models (see Fig. 2).

Classification Performance

We evaluated the efficacy of these embeddings using two classifiers, an SVM and a neural network, and compared them against the FusionAI baseline model, which is a dedicated deep learning protocol. Table 1 summarizes the results on the full test set (~8K samples), and Fig. 2 shows training convergence.

Nucleotide Transformer achieved the highest overall performance, significantly outperforming the baseline. Using the SVM classifier, NT achieved an accuracy of 0.967 and an F1 score of 0.967, whereas FusionAI achieved an accuracy of 0.894. The neural network classifier yielded nearly identical results for NT (accuracy: 0.966; ROC AUC: 0.994), demonstrating the robustness of these embeddings regardless of the classification method.

Evo2 was the second-best performer, consistently surpassing the FusionAI baseline. Both the support vector machine (SVM) and neural network (NN) classifiers achieved an accuracy and F1 score of 0.920 and a ROC AUC of 0.970–0.975.

HyenaDNA produced mixed results. When paired with a simple neural network, its performance was lower than the baseline (accuracy: 0.857 vs. 0.894).
However, using an SVM improved HyenaDNA’s performance, achieving an accuracy of 0.900 and slightly surpassing the FusionAI baseline.

DNABERT2 failed to compete with the other foundation models or the baseline. Its accuracy ranged from 0.677 (NN) to 0.723 (SVM), and its ROC AUC was significantly lower at 0.745–0.799.

Table 1. Comparative performance of embedding models versus the reimplemented FusionAI baseline on the full test set.

Model      Classifier  Accuracy  Precision  Recall  F1 Score  ROC AUC
FusionAI   nn          0.894     0.894      0.894   0.894     0.960
NT         nn          0.966     0.966      0.966   0.966     0.994
NT         svm         0.967     0.972      0.962   0.967     0.995
Evo        nn          0.920     0.920      0.920   0.920     0.970
Evo        svm         0.920     0.920      0.920   0.920     0.975
Hyena      nn          0.857     0.858      0.857   0.857     0.936
Hyena      svm         0.900     0.880      0.925   0.902     0.962
BERT       nn          0.677     0.678      0.677   0.676     0.745
BERT       svm         0.723     0.737      0.690   0.713     0.799

Computational Efficiency

A major advantage of the GFM-based approach observed in this study is the dramatic reduction in training time. The original FusionAI protocol required approximately 40 hours to train for 1,000 epochs to ensure stability. In contrast, the foundation model workflow, comprising embedding extraction and training of the lightweight classifiers, was completed in under 10 minutes.

Sample Efficiency

To evaluate how well the models perform in data-scarce regimes, we analyzed learning curves across training subsets ranging from 200 to ~36,000 samples. We calculated the number of samples required to reach 95% of the model's final accuracy (Samples@95%) and a logarithmic efficiency score (see Table 2 and Fig. 2).

Nucleotide Transformer demonstrated superior sample efficiency, requiring only 2,581 samples to reach 95% of its peak performance (Efficiency Score: 12.29). Evo2 followed, requiring 4,461 samples (Efficiency Score: 10.94). HyenaDNA required 10,303 samples, approaching the data requirements of the baseline. FusionAI (Baseline) required 14,200 samples to reach its convergence threshold.
DNABERT2 was the least efficient, requiring 21,768 samples to stabilize its (comparatively lower) performance.

Table 2. Data efficiency benchmark comparing embedding models to the reimplemented FusionAI baseline (trained for 1,000 epochs).

Model      Samples (at 95%)  Accuracy (at 95%)  Efficiency Score (at 95%)
FusionAI   14,200            0.849              9.35
NT         2,581             0.918              12.29
Evo        4,461             0.874              10.94
Hyena      10,303            0.814              9.27
BERT       21,768            0.643              6.78

Discussion

To the best of our knowledge, this study presents the first comprehensive benchmark of genomic foundation models for gene fusion detection in DNA sequences. Our results show that using pre-trained embeddings from advanced GFMs, such as the Nucleotide Transformer and Evo2, significantly improves accuracy compared to dedicated deep learning protocols like FusionAI, while requiring a fraction of the computational resources and training data.

The Nucleotide Transformer produced the most distinct class separation, forming dense, non-overlapping clusters of fusion-positive and fusion-negative sequences. This high-quality representation resulted in superior classification performance (accuracy: 0.967), which remained consistent regardless of whether a simple support vector machine (SVM) or neural network was used as the classification head. Evo2 followed closely behind, consistently surpassing the baseline and demonstrating that models incorporating evolutionary information are highly effective at characterizing structural genomic events.

A notable finding is that DNABERT2 was unable to compete in this specific benchmark. It showed significant class overlap and lower accuracy (0.677–0.723). This contrasts with recent literature, in which DNABERT2 and its predecessor serve as successful backbones for various genomic tasks [18, 19, 21, 27]. For example, DNABERT2 has been adapted for viral lineage classification in ViraLM [28], for bacterial genome decoding, and for competing against protein language models in downstream protein tasks [29].
The discrepancy between these successes and our findings suggests that the utility of specific GFMs depends heavily on the task at hand. Although DNABERT2’s Byte Pair Encoding and pre-training objectives excel at capturing semantic patterns in viral or bacterial genomes, these objectives may not produce sufficiently discriminative embeddings for the specific structural features that characterize human gene fusions within a 10 kbp window without extensive fine-tuning.

HyenaDNA, which is designed to process context lengths of up to one million tokens, showed mixed results. Although it outperformed the baseline when paired with an SVM, it did not achieve the same level of precision as NT or Evo2 in our "frozen embedding" setting. These results are similar to those from the NextVir study [30], which benchmarked GFMs for oncoviral classification. The NextVir authors also noted that base models provided decent results with simple adapters but that maximizing performance often required fine-tuning strategies, such as Low-Rank Adaptation (LoRA), particularly for Hyena-based models.

However, HyenaDNA's potential for structural tasks remains strong. The HyenaCircle model [31] recently demonstrated that HyenaDNA-based architectures can predict extrachromosomal circular DNA (eccDNA) effectively from sequence data. Since eccDNA formation shares mechanistic similarities with gene fusions, which involve DNA breaks and re-ligation, it is likely that HyenaDNA captures the relevant signal, but that this signal is harder for simple classifiers to disentangle than the representations produced by NT or Evo2. Furthermore, HyenaDNA’s single-nucleotide resolution has proven advantageous in viral taxonomy (e.g., ViTax) [32], as it can handle high mutation rates better than k-mer-based approaches can. Future work on gene fusions should explore full fine-tuning or LoRA adapters to unlock HyenaDNA's full potential for this task.
It is also important to contextualize our work within the broader field of fusion detection, in particular relative to RNA-based approaches. Recent machine learning efforts have primarily focused on identifying chimeric RNAs from RNA-seq data. For instance, transformer-based classifiers for chimeric reads have achieved success with DNABERT-based architectures [33]. In contrast, our approach detects fusions at the DNA level, which is crucial in clinical scenarios where RNA samples are degraded or unavailable. By replacing complex feature engineering with high-quality pretrained embeddings, we provide a more streamlined alternative that complements existing RNA-based methods.

Computational and Sample Efficiency

A critical advantage of the GFM-based workflow is its dramatic increase in efficiency. Training time decreased from about 40 hours for the FusionAI baseline to less than 10 minutes for our foundation model approach. Additionally, our analysis of sample efficiency underscores the "few-shot" capabilities of robust GFMs. The Nucleotide Transformer required only ~2,600 samples to reach 95% of its peak performance, whereas the baseline required over 14,000 samples. This makes GFM-based approaches particularly promising for rare disease studies or clinical scenarios where large annotated datasets are scarce.

In conclusion, the landscape of genomic foundation models is rapidly expanding with the emergence of models like PathoLM [34] and Embed-Search-Align [35] for diverse tasks, and novel frameworks for unsupervised embedding evaluation [36]. However, our benchmark shows that Nucleotide Transformer and Evo2 provide the most robust "out-of-the-box" representations for human gene fusion detection. These models offer a superior balance of accuracy, speed, and data efficiency, paving the way for scalable, AI-driven genomic diagnostics.
Limitations and Future Directions

To ensure a rigorous and reproducible comparison with established baselines, this study used the curated FusionAI benchmark dataset. This controlled environment enabled us to isolate the contribution of foundation model embeddings to classification performance rather than confounding the results with sequencing artifacts and variable coverage, which are often present in raw whole-genome sequencing (WGS) data. Although the current evaluation focused on binary classification to validate the discriminative power of these embeddings, this lays the groundwork for more granular structural analysis.

Building on these findings, our future work will extend this framework to precisely identify fusion breakpoints and type fusion partners. Breakpoint localization typically requires extensive annotated datasets to prevent overfitting, but our analysis of sample efficiency offers a promising approach. We demonstrated that models such as Nucleotide Transformer and Evo2 converge with significantly fewer samples than traditional baselines. This high learning efficiency suggests that training advanced heads for coordinate regression or token-level segmentation is computationally feasible, even with smaller, high-quality datasets available in clinical settings. Thus, we can bridge the gap between foundation models and precision diagnostics.

Conclusion

This study presents the first comprehensive benchmark of genomic foundation models (GFMs) for the direct detection of gene fusions from DNA sequences. Our results show that using pre-trained embeddings from advanced models such as the Nucleotide Transformer and Evo2 significantly improves detection accuracy compared with specialised deep learning protocols such as FusionAI. Specifically, the Nucleotide Transformer achieved an accuracy of 0.967, followed by Evo2 with an accuracy of 0.920; both models outperformed the baseline accuracy of 0.894.
In addition to superior performance, the GFM-based approach offers dramatic computational efficiency. The training time was reduced from approximately 40 hours required by the original protocol to under 10 minutes. Further analysis of sample efficiency highlighted the robustness of these models: the Nucleotide Transformer required only ~2,600 samples to reach 95% of its peak performance, whereas the baseline method needed over 14,000 samples. This characteristic is particularly valuable for clinical applications and rare disease research, where large annotated datasets are often unavailable. Although certain models, such as DNABERT2, were unable to compete in this specific task, the overall findings confirm that GFMs offer a scalable and highly effective alternative to traditional methods. Future work should focus on extending these capabilities and integrating these efficient workflows into clinical genomic diagnostics.

Abbreviations

ACC: Accuracy; AUC: Area Under the Curve; BERT: Bidirectional Encoder Representations from Transformers; BPE: Byte Pair Encoding; CNN: Convolutional Neural Network; DL: Deep Learning; DNA: Deoxyribonucleic acid; FN: False Negative; FP: False Positive; GFM: Genomic Foundation Model; MCC: Matthews Correlation Coefficient; ML: Machine Learning; NLP: Natural Language Processing; NN: Neural Network; NT: Nucleotide Transformer; SVM: Support Vector Machine; TN: True Negative; TP: True Positive.

Declarations

Ethics approval and consent to participate: Not applicable. This study utilizes a publicly available benchmark dataset (FusionAI) [11] and does not involve human participants, human data, or animal experiments.

Consent for publication: Not applicable.

Availability of data and materials: The datasets analyzed during the current study and learned models are available in the Zenodo repository: https://zenodo.org/records/17898581. The source code is available at https://github.com/kbi-fbmi/articles--2026fusionEmbBenchmark .
Competing interests: The authors declare that they have no competing interests.

Funding: This work was supported by the Grant Agency of the Czech Technical University in Prague, grant No. SGS25/188/OHK4/3T/17.

Authors' contributions: RK conceived and designed the study, analyzed and interpreted the data, created the software used in the work, and drafted the manuscript. MK analyzed and interpreted the data and drafted the manuscript. BD acquired the data, created the software used in the work, and substantively revised the manuscript. KK acquired and analyzed the data and created the software used in the work. OD conceived and designed the study and substantively revised the manuscript. All authors read and approved the final manuscript.

Acknowledgements: Computational resources were provided by the e-INFRA CZ project (ID:90254), supported by the Ministry of Education, Youth and Sports of the Czech Republic.

References

1. Foltz SM, Gao Q, Yoon CJ, et al. Evolution and structure of clinically relevant gene fusions in multiple myeloma. Nat Commun. 2020;11:2666.
2. Liu SV, Nagasaka M, Atz J, Solca F, Müllauer L. Oncogenic gene fusions in cancer: from biology to therapy. Signal Transduct Target Ther. 2025;10:111.
3. Bao Z, Chai R, Liu X, Wang J. Fusion genes as diagnostic and predictive biomarkers for tumor. Glob Transl Med. 2022;1:1–12.
4. Ahmed J, Torrado C, Chelariu A, Kim S-H, Ahnert JR. Fusion Challenges in Solid Tumors: Shaping the Landscape of Cancer Care in Precision Medicine. JCO Precis Oncol. 2024:e2400038.
5. Creason A, Haan D, Dang K, et al. A community challenge to evaluate RNA-seq, fusion detection, and isoform quantification methods for cancer discovery. Cell Syst. 2021;12:827–838.e5.
6. Uhrig S, Ellermann J, Walther T, et al. Accurate and efficient detection of gene fusions from RNA sequencing data. Genome Res. 2021;31:448–60.
7. Haas BJ, Dobin A, Li B, Stransky N, Pochet N, Regev A. Accuracy assessment of fusion transcript detection via read-mapping and de novo fusion transcript assembly-based methods. Genome Biol. 2019;20:213.
8. Apostolides M, Jiang Y, Husić M, Siddaway R, Hawkins C, Turinsky AL, Brudno M, Ramani AK. MetaFusion: a high-confidence metacaller for filtering and prioritizing RNA-seq gene fusion candidates. Bioinformatics. 2021;37:3144–51.
9. Jin Z, Huang W, Shen N, Li J, Wang X, Dong J, Park PJ, Xi R. Single-cell gene fusion detection by scFusion. Nat Commun. 2022;13:1084.
10. Hsieh G, Bierman R, Szabo L, Lee AG, Freeman DE, Watson N, Sweet-Cordero EA, Salzman J. Statistical algorithms improve accuracy of gene fusion detection. Nucleic Acids Res. 2017;45:e126.
11. Kim P, Tan H, Liu J, Kumar H, Zhou X. FusionAI, a DNA-sequence-based deep learning protocol reduces the false positives of human fusion gene prediction. STAR Protoc. 2022;3:101185.
12. Lovino M, Urgese G, Macii E, Di Cataldo S, Ficarra E. A Deep Learning Approach to the Screening of Oncogenic Gene Fusions in Humans. Int J Mol Sci. 2019;20:1645.
13. Lovino M, Ciaburri MS, Urgese G, Di Cataldo S, Ficarra E. DEEPrior: a deep learning tool for the prioritization of gene fusions. Bioinformatics. 2020;36:3248–50.
14. Ching T, Himmelstein DS, Beaulieu-Jones BK, et al. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface. 2018;15:20170387.
15. Eraslan G, Avsec Ž, Gagneur J, Theis FJ. Deep learning: new computational modelling techniques for genomics. Nat Rev Genet. 2019;20:389–403.
16. Guo F, Guan R, Li Y, Liu Q, Wang X, Yang C, Wang J. Foundation models in bioinformatics. Natl Sci Rev. 2025;12:nwaf028.
17. Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics. 2021;37:2112–20.
18. Li Y, Wei X, Yang Q, Xiong A, Li X, Zou Q, Cui F, Zhang Z. msBERT-Promoter: a multi-scale ensemble predictor based on BERT pre-trained model for the two-stage prediction of DNA promoters and their strengths. BMC Biol. 2024;22:126.
19. Wang C, He Z, Jia R, Pan S, Coin LJ, Song J, Li F. PLANNER: A Multi-Scale Deep Language Model for the Origins of Replication Site Prediction. IEEE J Biomed Health Inf. 2024;28:2445–54.
20. Wang K, Zeng X, Zhou J, Liu F, Luan X, Wang X. BERT-TFBS: a novel BERT-based model for predicting transcription factor binding sites by transfer learning. Brief Bioinform. 2024;25:bbae195.
21. Long W, Su H, Xiong J, Zhang Y. MutBERT: probabilistic genome representation improves genomics foundation models. Bioinformatics. 2025;41:i294–303.
22. Dalla-Torre H, Gonzalez L, Mendoza-Revilla J, et al. Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Nat Methods. 2025;22:287–97.
23. Liang L, Chen Y, Wang T, et al. Genetic Transformer: An Innovative Large Language Model Driven Approach for Rapid and Accurate Identification of Causative Variants in Rare Genetic Diseases. 2024. https://doi.org/10.1101/2024.07.18.24310666
24. Nguyen E, Poli M, Faizi M, et al. HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution. 2023. https://doi.org/10.48550/ARXIV.2306.15794
25. Brixi G, Durrant MG, Ku J, et al. Genome modeling and design across all domains of life with Evo 2. 2025. https://doi.org/10.1101/2025.02.18.638918
26. Krupicka R. Embedding and benchmarks results for fusionAI dataset. 2025. https://doi.org/10.5281/ZENODO.17898581
27. Akotenou G, El Allali A. Genomic language models (gLMs) decode bacterial genomes for improved gene prediction and translation initiation site identification. Brief Bioinform. 2025;26:bbaf311.
28. Peng C, Shang J, Guan J, Wang D, Sun Y. ViraLM: empowering virus discovery through the genome foundation model. Bioinformatics. 2024;40:btae704.
29. Boshar S, Trop E, De Almeida BP, Copoiu L, Pierrot T. Are genomic language models all you need? Exploring genomic language models on protein downstream tasks. Bioinformatics. 2024;40:btae529.
30. Robertson J, Consul S, Vikalo H. NextVir: Enabling classification of tumor-causing viruses with genomic foundation models. PLOS Comput Biol. 2025;21:e1013360.
31. Li F, Lu W, Bai Y. HyenaCircle: a HyenaDNA-based pretrained large language model for long eccDNA prediction. Front Genet. 2025;16:1641162.
32. He Y, Zhou F, Bai J, Gao Y, Huang X, Wang Y. ViTax: adaptive hierarchical viral taxonomy classification with a taxonomy belief tree on a foundation model. Brief Bioinform. 2024;26:bbaf041.
33. Bonizzoni P, De Felice C, Pirola Y, Rizzi R, Zaccagnino R, Zizza R. Identification of Chimeric RNAs: A Novel Machine Learning Perspective. In: Bansal MS, Chen W, Khudyakov Y, Măndoiu II, Moussa MR, Patterson M, Rajasekaran S, Skums P, Thankachan SV, Zelikovsky A, editors. Comput. Adv. Bio Med. Sci. Cham: Springer Nature Switzerland; 2025. pp. 14–26.
34. Dip SA, Shuvo UA, Chau T, Song H, Choi P, Wang X, Zhang L. PathoLM: Identifying pathogenicity from the DNA sequence through the Genome Foundation Model. 2024. https://doi.org/10.1101/2024.06.18.599629
35. Holur P, Enevoldsen KC, Rajesh S, Mboning L, Georgiou T, Bouchard L-S, Pellegrini M, Roychowdhury V. Embed-Search-Align: DNA sequence alignment using Transformer models. Bioinformatics. 2025;41:btaf041.
36. Awasthi R, Mend Mend Arachchige GS, Zhu X. Unsupervised evaluation of pre-trained DNA language model embeddings. BMC Genomics. 2025;26:710.

Additional Declarations

No competing interests reported.
Editorial decision: Revision requested 12 Jan, 2026
Reviews received at journal 03 Jan, 2026
Reviews received at journal 30 Dec, 2025
Reviews received at journal 30 Dec, 2025
Reviewers agreed at journal 25 Dec, 2025
Reviewers agreed at journal 23 Dec, 2025
Reviews received at journal 23 Dec, 2025
Reviewers agreed at journal 21 Dec, 2025
Reviewers agreed at journal 21 Dec, 2025
Reviewers agreed at journal 21 Dec, 2025
Reviewers agreed at journal 19 Dec, 2025
Reviewers invited by journal 19 Dec, 2025
Editor assigned by journal 19 Dec, 2025
Submission checks completed at journal 16 Dec, 2025
First submitted to journal 14 Dec, 2025
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-8360344","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":563276564,"identity":"61456d10-c5e3-415f-a16d-d5466bae4c58","order_by":0,"name":"Radim Krupička","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABD0lEQVRIiWNgGAWjYHACgwMMDAlAmrGBsYHBht8AyCBJS5rkBqgWCXxaGCBaQJoYDgO1QABOLebthzceusGQlm9wu7nx44ya8xLmYocbGH62Ha5jYD/+AJsWmTNpBYdzGHIsN9w52Cy54dhtCcvZiQ2MvW2HJRh4EhKwaZFgyDEAaqkwMLiR2Mb4gO12ncHtxAZmhjNALRIMB7Bq4X+DrOXfOQkkLdiDTkICbEsORMvGtgNQLRUgLcxYvS8h8QzoF4M0A0mQX2b2JYP9crCnIl2yjScNuxb+5M2fcyqSDfhutz/82PPNTsJcOv3hgx8G1vz8OEIMAgwYUOMB7G023Oph9hFUMQpGwSgYBSMVAABwr2QVyhaZRgAAAABJRU5ErkJggg==","orcid":"","institution":"Czech Technical University in Prague","correspondingAuthor":true,"prefix":"","firstName":"Radim","middleName":"","lastName":"Krupička","suffix":""},{"id":563276567,"identity":"3a2103bb-c77c-4cad-8c1f-c95295716081","order_by":1,"name":"Mariana Komárková","email":"","orcid":"","institution":"Czech Technical University in Prague","correspondingAuthor":false,"prefix":"","firstName":"Mariana","middleName":"","lastName":"Komárková","suffix":""},{"id":563276568,"identity":"e32cef29-dffd-4d16-a3a3-acfefef0734a","order_by":2,"name":"Bohuslav Dvorský","email":"","orcid":"","institution":"Czech Technical University in 
Prague","correspondingAuthor":false,"prefix":"","firstName":"Bohuslav","middleName":"","lastName":"Dvorský","suffix":""},{"id":563276569,"identity":"dd5a5a54-f390-4b25-8f42-3c40979f4ac0","order_by":3,"name":"Kateřina Kollinová","email":"","orcid":"","institution":"Czech Technical University in Prague","correspondingAuthor":false,"prefix":"","firstName":"Kateřina","middleName":"","lastName":"Kollinová","suffix":""},{"id":563276570,"identity":"e356ff0c-065b-457b-b8f3-130e6dd60feb","order_by":4,"name":"Ondřej Klempíř","email":"","orcid":"","institution":"Czech Technical University in Prague","correspondingAuthor":false,"prefix":"","firstName":"Ondřej","middleName":"","lastName":"Klempíř","suffix":""}],"badges":[],"createdAt":"2025-12-14 21:53:11","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-8360344/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-8360344/v1","draftVersion":[],"editorialEvents":[{"content":"https://doi.org/10.1186/s13040-026-00553-1","type":"published","date":"2026-04-06T15:57:09+00:00"}],"editorialNote":"","failedWorkflow":false,"files":[{"id":99309890,"identity":"96b6ea0a-9ac2-4bed-aa78-ff872afad4ce","added_by":"auto","created_at":"2025-12-31 16:11:20","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":10529171,"visible":true,"origin":"","legend":"","description":"","filename":"BenchmarkingGFMforGeneFusionDetectionfromDNASequences.docx","url":"https://assets-eu.researchsquare.com/files/rs-8360344/v1/f4a73e9cb6a0a3725f144fa3.docx"},{"id":98894303,"identity":"76f86f52-5e39-4fc5-abec-a9d9d149ca28","added_by":"auto","created_at":"2025-12-23 
17:03:44","extension":"json","order_by":1,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":7856,"visible":true,"origin":"","legend":"","description":"","filename":"bf360f9ae7244eeab9e3f02eaa78bfcb.json","url":"https://assets-eu.researchsquare.com/files/rs-8360344/v1/3ae73aed693293b6fe7b60c8.json"},{"id":98894305,"identity":"204d8330-f5e1-434f-b6a5-8093425aab21","added_by":"auto","created_at":"2025-12-23 17:03:44","extension":"xml","order_by":2,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":99719,"visible":true,"origin":"","legend":"","description":"","filename":"bf360f9ae7244eeab9e3f02eaa78bfcb1enriched.xml","url":"https://assets-eu.researchsquare.com/files/rs-8360344/v1/b9a2c1449c5f45852e0fc29e.xml"},{"id":98894308,"identity":"562de495-ac27-4db6-af5d-ae5044111403","added_by":"auto","created_at":"2025-12-23 17:03:44","extension":"png","order_by":3,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":794474,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-8360344/v1/29218b9c395d013d142a4550.png"},{"id":98894304,"identity":"a38d1807-7b23-4ba3-8871-0bc781054f5a","added_by":"auto","created_at":"2025-12-23 17:03:44","extension":"png","order_by":4,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":436692,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-8360344/v1/63af9b2a871d0efd43802e9d.png"},{"id":98894307,"identity":"051f9f40-a070-4062-bd26-3c9311a89605","added_by":"auto","created_at":"2025-12-23 
17:03:44","extension":"png","order_by":5,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":105403,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-8360344/v1/193605ef887c898e3fa2bbe9.png"},{"id":98894306,"identity":"472bbbb4-5aa9-42fc-99fd-f98003c35314","added_by":"auto","created_at":"2025-12-23 17:03:44","extension":"png","order_by":6,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":110663,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-8360344/v1/030fb173eadbb3117efb30dc.png"},{"id":98894311,"identity":"5f3b4b84-4d33-4f8a-8164-d9cca19e166f","added_by":"auto","created_at":"2025-12-23 17:03:44","extension":"xml","order_by":7,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":98574,"visible":true,"origin":"","legend":"","description":"","filename":"bf360f9ae7244eeab9e3f02eaa78bfcb1structuring.xml","url":"https://assets-eu.researchsquare.com/files/rs-8360344/v1/d4ba05ebdbb139308cb7bdc9.xml"},{"id":98894309,"identity":"a96a4ff7-45d8-4e1c-86de-f5f61f5baeac","added_by":"auto","created_at":"2025-12-23 17:03:44","extension":"html","order_by":8,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":107167,"visible":true,"origin":"","legend":"","description":"","filename":"earlyproof.html","url":"https://assets-eu.researchsquare.com/files/rs-8360344/v1/3780c5d5f844a647104bfa4e.html"},{"id":99309301,"identity":"9bef2584-6d1e-4e35-b5aa-7d1a35f62059","added_by":"auto","created_at":"2025-12-31 16:10:03","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":794474,"visible":true,"origin":"","legend":"\u003cp\u003eDiagram comparing FusionAI and Genomic Foundation Models for gene fusion breakpoint (BP) 
classification.\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-8360344/v1/3bdae7a889a0d2360f6ff827.png"},{"id":98894301,"identity":"16a1c7a5-fe4a-4f07-ae7e-1c2c3c933ce7","added_by":"auto","created_at":"2025-12-23 17:03:44","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":436692,"visible":true,"origin":"","legend":"\u003cp\u003eComparative analysis of embedding models versus a reimplemented FusionAI classifier for gene fusion detection A) t-SNE visualization of embedding spaces: class separability of fusion-positive and fusion-negative sequences across genomic foundation models. B) Training dynamics (left) and comparative accuracy (right). C) Sample efficiency analysis: Scaling of model accuracy with training data size (left) and log-transformed normalized learning curves with fitted regression lines indicating the 95% convergence threshold (right).\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-8360344/v1/47cc9be3a7bcc1f526f0f4af.png"},{"id":106808753,"identity":"70c09785-e0cf-42fb-bbd6-5b91406b9a54","added_by":"auto","created_at":"2026-04-13 16:00:29","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1819583,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8360344/v1/eb522164-eea6-4217-99f0-90695d07711a.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Benchmarking Genomic Foundation Models for Gene Fusion Detection from DNA Sequences","fulltext":[{"header":"Introduction","content":"\u003cp\u003eGene fusions are genetic alterations that typically arise from large-scale structural changes in the DNA, leading to the joining of two previously non-adjacent genes\u003csup\u003e\u003cspan citationid=\"CR1\" 
class=\"CitationRef\"\u003e1\u003c/span\u003e, \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e. These fusions can result in the production of abnormal proteins or cause dysregulation of gene expression. Aberrant gene fusions are implicated in the pathogenesis of various cancers, including hematologic malignancies such as leukemia, as well as numerous solid tumors. They serve as important diagnostic, prognostic, and therapeutic biomarkers. Fusion events are most commonly detected through RNA sequencing because it captures expressed fusion transcripts. DNA sequencing can also be used, but it is chosen less frequently because of its higher cost and lower sensitivity to expressed fusions\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e. Both approaches face significant technical challenges, including degraded RNA samples, variability in library preparation, high computational demands, and data noise, all of which can contribute to false positives and false negatives\u003csup\u003e\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e, \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eStandard bioinformatic tools, such as Arriba\u003csup\u003e\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u003c/sup\u003e and STAR-Fusion\u003csup\u003e\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e, have been widely adopted to address these analytical challenges. While these tools offer high specificity, they often lack robustness and generalizability when dealing with variable or low-quality data\u003csup\u003e\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e, \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u003c/sup\u003e. 
Additionally, these tools require expert-driven parameter tuning to perform optimally\u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e, which hinders their scalability for large research cohorts and clinical applications\u003csup\u003e\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eThese limitations have driven the current trend toward machine learning (ML) and deep learning (DL) models, which aim to reliably extract fusion signals from complex, noisy data with minimal manual intervention\u003csup\u003e\u003cspan additionalcitationids=\"CR12\" citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e–\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u003c/sup\u003e. However, developing DL models traditionally presents significant challenges, including the need for substantial labeled training data and substantial domain expertise. These are common bottlenecks in genomics\u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e, \u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eThe recent emergence of genomic foundation models (GFMs), which are based on large language model architectures, offers a potential solution\u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e. GFMs are pre-trained on vast, pangenome-scale datasets and can be fine-tuned for specific downstream tasks. They reportedly achieve high accuracy even with limited data.\u003c/p\u003e \u003cp\u003eA pioneering example is DNABERT\u003csup\u003e\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u003c/sup\u003e, which adapts the BERT (Bidirectional Encoder Representations from Transformers) architecture to genomic sequences. 
DNABERT has demonstrated strong performance in predicting promoter regions, splice sites, and transcription factor binding sites. Building on this foundation, subsequent studies have fine-tuned or extended the model for various tasks. These include msBERT-Promoter\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e, a model for DNA promoter identification and strength estimation; PLANNER\u003csup\u003e\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e, a model for predicting origin of replication sites; BERT-TFBS\u003csup\u003e\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u003c/sup\u003e, a model for predicting transcription factor binding sites; and MutBERT\u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e, which uses a probabilistic genome representation.\u003c/p\u003e \u003cp\u003eA more recent and versatile model family is the Nucleotide Transformer\u003csup\u003e\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e, designed to predict molecular phenotypes directly from DNA sequences. Trained on 3,202 diverse human genomes and 850 non-human genomes, it leverages multi-task learning and transfer learning to overcome limitations posed by scarce annotated data. Its applications include functional element detection, chromatin accessibility analysis, and variant prioritization. Emerging models also target specific biological domains. For example, the Genetic Transformer\u003csup\u003e\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u003c/sup\u003e identifies causal variants in rare diseases, while HyenaDNA\u003csup\u003e\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e\u003c/sup\u003e captures long-range dependencies and processes DNA sequences up to one million nucleotides long. 
The Evo2 model\u003csup\u003e\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u003c/sup\u003e extends HyenaDNA's capabilities by incorporating evolutionary conservation data into its architecture. This enables better prediction of the functional impact of genetic variants by improving the modeling of sequence conservation and long-range dependencies. While these models show great promise across various genomic tasks, their application to gene fusion detection remains unexplored.\u003c/p\u003e \u003cp\u003eBecause pre-trained GFMs encode rich biological features into vector representations, we hypothesize that these embeddings can be used as inputs for a simple classifier that can distinguish between fusion-positive and fusion-negative sequences. This approach may offer improved accuracy and training efficiency while requiring significantly less labeled data. To test this hypothesis, this article compares the performance of current foundation models. We used the dataset generated from the FusionAI study\u003csup\u003e\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u003c/sup\u003e (previously used for deep learning) to explore whether GFMs could improve performance on this existing benchmark.\u003c/p\u003e"},{"header":"Methods","content":"\u003cp\u003eFor model evaluation, we used the dataset introduced and described in FusionAI study\u003csup\u003e\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u003c/sup\u003e, comprising approximately 26K fusion-positive and 26K fusion-negative sequences. From the total of ~ 52K sequences, we used ~ 36K (~ 18K positive and ~ 18K negative) for training, identical to the training set in the original study. The remaining ~ 16K sequences were evenly divided into validation (~ 8K) and test sets (~ 8K), while preserving the same ratio of positive and negative samples. 
We kept all data partitions consistent across experiments to ensure a fair comparison between models.\u003c/p\u003e\u003cp\u003eEach data sample consisted of a 10 kbp sequence surrounding the fusion breakpoint in each of gene 1 (sequence 1) and gene 2 (sequence 2). Each sequence was encoded using four widely used genomic foundation models based on the transformer architecture: Nucleotide Transformer\u003csup\u003e\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e, HyenaDNA\u003csup\u003e\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e\u003c/sup\u003e, Evo2\u003csup\u003e25\u003c/sup\u003e, and DNABERT2\u003csup\u003e17\u003c/sup\u003e. For each model, we tokenized the sequences using the corresponding tokenizer and extracted embeddings from the hidden states of an internal layer selected according to the authors' recommendations, specifically:\u003c/p\u003e\u003cp\u003e \u003cstrong\u003eNucleotide Transformer (NT)\u003c/strong\u003e \u003c/p\u003e\u003cp\u003eFor embedding generation, the Nucleotide Transformer 500M_multi_species_v2 model was used. The Nucleotide Transformer was configured with a maximum sequence length of 1,671 tokens, and 1024-dimensional embeddings were extracted from the 20th transformer layer (out of 24 total layers). Given the model's 6-mer tokenization scheme, this creates a receptive field of approximately 10,026 bp, which fully encompasses the 10 kbp input sequences centered on the gene fusion breakpoints. This configuration allowed the model to process the entire region of interest in a single pass without the need for truncation or sliding window strategies.\u003c/p\u003e\u003cp\u003e \u003cstrong\u003eEvo2 (Evo)\u003c/strong\u003e \u003c/p\u003e\u003cp\u003eFor the Evo2 model, sequences were encoded using the evo2_7b checkpoint. Through byte-level tokenization, each character in the DNA string was converted to its corresponding UTF-8 integer value. 
Embeddings were subsequently extracted from the blocks.28.mlp.l3 internal layer, where the embedding dimension was 4096. Unlike k-mer-based approaches, Evo2 utilizes a byte-level tokenizer where each nucleotide maps directly to a single token, preserving single-base resolution. Leveraging the model's StripedHyena architecture designed for long-range genomic modeling, we processed the input sequences at their full 10 kbp length (10,000 tokens) without truncation or downsampling.\u003c/p\u003e\u003cp\u003e \u003cstrong\u003eHyenaDNA (Hyena)\u003c/strong\u003e \u003c/p\u003e\u003cp\u003eThe “hyenadna-large-1m-seqlen-hf” model was used with 1\u0026nbsp;million token index, which employs a character-level tokenizer where each nucleotide corresponds to a single token. HyenaDNA's sub-quadratic operator allowed for efficient processing of the full input at single-nucleotide resolution without the need for downsampling or token aggregation. Embeddings were extracted from the model's final hidden layer, where the embedding dimension was 256.\u003c/p\u003e\u003cp\u003e \u003cstrong\u003eDNABERT2 (BERT)\u003c/strong\u003e \u003c/p\u003e\u003cp\u003eFor the BERT-based approach, the zhihan1996/DNABERT-2-117M model was utilized. DNABERT-2 employs Byte Pair Encoding (BPE), which results in variable token counts for fixed-length DNA sequences. To enable consistent batch processing and preserve spatial alignment, we standardized all input tensors to a fixed length of 2,143 tokens. This threshold was empirically determined to accommodate the maximum tokenized length of any 10 kbp sequence in our dataset. We implemented a symmetric padding strategy, adding special tokens to both ends of shorter sequences. DNA sequences were tokenized using the model's own BPE tokenizer. 
The final embeddings were then extracted from the hidden states of the last transformer layer, where the embedding dimension was 768.\u003c/p\u003e\u003cp\u003eFrom each sequence, only the middle embedding was used, which corresponds to the fusion breakpoint. Due to the contextual nature of the embeddings, this central embedding also encodes information from the surrounding sequence. We concatenated the middle embeddings from both sequences and used the resulting vector for classification.\u003c/p\u003e\u003cp\u003eWe visualized embedding quality using t-distributed Stochastic Neighbor Embedding (t-SNE) with perplexity = 30 and 1,000 iterations on a random subset of 1,000 samples. Class separability was qualitatively assessed by visual inspection of the 2D projections.\u003c/p\u003e\u003ch3\u003eClassification\u003c/h3\u003e\u003cp\u003eFor classification, we used two classifiers, following the FusionAI article\u003csup\u003e\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u003c/sup\u003e: (i) a support vector machine (SVM) with an RBF kernel, C = 1.0, and γ computed automatically as 1/(n_features × variance), using fixed hyperparameters across all experiments rather than values tuned via cross-validation; and (ii) a fully connected neural network (NN) with an architecture adopted from the FusionAI article\u003csup\u003e\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u003c/sup\u003e. We also reimplemented the entire CNN architecture as described in FusionAI\u003csup\u003e\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u003c/sup\u003e to allow for a direct comparison (see Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). The NN classifier is a feedforward neural network with one hidden layer containing 32 neurons with ReLU activation, followed by dropout (p = 0.4) and a softmax output layer for binary classification. 
The model was compiled using the Adadelta optimizer with categorical cross-entropy loss. Training was performed with a batch size of 256. The neural networks were trained for up to 1000 epochs to ensure training stability and convergence.\u003c/p\u003e\u003ch2\u003eEvaluation Metrics\u003c/h2\u003e\u003cp\u003eWe evaluated model performance on the test set using accuracy (percentage of correct predictions), precision (weighted average across classes), recall (weighted average across classes), F1 score (weighted average), and area under the ROC curve (AUC-ROC, macro-averaged for multi-class). For neural networks, we report the final-epoch performance; for SVMs, we report the results of the single training run.\u003c/p\u003e\u003ch3\u003eSample efficiency\u003c/h3\u003e\u003cp\u003eTo assess sample efficiency, we trained models on 19 stratified training subsets ranging from 200 to 36,302 samples. We fit logarithmic functions (y = a·ln(x) + b) to the resulting learning curves and computed: (1) the sample size required to reach 95% of final accuracy (Samples@95%) by inverting the fitted curve: x₉₅ = exp((0.95·y_final - b) / a), and (2) a logarithmic efficiency score (Efficiency@95%), defined as y_final / ln(x₉₅), which quantifies accuracy achieved per natural logarithm unit of training data. Higher efficiency scores indicate models that reach high performance with logarithmically fewer samples. Accuracy values were normalized to 0-100% of each model's maximum performance for cross-model visualization.\u003c/p\u003e\u003ch3\u003eImplementation details\u003c/h3\u003e\u003cp\u003eAll models were implemented in Python 3.11 using Keras 3.0 with the PyTorch backend. Neural network training used the Adadelta optimizer with default Keras parameters (learning_rate = 1.0, rho = 0.95, epsilon = 1e-07). SVM models were trained using scikit-learn 1.4 with default parameters unless otherwise specified. 
Random sampling and data splitting used a fixed seed (42) to ensure reproducibility across different training set sizes. Training was performed on NVIDIA H100 GPUs with 96GB memory. All source code and notebooks with results used for comparison are available at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/kbi-fbmi/articles--2026fusionEmbBenchmark/\u003c/span\u003e\u003cspan address=\"https://github.com/kbi-fbmi/articles--2026fusionEmbBenchmark/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. The data and results files are available on Zenodo\u003csup\u003e\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003eVisual Assessment of Embedding Quality\u003c/p\u003e \u003cp\u003eWe visualized the embeddings of fusion-positive and fusion-negative sequences using t-SNE to qualitatively assess whether genomic foundation models (GFMs) capture features relevant to gene fusion detection. The resulting projections revealed differences in class separability across the evaluated models (see Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eClassification Performance\u003c/p\u003e \u003cp\u003eWe evaluated the efficacy of these embeddings using two classifiers, an SVM and a neural network, and compared them against the FusionAI baseline model, which is a dedicated deep learning protocol. 
Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e summarizes the results on the full test set (~\u0026thinsp;8K samples), and Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e shows training convergence.\u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eNucleotide Transformer\u003c/b\u003e achieved the highest overall performance, significantly outperforming the baseline. Using the SVM classifier, NT achieved an accuracy of 0.967 and an F1 score of 0.967, whereas \u003cb\u003eFusionAI\u003c/b\u003e achieved an accuracy of 0.894. The neural network classifier yielded nearly identical results for NT (accuracy: 0.966; ROC AUC: 0.994), demonstrating the robustness of these embeddings regardless of the classification method.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eEvo2\u003c/b\u003e was the second-best performer, consistently surpassing the FusionAI baseline. Both the support vector machine (SVM) and neural network (NN) classifiers achieved an accuracy and F1 score of 0.920 and a ROC AUC of 0.970\u0026ndash;0.975.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eHyenaDNA\u003c/b\u003e produced mixed results. When paired with a simple neural network, its performance was lower than the baseline (accuracy: 0.857 vs. 0.894). However, using an SVM improved HyenaDNA\u0026rsquo;s performance, achieving an accuracy of 0.900 and slightly surpassing the FusionAI baseline.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eDNABERT2\u003c/b\u003e failed to compete with the other foundation models or the baseline. 
Its accuracy ranged from 0.677 (NN) to 0.723 (SVM), and its ROC AUC was significantly lower at 0.745\u0026ndash;0.799.\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eComparative performance of embedding models versus the reimplemented FusionAI baseline on the full test set.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"7\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eClassifier\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eAccuracy\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003ePrecision\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eRecall\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eF1 Score\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e 
\u003cp\u003eROC AUC\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFusionAI\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003enn\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.894\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.894\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.894\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.894\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.960\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003enn\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e0.966\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e0.966\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e0.966\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e\u003cb\u003e0.966\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e\u003cb\u003e0.994\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003esvm\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e0.967\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e0.972\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e0.962\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e\u003cb\u003e0.967\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e\u003cb\u003e0.995\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eEvo\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003enn\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.920\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.920\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.920\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.920\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.970\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003esvm\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.920\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.920\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.920\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.920\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.975\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHyena\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003enn\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.857\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.858\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.857\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.857\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.936\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003esvm\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.900\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.880\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.925\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.902\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.962\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBERT\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003enn\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.677\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.678\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.677\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.676\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.745\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003esvm\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.723\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.737\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.690\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.713\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.799\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eComputational Efficiency\u003c/p\u003e \u003cp\u003eA major advantage of the GFM-based approach observed in this study is the dramatic reduction in training time. The original FusionAI protocol required approximately 40 hours to train for 1,000 epochs to ensure stability. In contrast, the foundation model workflow, comprising embedding extraction and training of the lightweight classifiers, was completed in under 10 minutes.\u003c/p\u003e\n\u003ch3\u003eSample Efficiency\u003c/h3\u003e\n\u003cp\u003eTo evaluate how well the models perform in data-scarce regimes, we analyzed learning curves across training subsets ranging from 200 to ~\u0026thinsp;36,000 samples. 
We calculated the number of samples required to reach 95% of the model's final accuracy (Samples@95%) and a logarithmic efficiency score (see Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e and Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eNucleotide Transformer\u003c/b\u003e demonstrated superior sample efficiency, requiring only 2,581 samples to reach 95% of its peak performance (Efficiency Score: 12.29).\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eEvo2\u003c/b\u003e followed, requiring 4,461 samples (Efficiency Score: 10.94).\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eHyenaDNA\u003c/b\u003e required 10,303 samples, approaching the data requirements of the baseline.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eFusionAI\u003c/b\u003e (Baseline) required 14,200 samples to reach its convergence threshold.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eDNABERT2\u003c/b\u003e was the least efficient, requiring 21,768 samples to stabilize its (comparatively lower) performance.\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eData efficiency benchmark comparing embedding models to the reimplemented FusionAI baseline (trained for 1,000 epochs).\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv 
align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSamples (at 95%)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eAccuracy (at 95%)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eEfficiency Score (at 95%)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFusionAI\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e14,200\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.849\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e9.35\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eNT\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e2,581\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e0.918\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e12.29\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eEvo\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e4,461\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e0.874\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e10.94\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eHyena\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e10,303\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.814\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e9.27\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBert\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e21,768\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.643\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e6.78\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eTo the best of our knowledge, this study presents the first comprehensive benchmark of genomic foundation models for gene fusion detection in DNA sequences. Our results show that using pre-trained embeddings from advanced GFMs, such as the Nucleotide Transformer and Evo2, significantly improves accuracy compared to dedicated deep learning protocols like FusionAI, while requiring a fraction of the computational resources and training data.\u003c/p\u003e \u003cp\u003eThe Nucleotide Transformer produced the most distinct class separation, forming dense, non-overlapping clusters of fusion-positive and fusion-negative sequences. 
This high-quality representation resulted in superior classification performance (accuracy: 0.967), which remained consistent regardless of whether a simple support vector machine (SVM) or a neural network was used as the classification head. Evo2 followed, consistently surpassing the baseline and suggesting that models incorporating evolutionary information are effective at characterizing structural genomic events.\u003c/p\u003e \u003cp\u003eA notable finding is that DNABERT2 performed poorly in this specific benchmark. It showed substantial class overlap and low accuracy (0.677\u0026ndash;0.723). This contrasts with recent literature, in which DNABERT2 and its predecessor serve as successful backbones for various genomic tasks\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e, \u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e, \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e, \u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e\u003c/sup\u003e. For example, DNABERT2 has been adapted for viral lineage classification in ViraLM\u003csup\u003e\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e\u003c/sup\u003e, applied to bacterial genome decoding, and shown to compete with protein language models in downstream protein tasks\u003csup\u003e\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eThe discrepancy between these successes and our findings suggests that the utility of specific GFMs depends heavily on the task at hand. 
Although DNABERT2\u0026rsquo;s Byte Pair Encoding and pre-training objectives excel at capturing semantic patterns in viral or bacterial genomes, these objectives may not produce sufficiently discriminative embeddings for the specific structural features that characterize human gene fusions within a 10 kbp window without extensive fine-tuning.\u003c/p\u003e \u003cp\u003eHyenaDNA is designed to process context lengths up to one million tokens and showed mixed results. Although it outperformed the baseline when paired with an SVM, it did not achieve the same level of precision as NT or Evo2 in our \"frozen embedding\" setting. These results are similar to those from the NextVir study\u003csup\u003e\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e\u003c/sup\u003e, which benchmarked GFMs for oncoviral classification. The NextVir authors also noted that base models provided decent results with simple adapters but that maximizing performance often required fine-tuning strategies, such as Low-Rank Adaptation (LoRA), particularly for Hyena-based models.\u003c/p\u003e \u003cp\u003eHowever, HyenaDNA's potential for structural tasks remains strong. The HyenaCircle model\u003csup\u003e\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e\u003c/sup\u003e recently demonstrated that HyenaDNA-based architectures can predict extrachromosomal circular DNA (eccDNA) effectively from sequence data. Since eccDNA formation shares mechanistic similarities with gene fusions, which involve DNA breaks and re-ligation, it is likely that HyenaDNA captures the relevant signal. However, this signal is more complex for linear classifiers to disentangle than the representations produced by NT or Evo2. 
Furthermore, HyenaDNA\u0026rsquo;s single-nucleotide resolution has proven advantageous in viral taxonomy (e.g., ViTax)\u003csup\u003e\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e\u003c/sup\u003e, as it handles high mutation rates better than k-mer-based approaches. Future work on gene fusions should explore full fine-tuning or LoRA adapters to unlock HyenaDNA's full potential for this task.\u003c/p\u003e \u003cp\u003eIt is also important to contextualize our work within the broader field of fusion detection, particularly relative to RNA-based approaches. Recent machine learning efforts have primarily focused on identifying chimeric RNAs from RNA-seq data. For instance, transformer-based classifiers for chimeric reads have achieved success with DNABERT-based architectures\u003csup\u003e\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e\u003c/sup\u003e. In contrast, our approach detects fusions directly at the DNA level, which is crucial in clinical scenarios where RNA samples are degraded or unavailable. By replacing complex feature engineering with high-quality pretrained embeddings, we provide a more streamlined alternative that complements existing RNA-based methods.\u003c/p\u003e\n\u003ch3\u003eComputational and Sample Efficiency\u003c/h3\u003e\n\u003cp\u003eA critical advantage of the GFM-based workflow is its dramatic increase in efficiency. Training time decreased from about 40 hours for the FusionAI baseline to less than 10 minutes for our foundation model approach. Additionally, our analysis of sample efficiency underscores the \"few-shot\" capabilities of robust GFMs. The Nucleotide Transformer required only\u0026thinsp;~\u0026thinsp;2,600 samples to reach 95% of its peak performance, whereas the baseline required over 14,000 samples. 
This makes GFM-based approaches particularly promising for rare disease studies or clinical scenarios where large annotated datasets are scarce.\u003c/p\u003e \u003cp\u003eIn conclusion, the landscape of genomic foundation models is rapidly expanding with the emergence of models like PathoLM\u003csup\u003e\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e\u003c/sup\u003e and Embed-Search-Align\u003csup\u003e\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e\u003c/sup\u003e for diverse tasks, and novel frameworks for unsupervised embedding evaluation\u003csup\u003e\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e\u003c/sup\u003e. However, our benchmark shows that Nucleotide Transformer and Evo2 provide the most robust \"out-of-the-box\" representations for human gene fusion detection. These models offer a superior balance of accuracy, speed, and data efficiency, paving the way for scalable, AI-driven genomic diagnostics.\u003c/p\u003e\n\u003ch3\u003eLimitations and Future Directions\u003c/h3\u003e\n\u003cp\u003eTo ensure a rigorous and reproducible comparison with established baselines, this study used the curated FusionAI benchmark dataset. This controlled environment enabled us to isolate the contribution of foundation model embeddings to classification performance rather than confounding the results with sequencing artifacts and variable coverage, which are often present in raw whole-genome sequencing (WGS) data. Although the current evaluation focused on binary classification to validate the discriminative power of these embeddings, this lays the groundwork for more granular structural analysis.\u003c/p\u003e \u003cp\u003eBuilding on these findings, our future work will extend this framework to precisely identify fusion breakpoints and type fusion partners. 
Breakpoint localization typically requires extensive annotated datasets to prevent overfitting, but our analysis of sample efficiency offers a promising approach. We demonstrated that models such as Nucleotide Transformer and Evo2 converge with significantly fewer samples than traditional baselines. This high learning efficiency suggests that training advanced heads for coordinate regression or token-level segmentation is computationally feasible, even with smaller, high-quality datasets available in clinical settings. Thus, we can bridge the gap between foundation models and precision diagnostics.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eThis study presents the first comprehensive benchmark of genomic foundation models (GFMs) for the direct detection of gene fusions from DNA sequences. Our results show that using pre-trained embeddings from advanced models such as the Nucleotide Transformer and Evo2 significantly improves detection accuracy compared with specialised deep learning protocols such as FusionAI. Specifically, the Nucleotide Transformer achieved an accuracy of 0.967, followed by Evo2 with an accuracy of 0.920; both models outperformed the baseline accuracy of 0.894.\u003c/p\u003e \u003cp\u003eIn addition to superior performance, the GFM-based approach offers dramatic computational efficiency. The training time was reduced from approximately 40 hours required by the original protocol to under 10 minutes. Further analysis of sample efficiency highlighted the robustness of these models: the Nucleotide Transformer required only\u0026thinsp;~\u0026thinsp;2,600 samples to reach 95% of its peak performance, whereas the baseline method needed over 14,000 samples. 
This characteristic is particularly valuable for clinical applications and rare disease research, where large annotated datasets are often unavailable.\u003c/p\u003e \u003cp\u003eAlthough certain models, such as DNABERT2, were unable to compete in this specific task, the overall findings confirm that GFMs offer a scalable and highly effective alternative to traditional methods. Future work should focus on extending these capabilities and integrating these efficient workflows into clinical genomic diagnostics.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cp\u003eACC: Accuracy;\u0026nbsp;\u003cbr\u003e\u0026nbsp;AUC: Area Under the Curve;\u0026nbsp;\u003cbr\u003e\u0026nbsp;BERT: Bidirectional Encoder Representations from Transformers;\u0026nbsp;\u003cbr\u003e\u0026nbsp;BPE: Byte Pair Encoding;\u0026nbsp;\u003cbr\u003e\u0026nbsp;CNN: Convolutional Neural Network;\u0026nbsp;\u003cbr\u003e\u0026nbsp;DL: Deep Learning;\u0026nbsp;\u003cbr\u003e\u0026nbsp;DNA: Deoxyribonucleic Acid;\u0026nbsp;\u003cbr\u003e\u0026nbsp;FN: False Negative;\u0026nbsp;\u003cbr\u003e\u0026nbsp;FP: False Positive;\u0026nbsp;\u003cbr\u003e\u0026nbsp;GFM: Genomic Foundation Model;\u0026nbsp;\u003cbr\u003e\u0026nbsp;MCC: Matthews Correlation Coefficient;\u0026nbsp;\u003cbr\u003e\u0026nbsp;ML: Machine Learning;\u0026nbsp;\u003cbr\u003e\u0026nbsp;NLP: Natural Language Processing;\u0026nbsp;\u003cbr\u003e\u0026nbsp;NN: Neural Network;\u0026nbsp;\u003cbr\u003e\u0026nbsp;NT: Nucleotide Transformer;\u0026nbsp;\u003cbr\u003e\u0026nbsp;SVM: Support Vector Machine;\u0026nbsp;\u003cbr\u003e\u0026nbsp;TN: True Negative;\u0026nbsp;\u003cbr\u003e\u0026nbsp;TP: True Positive.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate:\u003c/strong\u003e Not applicable. 
This study uses a publicly available benchmark dataset (FusionAI)\u003csup\u003e11\u003c/sup\u003e and does not involve human participants, human data, or animal experiments.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for publication:\u003c/strong\u003e Not applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of data and materials:\u003c/strong\u003e The datasets analyzed during the current study and the trained models are available in the Zenodo repository: https://zenodo.org/records/17898581. The source code is available at https://github.com/kbi-fbmi/articles--2026fusionEmbBenchmark.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests:\u003c/strong\u003e The authors declare that they have no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding:\u003c/strong\u003e This work was supported by the Grant Agency of the Czech Technical University in Prague, grant No. SGS25/188/OHK4/3T/17.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors\u0026apos; contributions:\u003c/strong\u003e RK conceived and designed the study, analyzed and interpreted the data, created the software used in the work, and drafted the manuscript. MK analyzed and interpreted the data and drafted the manuscript. BD acquired the data, created the software used in the work, and substantively revised the manuscript. KK acquired and analyzed the data and created the software used in the work. OD conceived and designed the study and substantively revised the manuscript. 
All authors read and approved the final manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements:\u003c/strong\u003e Computational resources were provided by the e-INFRA CZ project (ID: 90254), supported by the Ministry of Education, Youth and Sports of the Czech Republic.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eFoltz SM, Gao Q, Yoon CJ, et al. Evolution and structure of clinically relevant gene fusions in multiple myeloma. Nat Commun. 2020;11:2666.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu SV, Nagasaka M, Atz J, Solca F, M\u0026uuml;llauer L. Oncogenic gene fusions in cancer: from biology to therapy. Signal Transduct Target Ther. 2025;10:111.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBao Z, Chai R, Liu X, Wang J. Fusion genes as diagnostic and predictive biomarkers for tumor. Glob Transl Med. 2022;1:1\u0026ndash;12.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAhmed J, Torrado C, Chelariu A, Kim S-H, Ahnert JR. Fusion Challenges in Solid Tumors: Shaping the Landscape of Cancer Care in Precision Medicine. JCO Precis Oncol. 2024:e2400038.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCreason A, Haan D, Dang K, et al. A community challenge to evaluate RNA-seq, fusion detection, and isoform quantification methods for cancer discovery. Cell Syst. 2021;12:827\u0026ndash;838.e5.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eUhrig S, Ellermann J, Walther T, et al. Accurate and efficient detection of gene fusions from RNA sequencing data. Genome Res. 2021;31:448\u0026ndash;60.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHaas BJ, Dobin A, Li B, Stransky N, Pochet N, Regev A. Accuracy assessment of fusion transcript detection via read-mapping and de novo fusion transcript assembly-based methods. Genome Biol. 
2019;20:213.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eApostolides M, Jiang Y, Husić M, Siddaway R, Hawkins C, Turinsky AL, Brudno M, Ramani AK. MetaFusion: a high-confidence metacaller for filtering and prioritizing RNA-seq gene fusion candidates. Bioinformatics. 2021;37:3144\u0026ndash;51.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJin Z, Huang W, Shen N, Li J, Wang X, Dong J, Park PJ, Xi R. Single-cell gene fusion detection by scFusion. Nat Commun. 2022;13:1084.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHsieh G, Bierman R, Szabo L, Lee AG, Freeman DE, Watson N, Sweet-Cordero EA, Salzman J. Statistical algorithms improve accuracy of gene fusion detection. Nucleic Acids Res. 2017;45:e126\u0026ndash;126.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKim P, Tan H, Liu J, Kumar H, Zhou X. FusionAI, a DNA-sequence-based deep learning protocol reduces the false positives of human fusion gene prediction. STAR Protoc. 2022;3:101185.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLovino M, Urgese G, Macii E, Di Cataldo S, Ficarra E. A Deep Learning Approach to the Screening of Oncogenic Gene Fusions in Humans. Int J Mol Sci. 2019;20:1645.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLovino M, Ciaburri MS, Urgese G, Di Cataldo S, Ficarra E. DEEPrior: a deep learning tool for the prioritization of gene fusions. Bioinformatics. 2020;36:3248\u0026ndash;50.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChing T, Himmelstein DS, Beaulieu-Jones BK, et al. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface. 2018;15:20170387.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eEraslan G, Avsec Ž, Gagneur J, Theis FJ. Deep learning: new computational modelling techniques for genomics. Nat Rev Genet. 
2019;20:389\u0026ndash;403.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGuo F, Guan R, Li Y, Liu Q, Wang X, Yang C, Wang J. Foundation models in bioinformatics. Natl Sci Rev. 2025;12:nwaf028.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJi Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics. 2021;37:2112\u0026ndash;20.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi Y, Wei X, Yang Q, Xiong A, Li X, Zou Q, Cui F, Zhang Z. msBERT-Promoter: a multi-scale ensemble predictor based on BERT pre-trained model for the two-stage prediction of DNA promoters and their strengths. BMC Biol. 2024;22:126.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang C, He Z, Jia R, Pan S, Coin LJ, Song J, Li F. PLANNER: A Multi-Scale Deep Language Model for the Origins of Replication Site Prediction. IEEE J Biomed Health Inf. 2024;28:2445\u0026ndash;54.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang K, Zeng X, Zhou J, Liu F, Luan X, Wang X. BERT-TFBS: a novel BERT-based model for predicting transcription factor binding sites by transfer learning. Brief Bioinform. 2024;25:bbae195.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLong W, Su H, Xiong J, Zhang Y. MutBERT: probabilistic genome representation improves genomics foundation models. Bioinformatics. 2025;41:i294\u0026ndash;303.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDalla-Torre H, Gonzalez L, Mendoza-Revilla J, et al. Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Nat Methods. 2025;22:287\u0026ndash;97.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiang L, Chen Y, Wang T et al. (2024) Genetic Transformer: An Innovative Large Language Model Driven Approach for Rapid and Accurate Identification of Causative Variants in Rare Genetic Diseases. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1101/2024.07.18.24310666\u003c/span\u003e\u003cspan address=\"10.1101/2024.07.18.24310666\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNguyen E, Poli M, Faizi M et al. (2023) HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.48550/ARXIV.2306.15794\u003c/span\u003e\u003cspan address=\"10.48550/ARXIV.2306.15794\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBrixi G, Durrant MG, Ku J et al. (2025) Genome modeling and design across all domains of life with Evo 2. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1101/2025.02.18.638918\u003c/span\u003e\u003cspan address=\"10.1101/2025.02.18.638918\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKrupicka R. (2025) Embedding and benchmarks results for fusionAI dataset. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.5281/ZENODO.17898581\u003c/span\u003e\u003cspan address=\"10.5281/ZENODO.17898581\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAkotenou G, El Allali A. Genomic language models (gLMs) decode bacterial genomes for improved gene prediction and translation initiation site identification. Brief Bioinform. 2025;26:bbaf311.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePeng C, Shang J, Guan J, Wang D, Sun Y. ViraLM: empowering virus discovery through the genome foundation model. Bioinformatics. 
2024;40:btae704.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBoshar S, Trop E, De Almeida BP, Copoiu L, Pierrot T. Are genomic language models all you need? Exploring genomic language models on protein downstream tasks. Bioinformatics. 2024;40:btae529.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRobertson J, Consul S, Vikalo H. NextVir: Enabling classification of tumor-causing viruses with genomic foundation models. PLOS Comput Biol. 2025;21:e1013360.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLi F, Lu W, Bai Y. HyenaCircle: a HyenaDNA-based pretrained large language model for long eccDNA prediction. Front Genet. 2025;16:1641162.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHe Y, Zhou F, Bai J, Gao Y, Huang X, Wang Y. ViTax: adaptive hierarchical viral taxonomy classification with a taxonomy belief tree on a foundation model. Brief Bioinform. 2024;26:bbaf041.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBonizzoni P, De Felice C, Pirola Y, Rizzi R, Zaccagnino R, Zizza R. Identification of Chimeric RNAs: A Novel Machine Learning Perspective. In: Bansal MS, Chen W, Khudyakov Y, Măndoiu II, Moussa MR, Patterson M, Rajasekaran S, Skums P, Thankachan SV, Zelikovsky A, editors. Comput. Adv. Bio Med. Sci. Cham: Springer Nature Switzerland; 2025. pp. 14\u0026ndash;26.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDip SA, Shuvo UA, Chau T, Song H, Choi P, Wang X, Zhang L. (2024) PathoLM: Identifying pathogenicity from the DNA sequence through the Genome Foundation Model. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1101/2024.06.18.599629\u003c/span\u003e\u003cspan address=\"10.1101/2024.06.18.599629\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHolur P, Enevoldsen KC, Rajesh S, Mboning L, Georgiou T, Bouchard L-S, Pellegrini M, Roychowdhury V. Embed-Search-Align: DNA sequence alignment using Transformer models. Bioinformatics. 2025;41:btaf041.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAwasthi R, Mend Mend Arachchige GS, Zhu X. Unsupervised evaluation of pre-trained DNA language model embeddings. BMC Genomics. 2025;26:710.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"biodata-mining","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"bidm","sideBox":"Learn more about [BioData Mining](http://biodatamining.biomedcentral.com/)","snPcode":"13040","submissionUrl":"https://submission.nature.com/new-submission/13040/3","title":"BioData Mining","twitterHandle":"@BioMedCentral","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"BMC/SO AJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"genomic foundation models, gene fusion detection, DNA sequence analysis, transformer models, Nucleotide Transformer, Evo2, bioinformatics 
benchmarking","lastPublishedDoi":"10.21203/rs.3.rs-8360344/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8360344/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003e\u003cb\u003eBackground\u003c/b\u003e\u003c/p\u003e \u003cp\u003eGene fusions are critical drivers of oncogenesis and diagnostic biomarkers in various cancers. However, their detection from RNA or DNA sequencing, when performed using traditional analytical methods, encounters challenges related to sample quality, computational complexity, and noise. Although deep learning is more robust, it usually requires large labeled datasets and substantial training resources. Genomic foundation models (GFMs), which are pre-trained on pangenome-scale data, offer a promising solution to these issues.\u003c/p\u003e\u003cp\u003e\u003cb\u003eMethods\u003c/b\u003e\u003c/p\u003e \u003cp\u003eThis study presents the first comprehensive benchmark of four transformer-based GFMs, Nucleotide Transformer, Evo2, HyenaDNA, and DNABERT2, for gene fusion detection. Using the curated FusionAI dataset of ~\u0026thinsp;52,000 sequences, we extracted embeddings from 10-kilobase-pair (kbp) DNA sequences surrounding fusion breakpoints. We evaluated the quality of these representations qualitatively using t-SNE visualization and quantitatively by training lightweight classifiers (Support Vector Machines and simple Neural Networks) on the fixed embeddings.\u003c/p\u003e\u003cp\u003e\u003cb\u003eResults\u003c/b\u003e\u003c/p\u003e \u003cp\u003eThe Nucleotide Transformer achieved the best performance with an accuracy of 0.967 and an F1 score of 0.967. This result outperformed the dedicated deep learning baseline (FusionAI, with an accuracy of 0.894). Evo2 was the second-best performer (accuracy: 0.920), demonstrating robustness derived from evolutionary pretraining. Conversely, DNABERT2 failed to compete (accuracy 0.677\u0026ndash;0.723). 
Furthermore, sample efficiency analysis revealed that the Nucleotide Transformer required only\u0026thinsp;~\u0026thinsp;2,600 samples to reach 95% of its peak performance, whereas the baseline required over 14,000 samples.\u003c/p\u003e\u003cp\u003e\u003cb\u003eConclusions\u003c/b\u003e\u003c/p\u003e \u003cp\u003eThese findings demonstrate that advanced GFMs, particularly the NT and Evo2 models, generate highly discriminative 'out-of-the-box' embeddings. These embeddings significantly outperform dedicated deep learning baselines while requiring a fraction of the training data and computational time. This suggests that GFMs could be a scalable, data-efficient way of developing precise genomic diagnostic tools, particularly for rare diseases.\u003c/p\u003e","manuscriptTitle":"Benchmarking Genomic Foundation Models for Gene Fusion Detection from DNA Sequences","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-12-23 17:03:39","doi":"10.21203/rs.3.rs-8360344/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision 
requested","date":"2026-01-13T02:09:09+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-01-03T18:18:16+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-12-31T01:06:27+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-12-30T08:09:27+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"92725873313449640778479819864132855256","date":"2025-12-25T16:49:54+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"115100208620248992772423001252785148918","date":"2025-12-23T12:47:12+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-12-23T06:30:24+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"154001903470654834587488425591155059384","date":"2025-12-22T00:39:36+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"274721611379329050384334929820236368342","date":"2025-12-22T00:03:56+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"281712973549824241446357181550325279039","date":"2025-12-21T23:50:05+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"179070121154106481413816641761486264388","date":"2025-12-20T01:28:28+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2025-12-19T17:45:30+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2025-12-19T17:37:00+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2025-12-17T01:47:18+00:00","index":"","fulltext":""},{"type":"submitted","content":"BioData Mining","date":"2025-12-14T21:38:36+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"biodata-mining","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"bidm","sideBox":"Learn more about [BioData 
Mining](http://biodatamining.biomedcentral.com/)","snPcode":"13040","submissionUrl":"https://submission.nature.com/new-submission/13040/3","title":"BioData Mining","twitterHandle":"@BioMedCentral","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"BMC/SO AJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"eeda42d5-91d4-428b-850a-becc136ac26d","owner":[],"postedDate":"December 23rd, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[],"tags":[],"updatedAt":"2026-04-13T15:59:44+00:00","versionOfRecord":{"articleIdentity":"rs-8360344","link":"https://doi.org/10.1186/s13040-026-00553-1","journal":{"identity":"biodata-mining","isVorOnly":false,"title":"BioData Mining"},"publishedOn":"2026-04-06 15:57:09","publishedOnDateReadable":"April 6th, 2026"},"versionCreatedAt":"2025-12-23 17:03:39","video":"","vorDoi":"10.1186/s13040-026-00553-1","vorDoiUrl":"https://doi.org/10.1186/s13040-026-00553-1","workflowStages":[]},"version":"v1","identity":"rs-8360344","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8360344","identity":"rs-8360344","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
