DeepCis: A Transformer-based Multi-Omics Framework for Cross-Scale Explainability in Gene Regulatory Element Identification

Research Square

Xinrong Li, Wei Feng, Junjie Duan, Yujie Liu, Yusen Zhao, Shuangshuang Wang, and 8 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-8988269/v1. This work is licensed under a CC BY 4.0 License. Status: Under Review. Version 1 posted 6

Abstract

Dysregulation of gene expression is a hallmark of cancer, wherein cis-regulatory elements (CREs) in non-coding genomic regions play pivotal roles in precise transcriptional control. Although numerous candidate CREs (cCREs) have been identified, the complex regulatory relationships between cCREs and their target genes remain poorly characterized, and current computational approaches are inadequate for large-scale prediction of functional cCREs.
Here, we introduce DeepCis, a Transformer-based multi-omics framework that integrates base-level genomic sequence and epigenomic features, as well as region-level 3D chromatin organization, to capture regulatory information across multiple scales. DeepCis quantitatively assesses the contribution of individual cCREs to gene regulation and reveals the cooperative interactions among cCREs. Our framework demonstrates robust performance in gene expression prediction, effectively identifying key cCREs and their specific contributions to transcriptional output. The model captures both proximal dominance and distal activity of cCREs and elucidates how 3D chromatin organization constrains gene regulation. Furthermore, DeepCis successfully identifies transcription factor binding sites associated with active regulation. Through comprehensive multi-level interpretability analyses, we validate the reliability and effectiveness of DeepCis in deciphering gene regulation mechanisms, providing a powerful foundation for advancing our understanding of transcriptional regulation and enabling applications in precision medicine.

Subject areas: Biological sciences/Computational biology and bioinformatics; Biological sciences/Genetics

Keywords: DeepCis; Cis-regulatory elements; Multi-omics; Gene regulation

1. Introduction

Gene expression dysregulation represents a fundamental characteristic of cancer pathogenesis. CREs, including promoters, enhancers, silencers, and insulators, reside in non-coding genomic regions and precisely orchestrate transcription by recruiting transcription factors (TFs) and chromatin-associated proteins in a spatiotemporal manner [1, 2, 3]. Disruption of these CREs through genetic mutations or epigenetic alterations can lead to aberrant transcriptional programs and cancer progression [4, 5].
Gene regulation is inherently multifactorial, involving synergistic contributions from DNA sequence, chromatin accessibility, histone modifications, and higher-order 3D chromatin organization [6, 7]. Recent advances in high-throughput sequencing and large-scale consortia such as ENCODE and 4D Nucleome have identified millions of cCREs [8]. However, determining which cCREs functionally influence specific genes, and quantifying their individual contributions to transcriptional output, remains a fundamental challenge in genomics [9, 10]. Experimental approaches such as CRISPR-Cas9 perturbations can validate regulatory activity at individual loci with high precision, but suffer from limitations in scalability, cost, and throughput, and fail to capture cooperative or compensatory interactions among cCREs across large genomic distances [11]. Chromatin conformation techniques (Hi-C [12], ChIA-PET [13], HiChIP [14], and PLAC-seq [15]) provide physical contact maps between regulatory regions and promoters, but physical proximity does not necessarily equate to functional regulation.

Computational models have attempted to infer enhancer-gene relationships from sequence, chromatin accessibility, or 3D chromatin organization. The Activity-by-Contact (ABC) model conceptualizes enhancer influence as the product of enhancer activity and contact frequency but assumes linear additivity and may fail to account for non-linear cooperative regulation [16]. Deep learning models trained on binary enhancer-promoter interactions or CRISPR screens are constrained by limited positive samples and poor cross-cell-type generalizability [17–19]. Basenji2 employs convolutional neural networks to capture local dependencies, but it remains limited in scope and may not effectively capture long-range dependencies [20]. Enformer captures long-range interactions from sequence alone but struggles with consistent performance in unseen cellular contexts [21].
GraphReg integrates chromatin interactions with multi-omics features but retains bias toward local connectivity [22]. CREaTor introduces zero-shot modeling of regulatory patterns from sequence or epigenomics but lacks integration of spatial chromatin organization and offers limited biological interpretation of key cCREs [23].

Thus, a critical unmet need exists for a unified, quantitative, and interpretable framework that (i) integrates genomic, epigenomic, and 3D chromatin organization across multiple scales, (ii) operates without predefined positive or negative regulatory labels, (iii) quantifies the individual and cooperative contributions of each cCRE to gene expression, and (iv) maintains biological interpretability. Here, we present DeepCis, a Transformer-based multi-omics framework that integrates DNA sequence, base-resolution epigenomic signals, and condensed spatial chromatin organization features derived from Hi-C data. DeepCis employs a zero-shot learning strategy and a two-stage encoder architecture to capture both local sequence determinants and global cCRE-gene interactions. Crucially, we introduce a Regulatory Contribution Score (RCS) to quantitatively assess each cCRE’s contribution to transcriptional output. Through multi-level interpretability analyses, we demonstrate that DeepCis robustly identifies key cCREs across cell types, captures both proximal dominance and distal enhancer activity, identifies transcription factor binding sites and cooperative regulation, reveals 3D chromatin constraints on gene regulation, and discovers hub cCREs overlapping super-enhancers. Collectively, DeepCis provides a scalable and interpretable strategy for decoding cis-regulatory architectures, offering mechanistic insights and potential applications in disease research and precision medicine.

2. Results

2.1. Overview of the DeepCis Framework

DeepCis comprises two cascaded Transformer encoders that integrate genomic sequence, epigenomic signals, and 3D chromatin organization features to quantitatively model cCRE-gene regulatory relationships and coordinative interactions among cCREs in a supervised learning framework. Initially, the base information of cCREs is processed through numerical and one-hot encoding. Epigenomic signals are then incorporated to extract base-level signal features, which are subsequently normalized. For 3D chromatin organization, we construct cCRE contact matrices and perform singular value decomposition (SVD) for dimensionality reduction, obtaining condensed spatial features for each cCRE. This process preserves core spatial information, capturing the relative positional relationships and synergistic regulatory contacts of cCREs in 3D chromatin organization (Fig. 1a, Methods).

The DeepCis architecture (Fig. 1b, Methods) incorporates four core modules: the input module, the feature mapping module, the feature fusion encoding module, and the interaction representation encoding module. This architecture enables extraction and fusion of complementary information types to decipher complex gene regulatory mechanisms. The feature mapping module processes each omics data type, converting raw features into high-dimensional representations suitable for integration. The feature fusion encoding module processes cCRE and target gene features, focusing on cross-scale feature fusion to generate abstract representations for each cCRE or gene. We employ distinct fusion strategies for features at different granularities. At the base-pair level, DeepCis concatenates encoded genomic sequence features with the corresponding epigenomic features because they provide complementary information: the genomic sequence supplies genetic encoding information, while the epigenomic features indicate active state and regulatory potential.
This concatenated feature is downsampled and passed through the first encoder to obtain initial representations of cCREs and gene regions. Relative positional bias is incorporated to capture relative base positions within the sequence, enhancing the model's understanding of internal sequence information. Spatial embedding representations (Methods) are then fused with the initial representation, leveraging the complementary nature of spatial, sequence, and epigenomic information.

The interaction representation encoding module integrates features from each target gene and its corresponding cCREs for global regulation, modeling relationships between different cCREs and between cCREs and genes. This module identifies key functional cCREs (enhancers and promoters) and transcription factor binding sites by quantifying cCRE contributions through the Regulatory Contribution Score (RCS) (Methods), providing a basis for biological interpretation. Based on abstract features extracted in previous modules, features of cCREs related to the target gene are integrated with the gene’s features to create a unified representation optimized for predicting gene expression. To capture precise regulatory relationships, we introduce dual masks based on biological priors (Methods). All DeepCis modules are jointly trained in an end-to-end framework.

DeepCis offers several advantages. First, it integrates comprehensive multi-omics data (genomic sequences, epigenomic signals, and 3D chromatin organization), enhancing the accuracy and richness of cCRE representations. Second, it employs a flexible feature fusion mechanism that adapts strategies based on data scale to effectively combine various omics features, improving model expressiveness. Through spatial representations, DeepCis captures the relative positioning of cCREs in 3D chromatin organization, identifying potential synergistic regulatory relationships.
The multi-head attention mechanism of the Transformer enables capture of long-range interactions between cCREs and between cCREs and genes. The model also introduces dual mask constraints (Methods). Finally, DeepCis excels in identifying key CREs, recognizing known enhancers, promoters, and transcription factor binding sites, while uncovering potential novel cCREs, providing insights into gene regulatory mechanisms (Fig. 1c).

2.2. Robust gene expression prediction by DeepCis

We evaluated DeepCis performance across cell lines and chromosomes. Independent tests were conducted on the test set from chromosome 6 in four cell lines and on the unseen HCT116 cell line. DeepCis achieved an overall Pearson correlation coefficient of 0.752 on the test set, with a mean absolute error of 0.708 (Fig. 2b). In the K562 cell line, the Pearson correlation for chromosome 6 reached 0.795 (Fig. 2c). In the unseen HCT116 cell line, the average Pearson correlation across all chromosomes was 0.720, with the highest correlation of 0.747 on chromosome 22 (Fig. 2d). These results demonstrate the strong generalization capability of DeepCis across different biological contexts.

To assess the advantages of DeepCis, we first tested models trained with single-omic data on the test set. Pearson correlation coefficients for single-omic models were consistently lower than those of DeepCis (Supplementary Fig. S1a). Notably, Pearson correlation coefficients for single-omic models all exceeded 0.50, indicating that each individual omic data type contains significant predictive information for gene expression, validating the necessity and effectiveness of multi-omics integration. We compared different dimensionalities of condensed spatial features and found that 32-dimensional features achieved the optimal balance between retaining key information and reducing training complexity (Fig. 2b).
Among the fusion strategies compared, concatenating genomic and epigenomic features before adding spatial embedding features yielded the highest model accuracy (Fig. 2a). Model predictive performance remained stable even after removing subsets of features, with only minor changes in Pearson correlation coefficient (Fig. 2e, Supplementary Fig. S1b), indicating robustness in handling epigenomic features.

2.3. Quantification of cCREs contributions and identification of key cCREs by DeepCis

To validate whether high-RCS cCREs predicted by DeepCis correspond to key cCREs in actual regulatory processes, we examined the enrichment of enhancers and promoters in the high and low regulatory contribution groups (Methods). We obtained known enhancer data from the FANTOM5 database for the GM12878, HepG2, K562, and HeLaS3 cell lines and identified promoter regions (TSS ± 1 kb) for each gene. Results show significant enrichment of enhancers and promoters in the high regulatory contribution group (p < 0.001) (Fig. 3a), with an average fold enrichment of 2 in the high-RCS group compared with the low-RCS group (Fig. 3b). This confirms that the attention scores learned by DeepCis correlate strongly with biological functions of gene regulation, providing a solid basis for identifying key cCREs.

To provide direct functional evidence, we evaluated DeepCis on an established benchmark dataset of CRISPR/Cas9-validated enhancers and non-enhancers from the K562 cell line, assessing the association between RCSs and functional cCREs, with performance quantified by the AUC. Prior to evaluation, we normalized predicted RCSs within the same gene to eliminate inter-gene variability and highlight relative regulatory strength. Results show close alignment between model-predicted RCSs and functional enhancers, with an AUC of 0.88 (95% CI: 0.87–0.89) (Fig. 3c). Across chromosomes, the highest AUC reached 0.96 (chr4) (Supplementary Fig. S2a), demonstrating effective identification of functionally relevant cCREs. We further tested the selected optimal epigenomic feature combinations (Fig. 3d, Methods), integrating DNA sequence and spatial features. Performance remained stable and high across different numbers of retained epigenomic features (AUC: 0.86–0.90) (Fig. e-i), demonstrating that DeepCis is robust and not heavily dependent on epigenomic features. Notably, H3K79me2, POLR2A, and DNase emerged as key markers influencing the predictive accuracy of gene expression regulation in our model (Supplementary Fig. S2b, Methods), with performance improving as more feature combinations were included in the selection process (Supplementary Fig. S2c). DNase hypersensitive regions mark open chromatin states, indicating enhanced accessibility for transcription factor binding. Previous studies established a functional regulatory role for H3K79me2 through a co-transcriptional splicing mechanism [24]. The importance of POLR2A in cancer is well established: it has been identified as a "synthetic vulnerability" target in triple-negative breast cancer [25] and drives abnormal tumor cell proliferation in esophageal cancer in cooperation with Bclaf1 [26]. Comparison with other methods for identifying key cCREs demonstrated the superior performance of DeepCis (Fig. 3j, Methods).

2.4. DeepCis captures proximal dominance and distal activity of cCREs

Previous studies established that the regulatory efficacy of cCREs is significantly influenced by proximity to target genes, exhibiting a clear distance effect [27–29]. To systematically evaluate whether DeepCis captures the distance dependency of cCRE regulation on gene expression, we analyzed the relationship between the RCSs of cCREs learned by the model and their distance to the target gene across five cell lines. Results consistently showed a significant negative correlation between RCSs and distance to the target gene (Spearman correlation coefficient from -0.723 to -0.698, p < 0.01) (Fig. 4a), indicating that the model captures the distance decay effects commonly observed in genomic regulation, consistent with prior studies.

However, accumulating evidence supports critical roles of distal enhancers in complex gene regulatory networks [30, 31]. To assess the model’s ability to identify distal regulation, we further analyzed the types of cCREs in the high regulatory contribution group. In the GM12878 cell line, proximal enhancer-like signatures (pELS) accounted for 32.2%, promoter-like signatures (PLS) for 19.0%, distal enhancer-like signatures (dELS) for 15.6%, and CTCF-only and CTCF-bound cCREs for 12.6%. To minimize the potential for random variation, we validated results across all chromosomes in each cell line, observing consistent trends (Fig. 4b, Supplementary Fig. S3a-S3d). These findings demonstrate that the model integrates both proximal regulatory dominance and key long-range regulatory effects, overcoming simple distance bias.

2.5. DeepCis accounts for 3D chromatin organization impact on gene regulation

The genome is organized in intricate three-dimensional structures through complex chromatin folding, imposing critical constraints on precise gene regulation [6]. Topologically Associating Domains (TADs) represent fundamental functional units in the genome's 3D organization, where cCREs within the same domain interact more frequently, while across-TAD interactions are relatively rare [32]. We systematically analyzed cCREs of the high regulatory contribution group (Methods) located within the same or different TADs (Methods) relative to their target genes. Results show that RCSs of cCREs within the same TAD as the target gene are significantly higher than those of cCREs located in different TADs (Mann-Whitney U test, p < 0.0001). This significant TAD co-localization between high-contribution cCREs and genes was consistently observed across the analyzed cell lines (Fig. 4c, Supplementary Fig. S3e), strongly suggesting that DeepCis effectively accounts for the impact of 3D chromatin organization on gene regulation.
2.6. DeepCis identifies core transcription factor binding sites associated with active regulation

To investigate the association between high-RCS cCREs and the binding of active TFs, we integrated ChIP-seq data for 108 known TFs across five cell lines with target gene information (Methods). We analyzed the correlation between RCSs predicted by DeepCis and the enrichment of transcription factor binding sites for these target genes. Comparison of TF binding site counts between the high and low regulatory contribution groups (Methods) revealed significant enrichment of 87 TFs on high-RCS cCREs (Fisher’s Exact Test, p < 0.05), with an average enrichment factor of 2.393 (Fig. 5a, Supplementary Fig. S4a), demonstrating that DeepCis effectively identifies core regulatory regions highly associated with transcription factor binding. Notably, some TFs were not enriched in high-contribution cCREs, possibly due to shared regulation of adjacent genes by neighboring cCREs. Additionally, TFs such as CUX1 and FOXP1, implicated in transcriptional repression in existing studies [33, 34], may suppress gene expression or maintain a dormant state upon DNA binding, potentially explaining the lower contribution scores for their associated cCREs.

To quantify the relationship between the RCSs of cCREs and TF activity, we focused on the cCREs overlapping with TF binding peaks. We analyzed the correlation between the RCSs of these cCREs and the corresponding TF ChIP-seq signal intensities. Two distinct correlation patterns emerged at the gene level. The first group showed a significant positive correlation between the RCSs of cCREs and TF binding signal intensity, with an average Spearman correlation coefficient of 0.7, representing an activation mode (Fig. 5b). The second group exhibited a negative correlation, with an average Spearman coefficient of -0.77 (Fig. 5b), indicating regulation by transcriptional repression mechanisms.
We validated this correlation using tissue-specific data and corresponding cell lines, finding DeepCis predictions consistent with the observed directions (Fig. 5c, Supplementary Fig. S4b, Methods). For instance, activation of ROCK2 by the transcription factor NRF1 in the K562 cell line showed a significant positive correlation (Spearman correlation = 0.655, p = 0.001), corroborated in the acute myeloid leukemia (LAML) cohort (Spearman correlation = 0.372, p = \(2.40 \times 10^{-6}\)). Conversely, repression of TYROBP by NRF1 in K562 showed a significant negative correlation (Spearman correlation = -0.354, p = 0.004), supported by LAML data (Spearman correlation = -0.240, p = 0.003) (Fig. 5c, 5d). Negative correlations may, however, reflect complex inhibitory mechanisms, such as competitive binding or recruitment of chromatin remodeling complexes that indirectly suppress gene expression. Overall, these findings suggest that DeepCis captures multi-level regulatory patterns.

2.7. DeepCis predicts hub CREs in gene regulatory systems

Beyond quantifying regulatory relationships between genes and cCREs, DeepCis quantifies the synergistic interactions between cCREs, revealing global regulatory architectures. We hypothesize that cCREs frequently forming high-intensity synergistic relationships with other cCREs in different gene regulatory networks may serve as core hubs genome-wide. To test this, we pooled the high regulatory contribution groups (Methods) identified for each gene and analyzed them to identify key nodes with global influence. Genome-wide statistical analysis revealed that only a few cCREs were repeatedly selected across different gene regulatory networks with high occurrence frequency. Some high-frequency cCREs overlapped known super-enhancer regions (Supplementary Table 2), which are critical for regulating genes involved in cell identity, development, and disease processes [35, 36].
For example, in K562, the chr9:35732261–35732610 region overlapped with a super-enhancer associated with key genes including CREB3 (a transcription factor involved in stress response and cell growth regulation [37]), TLN1 (a key molecule in cell adhesion, migration, and signal transduction [38]), and CA9 (a carbonic anhydrase family member promoting tumor growth and metastasis in acidic tumor environments [39]). In HCT116, the chr1:1308334–1308678 region also overlapped a super-enhancer associated with key genes including ISG15 (important in antiviral immunity [40]) and ATAD3A (related to mitochondrial function [41]) (Fig. 5d). These findings suggest that these cCREs function as key "hub" cCREs orchestrating the regulation of multiple genes through extensive interactions, providing new insights into the complex epigenetic regulatory network.

2.8. Causal impact of key cCREs confirmed by perturbation analysis

To confirm the causal role of the high-RCS cCREs identified by the model in gene expression regulation, we performed progressive in silico genome perturbation experiments in each cell line across chromosomes. Specifically, for each gene’s associated cCREs, we simulated knockout of 5%, 10%, 15%, 20%, 25%, and 30% of high-RCS cCREs, ranked from high to low by RCS, and compared the results with knockouts of the same proportions of low-RCS cCREs. We then predicted gene expression following these perturbations. The perturbation analysis demonstrated the decisive impact of high-RCS cCREs on gene expression: removing low-RCS cCREs did not significantly affect the gene expression correlation, indicating minimal contribution. In contrast, removal of high-RCS cCREs caused a significant drop in prediction correlation across all chromosomes (Fig. 6a). A significant difference between high- and low-RCS cCREs was observed (Wilcoxon test, p < 0.0001), supporting a direct causal impact, within the model, of the identified high-RCS cCREs on predicted gene expression (Fig. 6a).
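As a minimal sketch of the ranking step in this perturbation analysis, the snippet below builds a keep-mask that drops a given fraction of a gene's cCREs from either the high-RCS or the low-RCS end. The function name `knockout_mask`, the rounding rule for the number of removed cCREs, and the idea of expressing a knockout as a boolean mask are assumptions for illustration; the source does not specify how removal is implemented in the model.

```python
import numpy as np

def knockout_mask(rcs, fraction, from_top=True):
    """Return a boolean keep-mask after removing `fraction` of cCREs ranked by RCS.

    from_top=True removes the highest-RCS cCREs; from_top=False removes the lowest.
    The number removed is round(fraction * n), an assumed convention.
    """
    rcs = np.asarray(rcs, dtype=float)
    n_remove = int(round(fraction * rcs.size))
    order = np.argsort(rcs)              # indices sorted by ascending RCS
    if from_top:
        order = order[::-1]              # descending RCS
    keep = np.ones(rcs.size, dtype=bool)
    keep[order[:n_remove]] = False       # mark knocked-out cCREs
    return keep

# Toy scores for ten cCREs of one gene (synthetic values, not from the paper).
scores = [0.10, 0.90, 0.30, 0.70, 0.50, 0.20, 0.80, 0.40, 0.60, 0.05]
high_ko = knockout_mask(scores, 0.20, from_top=True)    # drops indices 1 and 6
low_ko = knockout_mask(scores, 0.20, from_top=False)    # drops indices 9 and 0
```

In a comparison like the one described above, the two masks would be applied to the same gene's cCRE set before re-running prediction, so that the only difference between conditions is which end of the RCS ranking was removed.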
For further validation, we examined the changes in gene expression values. From the unseen HCT116 cell line, we selected the 100 genes with the smallest prediction error on chromosome 6 as stable genes. Gradually removing 10%, 20%, and 30% of high-RCS cCREs resulted in a noticeable decrease in the predicted expression values of most stable genes (Fig. 6b). EEF1E1 [42], a key factor in protein synthesis with a critical role in cellular physiology and pathology, showed significantly decreased expression levels when the cCREs with the top 10% of RCSs were knocked out. Notably, these top 10% cCREs were predominantly located in enhancer and promoter regions. This indicates that the reduction in correlation reflects the substantive contribution of the key cCREs identified by DeepCis, consistent with experimental findings in which enhancer knockouts decrease target gene expression. This perturbation experiment provides strong evidence for the causal impact of key cCREs and highlights the model’s ability to accurately identify core functional cCREs in complex gene regulatory networks.

3. Conclusion

This study presents DeepCis, a novel framework for understanding gene expression regulation mechanisms through the integration of genomic, epigenomic, and 3D chromatin organization data. By combining data across different scales, we capture the complex nature of gene regulation and identify potential CREs and their modes of action. The introduction of the regulatory contribution score (RCS) as a quantitative measure enables clear assessment of each regulatory element’s contribution to gene expression, enhancing model interpretability and providing valuable biological insights into gene regulation mechanisms. Using cCREs as the basic unit and constructing their contact matrices with a dimensionality reduction strategy effectively reduces the computational burden of processing large-scale Hi-C data, improving analysis efficiency.
In capturing regulatory patterns, the multi-head attention mechanism captures both traditional short-range regulatory effects and the influence of distal cCREs. It successfully reveals the constraints of chromatin organization on gene regulation, indicating that the model accounts not only for simple linear distances but also for how complex spatial organization restricts gene regulation. DeepCis associates high-RCS cCREs with transcription factor binding sites, revealing which TFs play key roles in the activation or repression of gene expression. This provides in-depth insights into how TFs regulate gene expression under different conditions, helping to identify and understand critical TFs. DeepCis also identifies core regulatory hub cCREs overlapping super-enhancer regions that frequently appear in multiple gene regulatory networks and regulate the expression of key genes. Super-enhancers play a crucial role in cell identity, development, and disease processes. These findings therefore provide new insights into the functions of super-enhancers in gene expression regulation and offer important clues for future epigenetic research.

We validated the biological functions of key cCREs through enrichment analysis and CRISPR experiments, providing comprehensive biological validation of the model's effectiveness and reliability. Additionally, we confirmed the key role of the high-RCS cCREs predicted by the model in gene expression regulation through genomic perturbation experiments from a computational perspective. These combined validation results demonstrate that the model achieves high accuracy at both biological and computational levels, providing strong theoretical and experimental support for a deeper understanding of gene regulation mechanisms.

Despite the progress in improving prediction accuracy and interpretability of gene regulation mechanisms, this study still has certain limitations.
First, while dimensionality reduction reduces the computational burden of Hi-C data, processing extremely large datasets and higher-resolution Hi-C data may still face computational resource and time limitations. Second, although our model performs well in capturing long-range interactions between cCREs, in some cases, particularly with complex regulatory networks, the model may not fully capture all subtle interactions and regulatory mechanisms. Future research could further optimize the model structure to enhance handling of complex regulatory networks.

In summary, the predictive and interpretive capabilities of DeepCis offer broad potential applications in clinical medicine, particularly in precision medicine and disease diagnosis. The model can accurately analyze the impact of non-coding region cCREs on gene regulation, helping to unravel how genes are influenced by multiple regulatory factors and capturing their dynamic interactions, which can assist clinicians in better understanding the mechanisms of diseases. By learning gene regulatory patterns in specific disease states and normal cell lines, the model can identify key cCREs driving diseases. These key cCREs help reveal the underlying regulatory mechanisms behind gene expression abnormalities and can serve as potential drug targets, enabling precise targeted therapies. Despite some limitations, with the continuous accumulation of data and optimization of algorithms, the accuracy and applicability of the model will be further enhanced, offering more personalized and efficient solutions in precision medicine and advancing medical technology.

4. Methods

4.1. Data collection and preprocessing

Data were obtained from publicly available databases, including candidate cis-regulatory elements (cCREs) in five human cell lines (GM12878, HepG2, K562, HeLaS3 and HCT116), chromatin accessibility and epigenomic modification datasets, 3D chromatin interaction maps (Hi-C), and transcription factor ChIP-seq peaks for each cell line. The details are as follows:

4.1.1. Selection of cCREs. To derive a high-confidence dataset of cCREs from the SCREEN (https://screen.encodeproject.org/) database and to enhance computational efficiency in subsequent analyses, we applied a filtering strategy. Specifically, two classes of potentially low-quality cCREs were removed: (i) Low_DNase cCREs, which may correspond to non-functional open chromatin regions; and (ii) DNase_only cCREs, characterized by DNase accessibility in the absence of corroborating epigenomic modifications.

4.1.2. DNA sequence extraction. Genomic sequences of cCREs were extracted from the GRCh38 reference genome using bedtools getfasta (v2.31.1). For compatibility with the deep learning model, nucleotide bases were numerically encoded (A = 1, T = 2, C = 3, G = 4, N = 5). All extracted sequences were subsequently padded to a uniform length of 350 bp (padding value P = 0), corresponding to the maximum cCRE length across cell types.

4.1.3. ChIP-seq data processing. In this study, ChIP-seq data corresponding to 16 different assays for cCREs (CTCF, DNase, EP300, H2AFZ, H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K79me2, H3K9ac, H3K9me3, H4K20me1, POLR2A, POLR2AphosphoS5) (Supplementary Table 1) were downloaded from the ENCODE database (https://www.encodeproject.org/) and processed to extract signal features and perform normalization. First, based on the genomic coordinates of the cCREs, signal values at single-base resolution were extracted from BigWig files using the pyBigWig library, generating a site-level signal matrix.
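For illustration, the numeric base encoding and zero padding described in Section 4.1.2 might look like the following minimal sketch. The names `BASE_TO_INT` and `encode_sequence` are hypothetical, and three behaviours are assumptions not stated in the source: padding is applied on the right, lowercase bases are upper-cased, and any character outside A/T/C/G is treated as N.

```python
import numpy as np

# Mapping stated in the text: A = 1, T = 2, C = 3, G = 4, N = 5; padding value 0.
BASE_TO_INT = {"A": 1, "T": 2, "C": 3, "G": 4, "N": 5}
PAD_VALUE = 0
MAX_LEN = 350  # maximum cCRE length reported in the text

def encode_sequence(seq, max_len=MAX_LEN):
    """Encode a DNA string as integers and right-pad with zeros to max_len."""
    # Assumption: unknown symbols fall back to the N code; longer inputs are truncated.
    codes = [BASE_TO_INT.get(base, BASE_TO_INT["N"]) for base in seq.upper()[:max_len]]
    codes += [PAD_VALUE] * (max_len - len(codes))
    return np.array(codes, dtype=np.int64)

short = encode_sequence("ACGTN", max_len=8)   # 5 bases followed by 3 padding zeros
full = encode_sequence("acgt")                # padded to the default 350 positions
```

Reserving 0 for padding keeps padded positions distinguishable from real bases, which is convenient if a downstream embedding layer or attention mask needs to ignore them.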
The signal results from all chromosomes were then merged, followed by a log(1 + x) transformation. Subsequently, the features were normalized to the range [0, 1] using Min-Max scaling, preparing the data for downstream feature analysis and modeling.

4.1.4. Hi-C data processing. Hi-C data from the 4D Nucleome (https://4dnucleome.org/) and ENCODE databases (Supplementary Table 1) were processed according to a standardized protocol. For the raw .hic files from the 4D Nucleome, we used the hic2cool (v0.1.1) tool to convert them directly into .cool files at 10 kb resolution. For the raw .hic files from ENCODE, we first extracted the 10 kb raw contact frequency matrices using the hicstraw library, and then generated .cool files at the same 10 kb resolution using the cooler tool. To eliminate systematic biases introduced by sequencing depth and technical preferences, all converted .cool files were normalized using the Iterative Correction and Eigenvector Decomposition (ICE) method.

4.1.5. Construction of cCRE contact matrices based on Hi-C data. To quantitatively analyze the spatial contact frequencies between cCREs, we constructed contact matrices with cCREs as the basic unit using Hi-C data at 10 kb resolution. Each cCRE segment was mapped to the coordinate system of the Hi-C matrix. For any given cCRE segment, we identified all bins that overlap its genomic location. If a segment spans multiple genomic bins, these bins collectively represent the position of the cCRE in the Hi-C contact matrix.
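The mapping of a cCRE onto Hi-C bins can be sketched as follows. This is an illustrative example only: the helper name `overlapping_bins` is hypothetical, and it assumes 0-based, half-open coordinates (as in BED files) and bin indices counted from the start of each chromosome.

```python
from typing import List

BIN_SIZE = 10_000  # 10 kb Hi-C resolution used in the text


def overlapping_bins(start: int, end: int, bin_size: int = BIN_SIZE) -> List[int]:
    """Return indices of the Hi-C bins overlapped by the interval [start, end)."""
    if end <= start:
        raise ValueError("end must be greater than start")
    first = start // bin_size
    last = (end - 1) // bin_size  # end is exclusive, so step back one base
    return list(range(first, last + 1))
```

With cCREs of at most 350 bp and 10 kb bins, a cCRE covers one bin in most cases and two bins when it straddles a bin boundary.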
The contact strength between two cCRE segments was defined as:

\(I_{ij} = \frac{1}{|B_i| \cdot |B_j|} \sum_{u \in B_i} \sum_{v \in B_j} M_{normalized}(u, v)\)

where \(M_{normalized}(u, v)\) is the normalized Hi-C contact frequency between bin u and bin v, \(B_i\) is the set of Hi-C bins covered by cCRE i, and \(B_j\) is the set of Hi-C bins covered by cCRE j. \(|B_i|\) and \(|B_j|\) are the numbers of bins covered by the respective cCREs. For each chromosome, a cCRE contact matrix \(M_{CRE} \in \mathbb{R}^{n \times n}\) was generated, where n denotes the number of valid cCREs successfully mapped on the chromosome.

4.1.6. Acquisition of condensed spatial features for cCREs. To obtain the spatial features of each cCRE in three-dimensional space, we applied Truncated Singular Value Decomposition (SVD) to map the high-dimensional cCRE contact matrix to a lower-dimensional latent space. The top k components (k = 32 or 64) were retained, ensuring that the cumulative explained variance exceeded 50% and preserving the essential topological and distance relationships. The resulting features were then scaled to the range [0, 1] using Min-Max scaling, so that the reduced vectors are on a consistent scale while capturing spatial proximity information for the cCREs. In Hi-C data, contact frequencies reflect the proximity of chromatin fragments in three-dimensional space. cCREs with similar interaction patterns correspond to low-dimensional vectors with high similarity, indicating that these segments are likely to be spatially close in the three-dimensional genome and reflecting co-localization trends. The low-dimensional vectors therefore also approximate the relative positioning of cCREs in three-dimensional space. Additionally, cCREs with similar values in certain component dimensions may share similar regulatory functions, reflecting cooperative regulatory trends.
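As an illustration of the contact-strength definition in Section 4.1.5, which produces the matrix that the SVD step then condenses, the following sketch averages the normalized contact frequencies over all bin pairs. The function names are hypothetical, and a plain nested list stands in for the ICE-normalized Hi-C matrix.

```python
from typing import List, Sequence


def contact_strength(m_norm: Sequence[Sequence[float]],
                     bins_i: Sequence[int],
                     bins_j: Sequence[int]) -> float:
    """Mean normalized contact frequency over all pairs (u, v), u in B_i, v in B_j."""
    if not bins_i or not bins_j:
        raise ValueError("each cCRE must cover at least one bin")
    total = sum(m_norm[u][v] for u in bins_i for v in bins_j)
    return total / (len(bins_i) * len(bins_j))


def ccre_contact_matrix(m_norm: Sequence[Sequence[float]],
                        ccre_bins: List[Sequence[int]]) -> List[List[float]]:
    """Build the n x n cCRE contact matrix for one chromosome."""
    n = len(ccre_bins)
    return [[contact_strength(m_norm, ccre_bins[i], ccre_bins[j]) for j in range(n)]
            for i in range(n)]
```

Because the sum is divided by the number of bin pairs, a cCRE that straddles two bins is not given a larger contact strength simply for covering more bins.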
In summary, this spatial condensation method not only condenses the high-dimensional contact information between cCREs but also retains the core features of their spatial structure.

4.1.7. RNA data processing. For each gene, the transcriptional start site (TSS) of the transcript with the highest expression was selected as the representative TSS. The expression values of all transcripts corresponding to a gene were summed to represent the gene's overall expression level. To address data skewness and stabilize variance, a logarithmic transformation, \(\log_{10}(\mathrm{TPM} + 1)\), was applied.

4.1.8. Selection of cCREs in the proximal regions of genes. The TSS of each gene was used to identify cCREs in its upstream and downstream regions. The distance between each cCRE on the same chromosome and the TSS was calculated, and the 90 closest cCREs upstream and the 90 closest downstream were selected. To focus on those most likely to be involved in gene regulation, only cCREs within 1.5 Mb of the TSS were retained, a distance sufficient to capture typical cCREs.

4.1.9. Transcription factor and target gene retrieval. Transcription factors (TFs) were retrieved by performing motif enrichment analysis on all cCREs for each cell line using HOMER. Significant motifs (P < 0.05) were selected and mapped to the corresponding TFs. To ensure data validity, only TFs with available ChIP-seq data in the ENCODE database were retained for analysis. We integrated the target genes of TFs from the TRRUST (https://www.grnpedia.org/trrust/) and hTFtarget (https://guolab.wchscu.cn/hTFtarget/#!/) databases.

4.1.10. Enhancer and Super-Enhancer Data Retrieval. The enhancer data for the GM12878, HepG2, K562, and HeLaS3 cell lines were downloaded from the FANTOM5 database (https://fantom.gsc.riken.jp/5/), while the super-enhancers for the GM12878, HepG2, K562, HeLaS3 and HCT116 cell lines were retrieved from the dbSUPER database (https://asntech.org/dbsuper/).
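The gene-level expression value of Section 4.1.7 and the cCRE selection of Section 4.1.8 can be sketched as below. Several details are assumptions for illustration and are not stated in the text: cCREs are represented by a single coordinate (for example, a midpoint), "upstream" means a smaller coordinate (a plus-strand convention), and a cCRE located exactly at the TSS is counted as downstream. The function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

MAX_DIST = 1_500_000  # 1.5 Mb window around the TSS
N_PER_SIDE = 90       # closest cCREs kept on each side of the TSS


def gene_expression(transcript_tpm: Dict[str, float]) -> float:
    """Sum the transcript TPMs of one gene and apply log10(TPM + 1)."""
    return math.log10(sum(transcript_tpm.values()) + 1.0)


def select_ccres(tss: int, ccre_positions: List[int],
                 n_per_side: int = N_PER_SIDE,
                 max_dist: int = MAX_DIST) -> Tuple[List[int], List[int]]:
    """Return (upstream, downstream) cCRE positions nearest to the TSS."""
    # Keep only cCREs inside the distance window.
    in_window = [p for p in ccre_positions if abs(p - tss) <= max_dist]
    # Sort each side by increasing distance from the TSS.
    upstream = sorted((p for p in in_window if p < tss), key=lambda p: tss - p)
    downstream = sorted((p for p in in_window if p >= tss), key=lambda p: p - tss)
    return upstream[:n_per_side], downstream[:n_per_side]
```

Genes with fewer than 90 cCREs on one side within the window simply return a shorter list, which is consistent with the mask vector used later to handle variable numbers of cCREs per gene.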
4.2 Dataset Construction

For each cell line, each gene was treated as a sample. The model's input for each gene comprised a concatenated sequence of its cCREs, each represented by a 350-base sequence. Each base was encoded with a 6-dimensional one-hot vector and 16-dimensional epigenetic features. Spatial information was captured by a 32-dimensional spatial feature vector for each element, supplemented by linear chromosomal positions and gene-level positional data. A mask vector was used to handle variable numbers of cCREs per gene. The target labels for the model were the transformed gene expression values. To ensure the model focused solely on the impact of cCREs, gene sequences and their corresponding epigenetic signals at the gene locus were intentionally masked. This approach prevented information leakage and allowed the model to learn the regulatory relationship through the “pathway” between cCREs and their target genes. To evaluate the model's generalization and robustness across different genomic contexts, a chromosome-based partitioning strategy was employed. Chromosome 16 was designated as the validation set because its enrichment in low-copy repeat sequences and structural variation hotspots [43, 44] presents a challenge for the model. Chromosome 6 was chosen as the test set, as its moderate length, high gene density, and complex regions [45–47] make it a representative and robust benchmark for evaluating performance on unseen genomic contexts. All other autosomes were used for training.

4.3 Model Design

4.3.1. Model Architecture Design. DeepCis integrates information from cCREs and genes to predict target gene expression levels and model regulatory relationships using two cascaded Transformer encoders. The core of the Transformer encoder is the multi-head attention mechanism, in which each head computes attention weights for each token relative to the other tokens in the sequence, capturing global contextual relationships [48].
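For reference, the standard multi-head attention of Vaswani et al. [48] can be written as below. This is the generic formulation, not the exact DeepCis layer: the relative positional biases and masks that DeepCis adds are described only qualitatively in this section, so they are not shown. The symbols X (input token matrix), H (number of heads), \(d_k\) (per-head key dimension) and the projection matrices W are notation introduced here for illustration.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

\mathrm{head}_h = \mathrm{Attention}\!\left(X W_h^{Q},\; X W_h^{K},\; X W_h^{V}\right), \quad h = 1, \dots, H

\mathrm{MultiHead}(X) = \mathrm{Concat}\!\left(\mathrm{head}_1, \dots, \mathrm{head}_H\right) W^{O}
```

Each row of the softmax term is a probability distribution over the tokens in the sequence, so it can be read as how strongly one token attends to each of the others.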
The outputs of all heads are concatenated and passed through a linear transformation to obtain the final multi-head output. The model first maps genomic sequence features, epigenetic features, and spatial embedding features into continuous 256-dimensional vectors. These mapped features are concatenated along their dimensions and passed through a linear layer and layer normalization to form a unified 256-dimensional joint feature. In the feature extraction phase, this joint feature is input into a 1D convolutional module with a stride of 5 to extract local patterns and downsample the sequence length. The extracted feature sequence is then fed into the first Transformer encoder, where relative positional bias is applied to incorporate base positional information. The 256-dimensional representation of the final token from the encoder output is used as an initial feature for the cCREs or genes. The spatial embedding features are then added to this initial feature to form the final feature representation for each cCRE and gene. In subsequent modeling, the target gene and its 180 corresponding cCREs serve as input sequences to the second Transformer encoder. Multi-scale relative positional biases are constructed based on the genomic coordinates of the cCREs and target gene, enabling the model to effectively incorporate the impact of three-dimensional spatial constraints on gene regulation at different resolutions. Finally, the output of the second encoder is layer-normalized, and the feature representation corresponding to the target gene token is extracted and mapped to a one-dimensional output via a fully connected layer to predict the target gene’s expression. We extracted the attention scores from the final layer of the second encoder and aggregated the attention values across all positions by averaging. 
The resulting score, termed the Regulatory Contribution Score (RCS), measures the degree of attention a gene gives to cCREs or the attention one cCRE gives to another, thereby quantifying the contribution of each cCRE to gene expression regulation.

4.3.2 Dual-mask design. To reflect the biological directionality of regulation, a base mask is introduced in the encoder to restrict information flow, allowing cCREs to focus on genes while limiting the reverse flow from cCREs to genes. A second mask in DeepCis allows one or more cCREs within the input to be selectively masked, enabling precise investigation of the role of individual cCREs.

4.3.3 Model Training. The model was trained using the AdamW optimizer with a ReduceLROnPlateau learning rate scheduler, which dynamically adjusts the learning rate based on the validation loss. The loss function was mean squared error (MSE). To prevent overfitting, early stopping was applied, halting training if the validation loss did not improve for 15 consecutive epochs. Dropout with a rate of 0.1 was used during training. The model was initialized from the pretrained CREaTor model [49], with incompatible parameters initialized using Xavier and Kaiming random initialization. The DNA embedding layer was frozen, while all other layers were updated during training.

4.3.4. Model Comparison. To further evaluate the performance of our model, we used the K562 functional enhancer dataset, validated through CRISPR/Cas9 gene editing, that is provided with CREaTor [23]. Using the results reported for CREaTor, we compared the performance of DeepCis, ABC, Enformer, GraphReg, and CREaTor in predicting enhancers in the K562 cell line.

4.4. Definition of Regulatory Contribution Element Grouping

For each gene, the corresponding cCREs are ranked by their RCS from high to low.
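The ranking and the 10% grouping used in this section can be sketched as follows. The function name `rcs_groups`, the rounding of the group size upward, and the minimum group size of one are assumptions for illustration; the text does not state how fractional group sizes are handled.

```python
from typing import Dict, List, Tuple


def rcs_groups(rcs: Dict[str, float], percent: int = 10) -> Tuple[List[str], List[str]]:
    """Split one gene's cCREs into high and low regulatory contribution groups."""
    ranked = sorted(rcs, key=rcs.get, reverse=True)  # high to low RCS
    if not ranked:
        return [], []
    # Assumption: the group size is rounded up and is at least one element.
    k = max(1, (len(ranked) * percent + 99) // 100)
    return ranked[:k], ranked[-k:]
```

For a gene with the full set of 180 cCREs, this gives 18 elements in each group.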
The top 10% of cCREs are defined as the "high regulatory contribution group," while the bottom 10% are defined as the "low regulatory contribution group."

4.5. Identification of Optimal Epigenomic Feature Combinations

We performed an exhaustive search to identify the epigenomic feature combinations that maximize the Pearson correlation coefficient on the test set (chromosome 6), retaining 2 to 6 features for each cell line.

4.6. TAD Identification

In this study, we used 40 kb resolution Hi-C data and the hicFindTADs tool to identify TADs in the various cell lines.

4.7. Validation of regulatory direction using TCGA data

To validate the regulatory direction predicted by DeepCis, we acquired TCGA datasets from the Genomic Data Commons (GDC) Data Portal (https://portal.gdc.cancer.gov/), encompassing multiple cancer types: diffuse large B-cell lymphoma (TCGA-DLBC), liver hepatocellular carcinoma (TCGA-LIHC), acute myeloid leukemia (TCGA-LAML), cervical squamous cell carcinoma and endocervical adenocarcinoma (TCGA-CESC), and colon adenocarcinoma (TCGA-COAD). For the samples in these datasets, we calculated the correlation between the expression levels of TFs and their corresponding target genes. We then compared these correlations with the correlations between the predicted RCSs and the ChIP-seq signal intensities of the corresponding TFs. By examining whether the signs of the two correlations were consistent, we validated the consistency of the regulatory direction. A match in sign between the TF-target gene expression correlation and the RCS-ChIP-seq signal correlation indicates that the predicted regulatory direction agrees with the expression data, supporting the accuracy of our model.

Declarations

Conflicts of Interest
The authors declare no conflicts of interest.

Funding
This work was supported by the National Natural Science Foundation of China (32170674).

Author Contribution
H.Z., S.W.N. and P.W. conceived the research direction and guided the research design. X.R.L.
designed the study, drafted the initial manuscript and performed data analysis. W.B.D., J.J.D., M.M.L. and S.S.W. conducted data collection. M.Y.X., Y.L.X., H.B.Z. and Y.S.Z. verified the research results. W.F. and Y.J.L. were responsible for data visualization. All authors critically revised the manuscript for important intellectual content and approved the final version for publication.

Acknowledgements
The authors thank the National Natural Science Foundation of China (32170674).

Data Availability
The data that support the findings of this study are available upon request. The following databases were used in this research: SCREEN (http://screen.encodeproject.org/), ENCODE (https://www.encodeproject.org/), 4D Nucleome (https://4dnucleome.org/), FANTOM5 (https://fantom.gsc.riken.jp/5/), TRRUST (https://www.grnpedia.org/trrust/), hTFtarget (https://guolab.wchscu.cn/hTFtarget/#!/), Genomic Data Commons (GDC) (https://portal.gdc.cancer.gov/), and dbSUPER (https://asntech.org/dbsuper/).

References
1. Wittkopp, P. J. & Kalay, G. Cis-regulatory elements: molecular mechanisms and evolutionary processes underlying divergence. Nat. Rev. Genet. 13, 59–69 (2012).
2. Ong, C.-T. & Corces, V. G. Enhancer function: new insights into the regulation of tissue-specific gene expression. Nat. Rev. Genet. 12, 283–293 (2011).
3. Lee, T. I. & Young, R. A. Transcriptional Regulation and Its Misregulation in Disease. Cell 152, 1237–1251 (2013).
4. Cis-Regulatory Alterations in FOXP4 Modulate Esophageal Cancer Susceptibility Induced by Chronic Alcohol Exposure - PubMed. https://pubmed.ncbi.nlm.nih.gov/40228145/.
5. Castro-Mondragon, J. A. et al. Cis-regulatory mutations associate with transcriptional and post-transcriptional deregulation of gene regulatory programs in cancers. Nucleic Acids Res. 50, 12131–12148 (2022).
6. Kim, K. L. et al. Dissection of a CTCF topological boundary uncovers principles of enhancer-oncogene regulation. Mol. Cell 84, 1365-1376.e7 (2024).
7. Zhao, S. G. et al. Integrated analyses highlight interactions between the three-dimensional genome and DNA, RNA and epigenomic alterations in metastatic prostate cancer. Nat. Genet. 56, 1689–1700 (2024).
8. ENCODE data at the ENCODE portal - PubMed. https://pubmed.ncbi.nlm.nih.gov/26527727/.
9. Thomas, H. F. et al. Enhancer cooperativity can compensate for loss of activity over large genomic distances. Mol. Cell 85, 362-375.e9 (2025).
10. Costa, J. R. et al. Transcription factor cooperativity at a GATA3 tandem DNA sequence determines oncogenic enhancer-mediated activation. Cell Rep. 44, 115705 (2025).
11. Enhancing CRISPR-Cas9 gRNA efficiency prediction by data integration and deep learning | Nature Communications. https://www.nature.com/articles/s41467-021-23576-0#ethics.
12. Hi–C: A comprehensive technique to capture the conformation of genomes. Methods 58, 268–276 (2012).
13. ChIA-PET tool for comprehensive chromatin interaction analysis with paired-end tag sequencing | Genome Biology. https://link.springer.com/article/10.1186/gb-2010-11-2-r22.
14. HiChIP: efficient and sensitive analysis of protein-directed genome architecture | Nature Methods. https://www.nature.com/articles/nmeth.3999.
15. Proximity Ligation-Assisted ChIP-Seq (PLAC-Seq) | SpringerLink. https://link.springer.com/protocol/10.1007/978-1-0716-1597-3_10.
16. Fulco, C. P. et al. Activity-by-Contact model of enhancer-promoter regulation from thousands of CRISPR perturbations. Nat. Genet. 51, 1664–1669 (2019).
17. Ahmed, F. S., Aly, S. & Liu, X. EPI-Trans: an effective transformer-based deep learning model for enhancer promoter interaction prediction. BMC Bioinformatics 25, 216 (2024).
18. Zhang, T., Zhao, X., Sun, H., Gao, B. & Liu, X. GATv2EPI: Predicting Enhancer-Promoter Interactions with a Dynamic Graph Attention Network. Genes 15, 1511 (2024).
19. KansformerEPI: a deep learning framework integrating KAN and transformer for predicting enhancer-promoter interactions - PubMed. https://pubmed.ncbi.nlm.nih.gov/40515390/.
20. Kelley, D. R. Cross-species regulatory sequence activity prediction. PLoS Comput. Biol. 16, e1008050 (2020).
21. Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).
22. Chromatin interaction-aware gene regulatory modeling with graph attention networks - PubMed. https://pubmed.ncbi.nlm.nih.gov/35396274/.
23. Li, Y. et al. CREaTor: zero-shot cis-regulatory pattern modeling with attention mechanisms. Genome Biol. 24, 266 (2023).
24. Integrative analysis reveals functional and regulatory roles of H3K79me2 in mediating alternative splicing | Genome Medicine. https://link.springer.com/article/10.1186/s13073-018-0538-1.
25. Precise targeting of POLR2A as a therapeutic strategy for human triple negative breast cancer | Nature Nanotechnology. https://www.nature.com/articles/s41565-019-0381-6.
26. Bclaf1 mediates super-enhancer-driven activation of POLR2A to enhance chromatin accessibility in nitrosamine-induced esophageal carcinogenesis. J. Hazard. Mater. 492, 138218 (2025).
27. Why the activity of a gene depends on its neighbors: Trends in Genetics. https://www.cell.com/trends/genetics/abstract/S0168-9525(15)00129-8.
28. Schindler, M. et al. Induction of kidney-related gene programs through co-option of SALL1 in mole ovotestes. Development 150, dev201562 (2023).
29. Young LINE-1 transposon 5′ UTRs marked by elongation factor ELL3 function as enhancers to regulate naïve pluripotency in embryonic stem cells | Nature Cell Biology. https://www.nature.com/articles/s41556-023-01211-y.
30. Promoter-proximal CTCF binding promotes distal enhancer-dependent gene activation | Nature Structural & Molecular Biology. https://www.nature.com/articles/s41594-020-00539-5.
31. SMARCB1 loss activates patient-specific distal oncogenic enhancers in malignant rhabdoid tumors | Nature Communications. https://www.nature.com/articles/s41467-023-43498-3.
32. TAD boundary deletion causes PITX2-related cardiac electrical and structural defects | Nature Communications. https://www.nature.com/articles/s41467-024-47739-x.
33. CUX1 restrains latent hematopoietic stem cell plasticity by suppressing stem cell-intrinsic inflammatory pathways. Blood (2025) doi:10.1182/blood.2024026815.
34. FOXP1 directly represses transcription of proapoptotic genes and cooperates with NF-κB to promote survival of human B cells | Blood | American Society of Hematology. https://ashpublications.org/blood/article/124/23/3431/33456/FOXP1-directly-represses-transcription-of.
35. Mechanism of ERBB2 gene overexpression by the formation of super-enhancer with genomic structural abnormalities in lung adenocarcinoma without clinically actionable genetic alterations | Molecular Cancer. https://link.springer.com/article/10.1186/s12943-024-02035-6.
36. Super-enhancer-driven SLCO4A1-AS1 is a new biomarker and a promising therapeutic target in glioblastoma | Scientific Reports. https://www.nature.com/articles/s41598-024-82109-z.
37. Dysregulated CREB3 cleavage at the nuclear membrane induces karyoptosis-mediated cell death | Experimental & Molecular Medicine. https://www.nature.com/articles/s12276-024-01195-1.
38. TLN1: an oncogene associated with tumorigenesis and progression | Discover Oncology. https://link.springer.com/article/10.1007/s12672-024-01593-x.
39. Targeting CA9 restricts pancreatic cancer progression through pH regulation and ROS production | Cellular Oncology. https://link.springer.com/article/10.1007/s13402-024-01022-9.
40. The diverse repertoire of ISG15: more intricate than initially thought | Experimental & Molecular Medicine. https://www.nature.com/articles/s12276-022-00872-3.
41. ATAD3A: A Key Regulator of Mitochondria-Associated Diseases. https://www.mdpi.com/1422-0067/24/15/12511.
42. A Novel HCC Prognosis Predictor EEF1E1 Is Related to Immune Infiltration and May Be Involved in EEF1E1/ATM/p53 Signaling - PubMed. https://pubmed.ncbi.nlm.nih.gov/34282404/.
43. Clinical Implications of Chromosome 16 Copy Number Variation | Molecular Syndromology | Karger Publishers. https://karger.com/msy/article-abstract/13/3/184/825184/Clinical-Implications-of-Chromosome-16-Copy-Number?redirectedFrom=fulltext.
44. Martin, J. et al. The sequence and analysis of duplication-rich human chromosome 16. Nature 432, 988–994 (2004).
45. Copy Neutral LOH Affecting the Entire Chromosome 6 Is a Frequent Mechanism of HLA Class I Alterations in Cancer. https://www.mdpi.com/2072-6694/13/20/5046.
46. DNA Methylation and Transcription of HLA-F and Serum Cytokines Relate to Chinese Medicine Syndrome Classification in Patients with Chronic Hepatitis B | Chinese Journal of Integrative Medicine. https://link.springer.com/article/10.1007/s11655-021-3279-8.
47. Thomas, C. et al. TERT promoter mutation and chromosome 6 loss define a high-risk subtype of ependymoma evolving from posterior fossa subependymoma. Acta Neuropathol. (Berl.) 141, 959–970 (2021).
48. Vaswani, A. et al. Attention is All you Need. in Advances in Neural Information Processing Systems vol. 30 (Curran Associates, Inc., 2017).
49. Li, Y. et al. CREaTor: zero-shot cis-regulatory pattern modeling with attention mechanisms. Genome Biol. 24, 266 (2023).

Additional Declarations
No competing interests reported.

Supplementary Files
supportingInformation.docx
Supporting Information: Supplementary Table 1: Candidate CREs and Epigenomic Data across Different Cell Lines (ENCODE Data). Supplementary Fig. S1 Robustness of DeepCis in gene expression prediction. Supplementary Fig.
S2 Quantification and identification of key cCREs by DeepCis. Supplementary Fig. S3 DeepCis's ability to capture different cCREs. Supplementary Fig. S4 Enrichment analysis of transcription factor binding sites and validation of regulatory directionality. Supplementary Table 2: Candidate CREs that are among the top 20 in frequency and overlap with known super-enhancers.

Editorial timeline: Reviewers agreed at journal 27 Apr, 2026; Reviewers invited by journal 05 Mar, 2026; Editor invited by journal 03 Mar, 2026; Editor assigned by journal 28 Feb, 2026; Submission checks completed at journal 28 Feb, 2026; First submitted to journal 27 Feb, 2026.
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-8988269","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":602458842,"identity":"90deccfe-f9c4-432c-b153-08dfc6f527ed","order_by":0,"name":"Xinrong Li","email":"","orcid":"","institution":"Harbin Medical University","correspondingAuthor":false,"prefix":"","firstName":"Xinrong","middleName":"","lastName":"Li","suffix":""},{"id":602458843,"identity":"d2519c90-a794-4a71-b6eb-d1ddd31212ea","order_by":1,"name":"Wei Feng","email":"","orcid":"","institution":"Harbin Medical University","correspondingAuthor":false,"prefix":"","firstName":"Wei","middleName":"","lastName":"Feng","suffix":""},{"id":602458844,"identity":"ff1e77c7-fd2a-4014-b54b-00133b08e6a7","order_by":2,"name":"Junjie Duan","email":"","orcid":"","institution":"Harbin Medical University","correspondingAuthor":false,"prefix":"","firstName":"Junjie","middleName":"","lastName":"Duan","suffix":""},{"id":602458845,"identity":"19ef1f1d-7ede-443d-855c-724c1802f662","order_by":3,"name":"Yujie Liu","email":"","orcid":"","institution":"Harbin Medical University","correspondingAuthor":false,"prefix":"","firstName":"Yujie","middleName":"","lastName":"Liu","suffix":""},{"id":602458846,"identity":"e706dd28-3dcb-45ba-b42c-1c1fdb13a061","order_by":4,"name":"Yusen Zhao","email":"","orcid":"","institution":"Harbin Medical University","correspondingAuthor":false,"prefix":"","firstName":"Yusen","middleName":"","lastName":"Zhao","suffix":""},{"id":602458847,"identity":"7d9da4e1-6450-4f5e-8abf-5b3c7a9fe677","order_by":5,"name":"Shuangshuang 
Wang","email":"","orcid":"","institution":"Harbin Medical University","correspondingAuthor":false,"prefix":"","firstName":"Shuangshuang","middleName":"","lastName":"Wang","suffix":""},{"id":602458848,"identity":"237238a0-042a-4598-9be7-a724b65f58b4","order_by":6,"name":"Hongbo Zhu","email":"","orcid":"","institution":"Harbin Medical University","correspondingAuthor":false,"prefix":"","firstName":"Hongbo","middleName":"","lastName":"Zhu","suffix":""},{"id":602458849,"identity":"f53aa061-d843-47d8-abad-b5abfd43d7a5","order_by":7,"name":"Manyi Xu","email":"","orcid":"","institution":"Harbin Medical University","correspondingAuthor":false,"prefix":"","firstName":"Manyi","middleName":"","lastName":"Xu","suffix":""},{"id":602458850,"identity":"f2b7b968-000c-4691-9cd7-7737a5d7ffe4","order_by":8,"name":"Yongle Xu","email":"","orcid":"","institution":"Harbin Medical University","correspondingAuthor":false,"prefix":"","firstName":"Yongle","middleName":"","lastName":"Xu","suffix":""},{"id":602458851,"identity":"341143ae-0410-44bd-bd99-23171091f31a","order_by":9,"name":"Mengmeng Liu","email":"","orcid":"","institution":"Harbin Medical University","correspondingAuthor":false,"prefix":"","firstName":"Mengmeng","middleName":"","lastName":"Liu","suffix":""},{"id":602458852,"identity":"6a9df717-352d-45d7-945c-ff330690dd65","order_by":10,"name":"Wenbo Dong","email":"","orcid":"","institution":"Harbin Medical University","correspondingAuthor":false,"prefix":"","firstName":"Wenbo","middleName":"","lastName":"Dong","suffix":""},{"id":602458853,"identity":"f8b90cf9-b612-493c-a491-96557c987dd3","order_by":11,"name":"Peng Wang","email":"","orcid":"","institution":"Harbin Medical University","correspondingAuthor":false,"prefix":"","firstName":"Peng","middleName":"","lastName":"Wang","suffix":""},{"id":602458854,"identity":"75acb202-1d54-4ebe-8ea1-770b41d3d14c","order_by":12,"name":"Shangwei Ning","email":"","orcid":"","institution":"Harbin Medical 
University","correspondingAuthor":false,"prefix":"","firstName":"Shangwei","middleName":"","lastName":"Ning","suffix":""},{"id":602458855,"identity":"6915b29a-5dec-4e3a-8bb9-7f14cf36e204","order_by":13,"name":"Hui Zhi","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA6UlEQVRIiWNgGAWjYDACCcY2BgYDGxCTjYGxASxmQIyWNJK0AFUyMBwmQYv87Oa2xzwF5+V0+w+wPa7csS2xgb15mwRDzR2cWhjnHGw35jG4bWx2I4Hd8OyZ24kNPMfKJBiOPcOphVkisU0aqCVx2w0GNsnGNqAWiRwzCcaGwzi1sEG0nKvfdv4AVIv8G/xaeCBaDiSYHUiA2cKDX4sEUIvkHINkw203gIzGM7eN23jSii0SjuHWIj8j/ZnEmz928mbnDx+TbNxxW7af/fDGGx9qcGtBAtBIAUUTQwIxGkbBKBgFo2AU4AQA+hlSygZd57oAAAAASUVORK5CYII=","orcid":"","institution":"Harbin Medical University","correspondingAuthor":true,"prefix":"","firstName":"Hui","middleName":"","lastName":"Zhi","suffix":""}],"badges":[],"createdAt":"2026-02-27 12:54:42","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-8988269/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-8988269/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":104434881,"identity":"e361d1bf-2fb7-42ee-83ab-31e63e4c875b","added_by":"auto","created_at":"2026-03-11 16:31:01","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":1081720,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eOverview of the DeepCis framework.\u003c/strong\u003e \u003cstrong\u003ea\u003c/strong\u003eData preprocessing.Fine-grained features of cCREs are extracted at the base pair level from genomic sequence and epigenomic signals, and condensed spatial features of cCREs are obtained at the region level. \u003cstrong\u003eb\u003c/strong\u003e DeepCis architecture. 
The framework consists of the sample input module, feature mapping module, feature fusion encoding module, and interaction representation encoding module, with input features from cCREs within 1.5 MB upstream and downstream of the TSS. \u003cstrong\u003ec\u003c/strong\u003e Tasks achievable with DeepCis.\u003c/p\u003e","description":"","filename":"image4.png","url":"https://assets-eu.researchsquare.com/files/rs-8988269/v1/aff351c2d5945dd86d819044.png"},{"id":104434887,"identity":"caad69fe-3e79-4654-9b77-d8445b002c1e","added_by":"auto","created_at":"2026-03-11 16:31:01","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":853687,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eRobustness of DeepCis in gene expression prediction\u003c/strong\u003e. \u003cstrong\u003ea\u003c/strong\u003e The impact of different feature fusion strategies on model performance. \u003cstrong\u003eb\u003c/strong\u003e Performance of different fusion strategies and spatial feature dimensions (32 and 64) on the test set, comparing the difference between initializing with a pretrained model (w) and without (o). \u003cstrong\u003ec\u003c/strong\u003e Prediction results on chromosome 6 of the test set across different cell lines. \u003cstrong\u003ed\u003c/strong\u003e Prediction performance on various chromosomes in the unseen HCT116 cell line. 
\u003cstrong\u003ee\u003c/strong\u003e Prediction results after randomly removing a subset of epigenomic features (removal range: 5-13 features, iterated 10 times) and evaluating robustness using the Pearson correlation coefficient.\u003c/p\u003e","description":"","filename":"image5.png","url":"https://assets-eu.researchsquare.com/files/rs-8988269/v1/81f19e24624e5b06844f2cdd.png"},{"id":104780075,"identity":"38cb5dbe-b1dc-4382-b3e3-78050d860959","added_by":"auto","created_at":"2026-03-17 07:50:08","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":1109299,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eQuantification and identification of key cCREs by DeepCis\u003c/strong\u003e. \u003cstrong\u003ea\u003c/strong\u003e Enrichment distribution of enhancers and promoters in high and low regulatory contribution group. \u003cstrong\u003eb\u003c/strong\u003e Fold enrichment of enhancers and promoters in high regulatory contribution group relative to low regulatory contribution group across cell lines. \u003cstrong\u003ec\u003c/strong\u003e AUC performance of the model on experimentally validated K562 enhancer data with all input Epigenomic features. \u003cstrong\u003ed \u003c/strong\u003eEpigenomic features that correspond to the maximum Pearson correlation coefficient in K562 prediction when retaining 2–6 features.\u003cstrong\u003e e–i\u003c/strong\u003e Overall AUC of the model across all chromosomes and the maximum and minimum AUC values for each chromosome when retaining the optimal number (2–6) of features. 
\u003cstrong\u003ej\u003c/strong\u003e Performance comparison of different models based on AUC.\u003c/p\u003e","description":"","filename":"image6.png","url":"https://assets-eu.researchsquare.com/files/rs-8988269/v1/ca967f9316bef46d97b78502.png"},{"id":104434883,"identity":"f4003263-4080-4575-9c77-6df7175d2fa5","added_by":"auto","created_at":"2026-03-11 16:31:01","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":1229080,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eDeepCis's ability to capture different cCREs.\u003c/strong\u003e \u003cstrong\u003ea\u003c/strong\u003e Spearman correlation between the regulatory contribution scores of cCREs and their distance to the gene TSS. \u003cstrong\u003eb\u003c/strong\u003e Distribution of the types of the top ten high-contribution cCREs for genes across chromosomes in the GM12878 cell line. \u003cstrong\u003ec\u003c/strong\u003e Differences in the co-localization of genes and their top ten high-contribution cCREs within TADs in the GM12878 cell line.\u003c/p\u003e","description":"","filename":"image7.png","url":"https://assets-eu.researchsquare.com/files/rs-8988269/v1/3ea03ed4a1573a507400d03b.png"},{"id":104808359,"identity":"310f4d7a-f6e0-4f07-b9c3-9c49ebca1362","added_by":"auto","created_at":"2026-03-17 12:36:29","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":1124299,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eIdentification of core TFs and prediction of key cCREs by DeepCis.\u003c/strong\u003e \u003cstrong\u003ea\u003c/strong\u003e Comparison of the enrichment of transcription factor binding peaks on cCREs in the high and low regulatory contribution groups across different cell lines. \u003cstrong\u003eb\u003c/strong\u003e For each transcription factor (y-axis), each point represents a target gene associated with that factor. 
The x-axis shows the Spearman correlation coefficient, measuring the correlation between the model-predicted regulatory scores and TF ChIP-seq signal intensity for cCREs with binding peaks on the gene. Positive correlations indicate that the transcription factor tends to promote the target gene's transcriptional activity, while negative correlations suggest a repressive effect. \u003cstrong\u003ec\u003c/strong\u003e Comparison of the predicted regulatory directions from DeepCis with the correlation directions between TFs and target gene expression levels in TCGA samples, using NRF1 in the K562 cell line as an example. \u003cstrong\u003ed\u003c/strong\u003e Two distinct regulatory modes, transcriptional activation and transcriptional repression, and their regulatory mechanisms.\u003cstrong\u003ee\u003c/strong\u003e Example of DeepCis predicting core hub cCREs, showing that such cCREs can simultaneously influence both the target gene and other cCREs.\u003c/p\u003e","description":"","filename":"image8.png","url":"https://assets-eu.researchsquare.com/files/rs-8988269/v1/bcb67c70fa9de4180e63af32.png"},{"id":104780325,"identity":"bc390d32-bc0d-477d-8212-9315b3099b7b","added_by":"auto","created_at":"2026-03-17 07:52:20","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":1210399,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003ePerturbation Analysis Results. a\u003c/strong\u003e Model performance across different cell lines. Each data point represents a chromosome, with the y-axis showing the Pearson correlation coefficient between predicted values and gene knockdown experimental observations. The x-axis represents percentile ranges of regulatory contribution scores, where \"No\" indicates no element knockdown, “T” refers to the top 5–30% of cCREs ranked by DeepCis predicted RCSs, and “B” refers to the bottom 5–30%. 
\u003cstrong\u003eb\u003c/strong\u003e Comparison of gene expression levels before and after gene knockdown for specific genes on chromosome 6 across different cell lines.\u003c/p\u003e","description":"","filename":"image9.png","url":"https://assets-eu.researchsquare.com/files/rs-8988269/v1/9bcf90150e409cdcba714606.png"},{"id":104809056,"identity":"11dc7aae-00bc-4c15-b71d-c41b6cd049ea","added_by":"auto","created_at":"2026-03-17 12:47:00","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":7451922,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8988269/v1/a5d8fa15-88ec-4435-b77d-a919ca280111.pdf"},{"id":104434884,"identity":"15b0f606-45e8-4ad9-ad50-a6410f949ff9","added_by":"auto","created_at":"2026-03-11 16:31:01","extension":"docx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":4820820,"visible":true,"origin":"","legend":"\u003cp\u003eSupporting Information:\u003c/p\u003e\n\u003cp\u003eSupplementary Table 1: Candidate CREs and Epigenomic Data across Different Cell Lines (ENCODE Data).\u003c/p\u003e\n\u003cp\u003eSupplementary Fig. S1 Robustness of DeepCis in gene expression prediction.\u003c/p\u003e\n\u003cp\u003eSupplementary Fig. S2 Quantification and identification of key cCREs by DeepCis.\u003c/p\u003e\n\u003cp\u003eSupplementary Fig. S3 DeepCis's ability to capture different cCREs.\u003c/p\u003e\n\u003cp\u003eSupplementary Fig. 
S4 Enrichment analysis of transcription factor binding sites and validation of regulatory directionality.\u003c/p\u003e\n\u003cp\u003eSupplementary Table 2: Candidate CREs that are among the top 20 in frequency and overlap with known super-enhancers.\u003c/p\u003e","description":"","filename":"supportingInformation.docx","url":"https://assets-eu.researchsquare.com/files/rs-8988269/v1/7dd91cd14064cb9d875e2eb6.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"DeepCis: A Transformer-based Multi-Omics Framework for Cross-Scale Explainability in Gene Regulatory Element Identification","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eGene expression dysregulation represents a fundamental characteristic of cancer pathogenesis. CREs, including promoters, enhancers, silencers, and insulators, reside in non-coding genomic regions and precisely orchestrate transcription by recruiting transcription factors (TFs) and chromatin-associated proteins in a spatiotemporal manner\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e,\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e,\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e. Disruption of these CREs through genetic mutations or epigenetic alterations can lead to aberrant transcriptional programs and cancer progression\u003csup\u003e\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e,\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e. Gene regulation is inherently multifactorial, involving synergistic contributions from DNA sequence, chromatin accessibility, histone modifications, and higher-order 3D chromatin organization\u003csup\u003e\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e,\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e. 
Recent advances in high-throughput sequencing and large-scale consortia such as ENCODE and 4D Nucleome have identified millions of cCREs\u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u003c/sup\u003e. However, determining which cCREs functionally influence specific genes, and quantifying their individual contributions to transcriptional output, remains a fundamental challenge in genomics\u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e,\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eExperimental approaches such as CRISPR-Cas9 perturbations can validate regulatory activity at individual loci with high precision, but suffer from limitations in scalability, cost, and throughput, and fail to capture cooperative or compensatory interactions among cCREs across large genomic distances\u003csup\u003e\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u003c/sup\u003e. Chromatin conformation techniques (Hi-C\u003csup\u003e12\u003c/sup\u003e, ChIA-PET\u003csup\u003e\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u003c/sup\u003e, HiChIP\u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e, and PLAC-seq\u003csup\u003e\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e) provide physical contact maps between regulatory regions and promoters, but physical proximity does not necessarily equate to functional regulation.\u003c/p\u003e \u003cp\u003eComputational models have attempted to infer enhancer-gene relationships from sequence, chromatin accessibility, or 3D chromatin organization. 
The Activity-by-Contact (ABC) model conceptualizes enhancer influence as the product of enhancer activity and contact frequency but assumes linear additivity and may fail to account for non-linear cooperative regulation\u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e. Deep learning models trained on binary enhancer-promoter interactions or CRISPR screens are constrained by limited positive samples and poor cross-cell-type generalizability\u003csup\u003e\u003cspan additionalcitationids=\"CR18\" citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e. Basenji2 employs convolutional neural networks to capture local dependencies, but it remains limited in scope and may not effectively capture long-range dependencies\u003csup\u003e\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u003c/sup\u003e. Enformer captures long-range interactions from sequence alone but struggles with consistent performance in unseen cellular contexts\u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e. GraphReg integrates chromatin interactions with multi-omics features but retains bias toward local connectivity\u003csup\u003e\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e. 
CREaTor introduces zero-shot modeling of regulatory patterns from sequence or epigenomics but lacks integration of spatial chromatin organization and does not clearly explain the biological relevance of the key cCREs it identifies\u003csup\u003e\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eThus, a critical unmet need exists for a unified, quantitative, and interpretable framework that (i) integrates genomic, epigenomic, and 3D chromatin organization across multiple scales, (ii) operates without predefined positive or negative regulatory labels, (iii) quantifies the individual and cooperative contributions of each cCRE to gene expression, and (iv) maintains biological interpretability.\u003c/p\u003e \u003cp\u003eHere, we present DeepCis, a Transformer-based multi-omics framework that integrates DNA sequence, base-resolution epigenomic signals, and condensed spatial chromatin organization features derived from Hi-C data. DeepCis employs a zero-shot learning strategy and a two-stage encoder architecture to capture both local sequence determinants and global cCRE-gene interactions. Crucially, we introduce a Regulatory Contribution Score (RCS) to quantitatively assess each cCRE\u0026rsquo;s contribution to transcriptional output. Through multi-level interpretability analyses, we demonstrate that DeepCis robustly identifies key cCREs across cell types, captures both proximal dominance and distal enhancer activity, identifies transcription factor binding sites and cooperative regulation, reveals 3D chromatin constraints on gene regulation, and discovers hub cCREs overlapping super-enhancers. Collectively, DeepCis provides a scalable and interpretable strategy for decoding cis-regulatory architectures, offering mechanistic insights and potential applications in disease research and precision medicine.\u003c/p\u003e"},{"header":"2. 
Results","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003e2.1. Overview of the DeepCis Framework\u003c/h2\u003e \u003cp\u003eDeepCis comprises two cascaded Transformer encoders that integrate genomic sequence, epigenomic signals, and 3D chromatin organization features to quantitatively model cCRE-gene regulatory relationships and coordinative interactions among cCREs in a supervised learning framework. Initially, the base information of cCREs is processed through numerical and one-hot encoding. Epigenomic signals are then incorporated to extract base-level signal features, which are normalized accordingly. For 3D chromatin organization, we construct cCRE contact matrices and perform singular value decomposition (SVD) for dimensionality reduction, obtaining condensed spatial features for each cCRE. This process preserves core spatial information, capturing the relative positional relationships and synergistic regulatory contacts of cCREs in 3D chromatin organization (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ea, Methods).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe DeepCis architecture (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003eb, Methods) incorporates four core modules: the input module, the feature mapping module, the feature fusion encoding module, and the interaction representation encoding module. This architecture enables extraction and fusion of complementary information types to decipher complex gene regulatory mechanisms. The feature mapping module processes each omics data type, converting raw features into high-dimensional representations suitable for integration. The feature fusion encoding module processes cCREs and target gene features, focusing on cross-scale feature fusion to generate abstract representations for each cCRE or gene. We employ distinct fusion strategies for features at different granularities. 
At the base pair level, DeepCis concatenates encoded genomic sequence features with the corresponding epigenomic features because they provide complementary information: genomic sequence supplies the genetic encoding, while epigenomic features indicate active state and regulatory potential. This concatenated feature is downsampled and passed through the first encoder to obtain initial representations of cCREs and gene regions. Relative positional bias is incorporated to capture relative base positions within the sequence, enhancing the model's understanding of internal sequence information. Spatial embedding representations (Methods) are then fused with the initial representation, leveraging the complementary nature of spatial, sequence, and epigenomic information.\u003c/p\u003e \u003cp\u003eThe interaction representation encoding module integrates features from each target gene and its corresponding cCREs for global regulation, modeling relationships between different cCREs and between cCREs and genes. This module identifies key functional cCREs (enhancers and promoters) and transcription factor binding sites by quantifying cCRE contributions through the Regulatory Contribution Score (RCS) (Methods), providing a basis for biological interpretation. Based on abstract features extracted in previous modules, features of cCREs related to the target gene are integrated with the gene\u0026rsquo;s features to create a unified representation optimized for predicting gene expression. To capture precise regulatory relationships, we introduce dual masks based on biological priors (Methods). All DeepCis modules are jointly trained in an end-to-end framework.\u003c/p\u003e \u003cp\u003eDeepCis offers several advantages: First, it integrates comprehensive multi-omics data (genomic sequences, epigenomic signals, and 3D chromatin organization), enhancing the accuracy and richness of cCRE representations. 
Second, it employs a flexible feature fusion mechanism that adapts strategies based on data scale to effectively combine various omics features, improving model expressiveness. Through spatial representations, DeepCis captures the relative positioning of cCREs in 3D chromatin organization, identifying potential synergistic regulatory relationships. The multi-head attention mechanism of the Transformer captures long-range interactions between cCREs and between cCREs and genes. The model also introduces dual mask constraints (Methods). Finally, DeepCis excels in identifying key CREs, recognizing known enhancers, promoters, and transcription factor binding sites, while uncovering potential novel cCREs, providing insights into gene regulatory mechanisms (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ec).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e2.2. Robust gene expression prediction by DeepCis\u003c/h2\u003e \u003cp\u003eWe evaluated DeepCis performance across cell lines and chromosomes. Independent tests were conducted on the test set from chromosome 6 in four cell lines and the unseen HCT116 cell line. DeepCis achieved an overall Pearson correlation coefficient of 0.752 on the test set, with a mean absolute error of 0.708 (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eb). In the K562 cell line, the Pearson correlation for chromosome 6 reached 0.795 (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ec). In the unseen HCT116 cell line, the average Pearson correlation across all chromosomes was 0.720, with the highest correlation of 0.747 on chromosome 22 (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ed). 
These results demonstrate strong generalization capability of DeepCis across different biological contexts.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eTo assess the advantages of DeepCis, we first tested models trained with single-omic data on the test set. Pearson correlation coefficients for single-omic models were consistently lower than those of DeepCis (Supplementary Fig. \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003ea). Notably, Pearson correlation coefficients for single-omic models all exceeded 0.50, indicating that each individual omic data type contains significant predictive information for gene expression, validating the necessity and effectiveness of multi-omics integration. We compared different dimensionalities of condensed spatial features and found that 32-dimensional features achieved the optimal balance between retaining key information and reducing training complexity (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eb). Different fusion strategies were compared, with concatenating genomic and epigenomic features before adding spatial embedding features yielding the highest model accuracy (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea). Model predictive performance remained stable even after removing subsets of features, with only minor changes in Pearson correlation coefficient (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ee, Supplementary Fig. \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003eb), indicating robustness in handling epigenomic features.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e2.3. 
Quantification of cCREs contributions and identification of key cCREs by DeepCis\u003c/h2\u003e \u003cp\u003eTo validate whether high-RCS cCREs predicted by DeepCis correspond to key cCREs in actual regulatory processes, we examined enrichment of enhancers and promoters in the high and low regulatory contribution groups (Methods). We obtained known enhancer data from the FANTOM5 database for the GM12878, HepG2, K562, and HeLaS3 cell lines and identified promoter regions (TSS\u0026thinsp;\u0026plusmn;\u0026thinsp;1 kb) for each gene. Results show significant enrichment of enhancers and promoters in the high regulatory contribution group (p\u0026thinsp;\u0026lt;\u0026thinsp;0.001) (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ea), with an average fold enrichment of 2 in the high-RCS group relative to the low-RCS group (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003eb). This confirms that the attention scores learned by DeepCis correlate strongly with biological functions of gene regulation, providing a solid basis for identifying key cCREs.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eTo provide direct functional evidence, we evaluated DeepCis on an established benchmark dataset of CRISPR/Cas9-validated enhancers and non-enhancers from the K562 cell line, assessing the association between RCSs and functional cCREs, with performance quantified by the AUC. Prior to evaluation, we normalized predicted RCSs within the same gene to eliminate inter-gene variability and highlight relative regulatory strength. Results show close alignment between model-predicted RCSs and functional enhancers, with an AUC of 0.88 (95% CI: 0.87\u0026ndash;0.89) (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ec). Across individual chromosomes, the highest AUC reached 0.96 (chr4) (Supplementary Fig. S2a), demonstrating effective identification of functionally relevant cCREs. 
We further tested the selected optimal epigenomic feature combinations (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ed, Methods), integrating DNA sequence and spatial features. Performance remained stable and high across different numbers of retained epigenomic features (AUC: 0.86\u0026ndash;0.90) (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ee\u0026ndash;i), demonstrating that DeepCis is robust and not heavily dependent on epigenomic features. Notably, H3K79me2, POLR2A, and DNase emerged as key markers influencing the predictive accuracy of gene expression regulation in our model (Supplementary Fig. S2b, Methods), with performance improving as more feature combinations were included in the selection process (Supplementary Fig. S2c). DNase hypersensitive regions mark open chromatin states, indicating enhanced accessibility for transcription factor binding. Previous studies established a functional regulatory role for H3K79me2 through a co-transcriptional splicing mechanism\u003csup\u003e\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e\u003c/sup\u003e. The importance of POLR2A in cancer is well established: it has been identified as a \"synthetic vulnerability\" target in triple-negative breast cancer\u003csup\u003e\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u003c/sup\u003e and shown to drive abnormal tumor cell proliferation in esophageal cancer in cooperation with Bclaf1\u003csup\u003e26\u003c/sup\u003e. Comparison of DeepCis with other methods for identifying key cCREs demonstrated superior performance of DeepCis (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003ej, Methods).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003e2.4. 
DeepCis captures proximal dominance and distal activity of cCREs\u003c/h2\u003e \u003cp\u003ePrevious studies established that the regulatory efficacy of cCREs is strongly influenced by proximity to target genes, exhibiting a clear distance effect\u003csup\u003e\u003cspan additionalcitationids=\"CR28\" citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e\u003c/sup\u003e. To systematically evaluate whether DeepCis captures the distance dependency of cCRE regulation of gene expression, we analyzed the relationship between the RCSs of cCREs learned by the model and their distance to the target gene across five cell lines. Results consistently showed a significant negative correlation between RCSs and distance to the target gene (Spearman correlation coefficients from \u0026minus;0.723 to \u0026minus;0.698, p\u0026thinsp;\u0026lt;\u0026thinsp;0.01) (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ea), indicating that the model captures distance decay effects commonly observed in genomic regulation, consistent with prior studies.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eHowever, accumulating evidence supports critical roles of distal enhancers in complex gene regulatory networks\u003csup\u003e\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e,\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e\u003c/sup\u003e. To assess the model\u0026rsquo;s ability to identify distal regulation, we further analyzed the types of cCREs in the high regulatory contribution group. In the GM12878 cell line, proximal enhancer-like signatures (pELS) accounted for 32.2%, promoter-like signatures (PLS) for 19.0%, distal enhancer-like signatures (dELS) for 15.6%, and CTCF-only and CTCF-bound cCREs for 12.6%. 
To minimize the potential for random variation, we validated results across all chromosomes in each cell line, observing consistent trends (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eb, Supplementary Fig. S3a-S3d). These findings demonstrate that the model integrates both proximal regulatory dominance and key long-range regulatory effects, overcoming simple distance bias.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003e2.5. DeepCis accounts for 3D chromatin organization impact on gene regulation\u003c/h2\u003e \u003cp\u003eThe genome is organized in intricate three-dimensional structures through complex chromatin folding, imposing critical constraints on precise gene regulation\u003csup\u003e\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u003c/sup\u003e. Topologically Associating Domains (TADs) represent fundamental functional units in the 3D organization of the genome: cCREs within the same domain interact more frequently, while cross-TAD interactions are relatively rare\u003csup\u003e\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e\u003c/sup\u003e. We systematically analyzed whether cCREs in the high regulatory contribution group (Methods) were located within the same TAD as their target genes or in different TADs (Methods). Results show that RCSs of cCREs within the same TAD as the target gene are significantly higher than those of cCREs located in different TADs (Mann-Whitney U test, p\u0026thinsp;\u0026lt;\u0026thinsp;0.0001). This significant TAD co-localization between high-contribution cCREs and genes was consistently observed across the analyzed cell lines (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ec, Supplementary Fig. 
S3e), strongly suggesting that DeepCis effectively accounts for the impact of the 3D chromatin organization on gene regulation.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003e2.6. 
DeepCis identifies core transcription factor binding sites associated with active regulation\u003c/h2\u003e \u003cp\u003eTo investigate the association between high RCSs cCREs and the binding of active TFs, we integrated ChIP-seq data for 108 known TFs across five cell lines with target gene information (Methods). We analyzed the correlation between RCSs predicted by DeepCis and enrichment of transcription factor binding sites for these target genes. Comparison of TFs binding site counts between high and low regulation contribution groups (Methods) revealed significant enrichment of 87 TFs on high RCSs cCREs (Fisher\u0026rsquo;s Exact Test, p\u0026thinsp;\u0026lt;\u0026thinsp;0.05), with an average enrichment factor of 2.393 (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ea, Supplementary Fig. S4a), demonstrating that DeepCis effectively identifies core regulatory regions highly associated with transcription factor binding. Notably, some TFs were not enriched in high-contribution cCREs, possibly due to shared regulation of adjacent genes by neighboring cCREs. Additionally, TFs such as CUX1 and FOXP1, implicated in transcriptional repression in existing studies\u003csup\u003e\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e,\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e\u003c/sup\u003e, may suppress gene expression or maintain dormant state upon DNA binding, potentially explaining the lower contribution scores for their associated cCREs.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eTo quantify the relationship between RCSs of cCREs and TF activity, we focused on the cCREs overlapping with TF binding peaks. We analyzed the correlation between RCSs of these cCREs and corresponding TF ChIP-seq signal intensities. 
Two distinct correlation patterns emerged at the gene level: the first group showed a significant positive correlation between RCSs of cCREs and TF binding signal intensity, with an average Spearman correlation coefficient of 0.7, representing an activation mode (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eb). The second group exhibited a negative correlation, with an average Spearman coefficient of -0.77 (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003eb), indicating regulation by transcriptional repression mechanisms. We validated this correlation using tissue-specific data and corresponding cell lines, finding DeepCis predictions consistent with observed directions (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ec, Supplementary Fig. S4b, Methods). For instance, activation of \u003cem\u003eROCK2\u003c/em\u003e by transcription factor NRF1 in the K562 cell line showed a significant positive correlation (Spearman correlation\u0026thinsp;=\u0026thinsp;0.655, p\u0026thinsp;=\u0026thinsp;0.001), corroborated in the acute myeloid leukemia (LAML) cohort (Spearman correlation\u0026thinsp;=\u0026thinsp;0.372, p\u0026thinsp;=\u0026thinsp;\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(2.40 \\times 10^{-6}\\)\u003c/span\u003e\u003c/span\u003e). Conversely, repression of \u003cem\u003eTYROBP\u003c/em\u003e by NRF1 in K562 showed a significant negative correlation (Spearman correlation = -0.354, p\u0026thinsp;=\u0026thinsp;0.004), supported by LAML data (Spearman correlation = -0.240, p\u0026thinsp;=\u0026thinsp;0.003) (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ec, \u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ed). 
However, negative correlations may reflect complex inhibitory mechanisms, such as competitive binding or recruitment of chromatin remodeling complexes that indirectly suppress gene expression. 
Overall, these findings suggest that DeepCis captures multi-level regulatory patterns.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec9\" class=\"Section2\"\u003e \u003ch2\u003e2.7. DeepCis predicts hub CREs in gene regulatory systems\u003c/h2\u003e \u003cp\u003eBeyond quantifying regulatory relationships between genes and cCREs, DeepCis quantifies the synergistic interactions between cCREs, revealing global regulatory architectures. We hypothesize that cCREs frequently forming high-intensity synergistic relationships with other cCREs in different gene regulatory networks may serve as core hubs genome-wide. To test this, we pooled the high regulatory contribution group (Methods) identified for each gene and analyzed these high-contribution cCREs to identify key nodes with global influence.\u003c/p\u003e \u003cp\u003eGenome-wide statistical analysis revealed that only a few cCREs were repeatedly selected across different gene regulatory networks with high occurrence frequency. Some high-frequency cCREs overlapped known super-enhancer regions (Supplementary Table\u0026nbsp;2), which are critical for regulating genes involved in cell identity, development, and disease processes\u003csup\u003e\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e,\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e\u003c/sup\u003e. 
For example, in K562, the chr9:35732261\u0026ndash;35732610 region overlapped with a super-enhancer associated with key genes including \u003cem\u003eCREB3\u003c/em\u003e (transcription factor involved in stress response and cell growth regulation\u003csup\u003e\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e\u003c/sup\u003e), \u003cem\u003eTLN1\u003c/em\u003e (key molecule in cell adhesion, migration, and signal transduction\u003csup\u003e\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e\u003c/sup\u003e), and \u003cem\u003eCA9\u003c/em\u003e (carbonic anhydrase family member promoting tumor growth and metastasis in acidic tumor environments\u003csup\u003e\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e\u003c/sup\u003e). In HCT116, the chr1:1308334\u0026ndash;1308678 region also overlapped a super-enhancer associated with key genes including \u003cem\u003eISG15\u003c/em\u003e (important in antiviral immunity\u003csup\u003e\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e\u003c/sup\u003e) and \u003cem\u003eATAD3A\u003c/em\u003e (related to mitochondrial function\u003csup\u003e\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e\u003c/sup\u003e) (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ed). These findings suggest that these cCREs function as key \"hub\" cCREs orchestrating the regulation of multiple genes through extensive interactions, providing new insights into complex epigenetic regulatory networks.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003e2.8. Causal impact of key cCREs confirmed by perturbation analysis\u003c/h2\u003e \u003cp\u003eTo confirm the causal role of the high RCS cCREs identified by the model in gene expression regulation, we performed progressive in silico genome perturbation experiments in each cell line across chromosomes. 
Specifically, for each gene\u0026rsquo;s associated cCREs, we simulated knockout of 5%, 10%, 15%, 20%, 25%, and 30% of high RCS cCREs, ranked from high to low by RCS, and compared the results with knockouts of the same proportions of low RCS cCREs. We then predicted gene expression following these perturbations. The perturbation analysis demonstrated a decisive impact of high RCS cCREs on gene expression: removing low RCS cCREs did not significantly affect the gene expression prediction correlation, indicating a minimal contribution. In contrast, removal of high RCS cCREs caused a significant drop in prediction correlation across all chromosomes (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003ea). A significant difference between high and low RCS cCREs was observed (Wilcoxon test, p\u0026thinsp;\u0026lt;\u0026thinsp;0.0001), confirming that the high RCS cCREs identified by the model have a direct causal impact on gene expression (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003ea).\u003c/p\u003e \u003cp\u003eFor further validation, we examined the changes in gene expression values. From the unseen HCT116 cell line, we selected the 100 genes with the smallest prediction error on chromosome 6 as stable genes. Gradually removing 10%, 20%, and 30% of high RCS cCREs resulted in a noticeable decrease in the predicted expression values of most stable genes (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003eb). \u003cem\u003eEEF1E1\u003c/em\u003e\u003csup\u003e42\u003c/sup\u003e, a key factor in protein synthesis with a critical role in cellular physiology and pathology, showed significantly decreased expression levels when the cCREs with the top 10% of RCSs were knocked out. 
Notably, these top 10% cCREs were predominantly located in enhancer and promoter regions. This indicates that the reduction in correlation reflects the substantive contribution of key cCREs identified by DeepCis, consistent with experimental findings in which enhancer knockouts decrease target gene expression. This perturbation experiment provides strong evidence for the causal impact of key cCREs and highlights the model\u0026rsquo;s ability to accurately identify core functional cCREs in complex gene regulatory networks.\u003c/p\u003e \u003c/div\u003e"},{"header":"3. Conclusion","content":"\u003cp\u003eThis study presents DeepCis, a novel framework for understanding gene expression regulation mechanisms through the integration of genomic, epigenomic, and 3D chromatin organization data. By combining data across different scales, we capture the complex nature of gene regulation and identify potential CREs and their modes of action. The introduction of the \"regulatory contribution score (RCS)\" as a quantitative measure enables clear assessment of each regulatory element\u0026rsquo;s contribution to gene expression, enhancing model interpretability and providing valuable biological insights into gene regulation mechanisms.\u003c/p\u003e \u003cp\u003eUsing cCREs as the basic unit and constructing their contact matrices with a dimensionality reduction strategy effectively reduces the computational burden of processing large-scale Hi-C data, improving analysis efficiency. In capturing regulatory patterns, the multi-head attention mechanism captures both traditional short-range regulatory effects and the influence of distal cCREs. 
It successfully reveals the constraints of chromatin organization on gene regulation, indicating that the model captures not only simple linear distances but also how complex spatial organization restricts gene regulation.\u003c/p\u003e \u003cp\u003eDeepCis associates high RCS cCREs with transcription factor binding sites, revealing which TFs play key roles in the activation or repression of gene expression. This provides in-depth insights into how TFs regulate gene expression under different conditions, helping to identify and understand critical TFs. DeepCis also identifies core regulatory hub cCREs overlapping super-enhancer regions that frequently appear in multiple gene regulatory networks and regulate the expression of key genes. Super-enhancers play a crucial role in cell identity, development, and disease processes. Therefore, these findings provide new insights into the functions of super-enhancers in gene expression regulation and offer important clues for future epigenetic research.\u003c/p\u003e \u003cp\u003eWe validated the biological functions of key cCREs through enrichment analysis and CRISPR experiments, providing comprehensive biological validation of the model\u0026rsquo;s effectiveness and reliability. Additionally, we confirmed the key role of the high RCS cCREs predicted by the model in gene expression regulation through genomic perturbation experiments from a computational perspective. These combined validation results demonstrate that the model achieves high accuracy at both biological and computational levels, providing strong theoretical and experimental support for a deeper understanding of gene regulation mechanisms.\u003c/p\u003e \u003cp\u003eDespite the progress in improving prediction accuracy and interpretability of gene regulation mechanisms, this study still has certain limitations. 
First, although dimensionality reduction lowers the computational burden of processing Hi-C data, extremely large datasets and higher-resolution Hi-C data may still be constrained by computational resources and time. Second, although our model performs well in capturing long-range interactions between cCREs, in some cases, particularly with complex regulatory networks, the model may not fully capture all subtle interactions and regulatory mechanisms. Future research could further optimize the model structure to enhance handling of complex regulatory networks.\u003c/p\u003e \u003cp\u003eIn summary, the predictive and interpretive capabilities of DeepCis offer broad potential applications in clinical medicine, particularly in precision medicine and disease diagnosis. The model can accurately analyze the impact of non-coding region cCREs on gene regulation, helping to unravel how genes are influenced by multiple regulatory factors and capturing their dynamic interactions, which can assist clinicians in better understanding the mechanisms of diseases. By learning gene regulatory patterns in specific disease states and normal cell lines, the model can identify key cCREs driving diseases. These key cCREs help reveal the underlying regulatory mechanisms behind gene expression abnormalities and can serve as potential drug targets, enabling precise targeted therapies. Despite some limitations, with the continuous accumulation of data and optimization of algorithms, the accuracy and applicability of the model will be further enhanced, offering more personalized and efficient solutions in precision medicine and advancing medical technology.\u003c/p\u003e"},{"header":"4. Methods","content":"\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003e4.1. 
Data collection and preprocessing\u003c/h2\u003e \u003cp\u003eData were obtained from publicly available databases, including candidate cis-regulatory elements (cCREs) in five human cell lines (GM12878, HepG2, K562, HeLaS3 and HCT116), chromatin accessibility and epigenomic modification datasets, 3D chromatin interaction maps (Hi-C), and transcription factor ChIP-seq peaks for each cell line. The details are as follows:\u003c/p\u003e \u003cdiv id=\"Sec14\" class=\"Section3\"\u003e \u003ch2\u003e4.1.1. Selection of cCREs.\u003c/h2\u003e \u003cp\u003eTo derive a high-confidence dataset of cCREs from the SCREEN (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://screen.encodeproject.org/\u003c/span\u003e\u003cspan address=\"https://screen.encodeproject.org/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) database and to enhance computational efficiency in subsequent analyses, we applied a filtering strategy. Specifically, two classes of potentially low-quality cCREs were removed: (i) Low_DNase cCREs, which may correspond to non-functional open chromatin regions; and (ii) DNase_only cCREs, characterized by DNase accessibility in the absence of corroborating epigenomic modifications.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section3\"\u003e \u003ch2\u003e4.1.2. DNA sequence extraction.\u003c/h2\u003e \u003cp\u003eGenomic sequences of cCREs were extracted from the GRCh38 reference genome using bedtools getfasta (v2.31.1). For compatibility with the deep learning model, nucleotide bases were numerically encoded (A\u0026thinsp;=\u0026thinsp;1, T\u0026thinsp;=\u0026thinsp;2, C\u0026thinsp;=\u0026thinsp;3, G\u0026thinsp;=\u0026thinsp;4, N\u0026thinsp;=\u0026thinsp;5). 
All extracted sequences were subsequently padded to a uniform length of 350 bp (P\u0026thinsp;=\u0026thinsp;0), corresponding to the maximum cCRE length across cell types.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section3\"\u003e \u003ch2\u003e4.1.3. ChIP-seq data processing.\u003c/h2\u003e \u003cp\u003eIn this study, ChIP-seq data corresponding to 16 different assays for cCREs (CTCF, DNase, EP300, H2AFZ, H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K79me2, H3K9ac, H3K9me3, H4K20me1, POLR2A, POLR2AphosphoS5) (Supplementary Table\u0026nbsp;1) were downloaded from the ENCODE database (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.encodeproject.org/\u003c/span\u003e\u003cspan address=\"https://www.encodeproject.org/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) and processed to extract signal features and perform normalization. First, based on the genomic coordinates of the cCREs, signal values at single-base resolution were extracted from BigWig files using the pyBigWig library, generating a site-level signal matrix. The signal results from all chromosomes were then merged, followed by a log(1\u0026thinsp;+\u0026thinsp;x) transformation. Subsequently, the features were normalized to the range [0, 1] using Min-Max scaling, preparing the data for downstream feature analysis and modeling.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section3\"\u003e \u003ch2\u003e4.1.4. Hi-C data processing.\u003c/h2\u003e \u003cp\u003eHi-C data from the 4D Nucleome (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://4dnucleome.org/\u003c/span\u003e\u003cspan address=\"https://4dnucleome.org/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) and ENCODE databases (Supplementary Table\u0026nbsp;1) were processed according to a standardized protocol. 
For the raw .hic files from the 4D Nucleome, we used the hic2cool (v0.1.1) tool to directly convert them into .cool format files with a 10 kb resolution. For the raw .hic files from ENCODE, we first extracted the 10 kb raw contact frequency matrices using the hicstraw library, and then generated .cool files at the same 10 kb resolution using the cooler tool. To eliminate systematic biases introduced by sequencing depth and technical preferences, all converted .cool format data were normalized using the Iterative Correction and Eigenvector Decomposition (ICE) method.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section3\"\u003e \u003ch2\u003e4.1.5. Construction of cCRE contact matrices based on Hi-C data.\u003c/h2\u003e \u003cp\u003eTo quantitatively analyze the spatial contact frequencies between cCREs, we constructed contact matrices with cCREs as the basic unit using Hi-C data at a 10 kb resolution. Each cCRE segment was mapped to the coordinate system of the Hi-C matrix. For any given cCRE segment, we identified all bins that overlap with its genomic location. If a segment spans multiple genomic bins, these bins collectively represent the position of the cCRE in the Hi-C contact matrix. 
The contact strength between two cCRE segments was defined as:\u003c/p\u003e \u003cp\u003e \u003cspan class=\"InlineEquation\"\u003e \u003cspan class=\"mathinline\"\u003e\\({I_{ij}}=\\frac{1}{{|{B_i}| \\cdot |{B_j}|}}\\sum\\limits_{{u \\in {B_i}}} {\\sum\\limits_{{v \\in {B_j}}} {{M_{normalized}}(u,v)} }\\)\u003c/span\u003e \u003c/span\u003e \u003c/p\u003e \u003cp\u003ewhere \u003cem\u003eM\u003c/em\u003e\u003csub\u003e\u003cem\u003enormalized\u003c/em\u003e\u003c/sub\u003e(\u003cem\u003eu,v\u003c/em\u003e) represents the normalized Hi-C contact frequency between bin \u003cem\u003eu\u003c/em\u003e and bin \u003cem\u003ev\u003c/em\u003e, \u003cem\u003eB\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e is the set of Hi-C bins covered by cCRE\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e, and \u003cem\u003eB\u003c/em\u003e\u003csub\u003e\u003cem\u003ej\u003c/em\u003e\u003c/sub\u003e is the set of Hi-C bins covered by cCRE\u003csub\u003e\u003cem\u003ej\u003c/em\u003e\u003c/sub\u003e. |\u003cem\u003eB\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e| and |\u003cem\u003eB\u003c/em\u003e\u003csub\u003e\u003cem\u003ej\u003c/em\u003e\u003c/sub\u003e| represent the number of bins covered by each respective cCRE.\u003c/p\u003e \u003cp\u003eFor each chromosome, a cCRE contact matrix \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({M_{CRE}} \\in {R^{n \\times n}}\\)\u003c/span\u003e\u003c/span\u003e was generated, where \u003cem\u003en\u003c/em\u003e denotes the number of valid cCREs successfully mapped on the chromosome.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec19\" class=\"Section3\"\u003e \u003ch2\u003e4.1.6. 
Acquisition of condensed spatial features for cCREs.\u003c/h2\u003e \u003cp\u003eTo obtain the spatial features of each cCRE in three-dimensional space, we applied the Truncated Singular Value Decomposition (SVD) dimensionality reduction algorithm to map the high-dimensional contact matrix of cCREs to a lower-dimensional latent space. The top k components (where k\u0026thinsp;=\u0026thinsp;32 or 64) were retained, ensuring that the cumulative explained variance exceeded 50% and preserving the essential topological and distance relationships. The resulting features were then scaled to the range [0, 1] using Min-Max scaling, ensuring consistent feature scaling and allowing the reduced vectors to capture spatial proximity information for the cCREs. In Hi-C data, contact frequencies reflect the proximity of chromatin fragments in three-dimensional space. cCREs with similar interaction patterns correspond to low-dimensional vectors with high similarity, indicating that these segments are likely to be spatially proximal in the three-dimensional genome and reflecting co-localization trends. Therefore, the low-dimensional vectors also approximate the relative positioning of cCREs in three-dimensional space. Additionally, cCREs with similar features in certain component dimensions may share similar regulatory functions, reflecting cooperative regulatory trends. In summary, this spatial condensation method not only condenses the high-dimensional contact information between cCREs but also retains the core features of their spatial structure.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec20\" class=\"Section3\"\u003e \u003ch2\u003e4.1.7. RNA data processing.\u003c/h2\u003e \u003cp\u003eFor each gene, the transcription start site (TSS) of the transcript with the highest expression was selected as the representative TSS. 
Additionally, the expression values of all transcripts corresponding to a gene were summed to represent the gene\u0026rsquo;s overall expression level. To address data skewness and stabilize variance, a logarithmic transformation (log\u003csub\u003e10\u003c/sub\u003e(TPM\u0026thinsp;+\u0026thinsp;1)) was applied.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec21\" class=\"Section3\"\u003e \u003ch2\u003e4.1.8. Selection of cCREs in the proximal regions of genes.\u003c/h2\u003e \u003cp\u003eThe TSS of each gene was used to identify cCREs in its upstream and downstream regions. The distance between all cCREs on the same chromosome and the TSS was calculated, and the 90 closest cCREs on each side (upstream and downstream, 180 in total) were selected. To focus on those most likely involved in gene regulation, only cCREs within 1.5 Mb of the TSS were retained, a distance sufficient to capture typical cCREs.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec22\" class=\"Section3\"\u003e \u003ch2\u003e4.1.9. Transcription factor and target gene retrieval.\u003c/h2\u003e \u003cp\u003eTranscription factors (TFs) were retrieved by performing motif enrichment analysis on all cCREs for each cell line using HOMER. Significant motifs (P\u0026thinsp;\u0026lt;\u0026thinsp;0.05) were selected and mapped to corresponding TFs. To ensure data validity, only TFs with available ChIP-seq data in the ENCODE database were retained for analysis. 
We integrated the target genes of TFs from the TRRUST (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.grnpedia.org/trrust/\u003c/span\u003e\u003cspan address=\"https://www.grnpedia.org/trrust/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) and hTFtarget (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://guolab.wchscu.cn/hTFtarget/#!/\u003c/span\u003e\u003cspan address=\"https://guolab.wchscu.cn/hTFtarget/#!/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) databases.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec23\" class=\"Section3\"\u003e \u003ch2\u003e4.1.10. Enhancer and Super-Enhancer Data Retrieval.\u003c/h2\u003e \u003cp\u003eThe enhancer data for GM12878, HepG2, K562, and HeLaS3 cell lines were downloaded from the FANTOM5 database (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://fantom.gsc.riken.jp/5/\u003c/span\u003e\u003cspan address=\"https://fantom.gsc.riken.jp/5/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e), while the super-enhancers for the GM12878, HepG2, K562, HeLaS3 and HCT116 cell lines were retrieved from the dbSUPER database (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://asntech.org/dbsuper/\u003c/span\u003e\u003cspan address=\"https://asntech.org/dbsuper/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e).\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec24\" class=\"Section2\"\u003e \u003ch2\u003e4.2 Dataset Construction\u003c/h2\u003e \u003cp\u003eFor each cell line, each gene was treated as a sample. The model's input for each gene comprised a concatenated sequence of its cCREs, which were individually represented by a 350-base sequence. Each base was encoded with a 6-dimensional one-hot vector and 16-dimensional epigenetic features. 
Spatial information was captured by 32-dimensional spatial features for each element, supplemented by linear chromosomal positions and gene-level positional data. A mask vector was used to handle variable numbers of cCREs per gene. The target labels for the model were the transformed gene expression values.\u003c/p\u003e \u003cp\u003eTo ensure the model focused solely on the impact of cCREs, gene sequences and their corresponding epigenetic signals at the gene locus were intentionally masked. This approach prevented information leakage and allowed the model to learn the regulatory relationship through the \u0026ldquo;pathway\u0026rdquo; between cCREs and their target genes.\u003c/p\u003e \u003cp\u003eTo evaluate the model\u0026rsquo;s generalization and robustness across different genomic contexts, a chromosome-based partitioning strategy was employed. Chromosome 16 was designated as the validation set due to its enrichment in low-copy repeat sequences and structural variation hotspots\u003csup\u003e\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e,\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e\u003c/sup\u003e, presenting a challenge for the model. Chromosome 6 was chosen as the test set, as its moderate length, high gene density, and complex regions\u003csup\u003e\u003cspan additionalcitationids=\"CR46\" citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e\u003c/sup\u003e make it a representative and robust benchmark for evaluating performance on unseen genomic contexts. All other autosomes were used for training.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec25\" class=\"Section2\"\u003e \u003ch2\u003e4.3 Model Design\u003c/h2\u003e \u003cdiv id=\"Sec26\" class=\"Section3\"\u003e \u003ch2\u003e4.3.1. 
Model Architecture Design.\u003c/h2\u003e \u003cp\u003eDeepCis integrates information from cCREs and genes to predict target gene expression levels and model regulatory relationships using two cascaded Transformer encoders. The core of the Transformer encoder is the multi-head attention mechanism, where attention weights are calculated for each token in each head relative to other tokens in the sequence, capturing global contextual relationships\u003csup\u003e\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e\u003c/sup\u003e. The outputs of all heads are concatenated and passed through a linear transformation to obtain the final multi-head output.\u003c/p\u003e \u003cp\u003eThe model first maps genomic sequence features, epigenetic features, and spatial embedding features into continuous 256-dimensional vectors. These mapped features are concatenated along their dimensions and passed through a linear layer and layer normalization to form a unified 256-dimensional joint feature. In the feature extraction phase, this joint feature is input into a 1D convolutional module with a stride of 5 to extract local patterns and downsample the sequence length. The extracted feature sequence is then fed into the first Transformer encoder, where relative positional bias is applied to incorporate base positional information. The 256-dimensional representation of the final token from the encoder output is used as an initial feature for the cCREs or genes. The spatial embedding features are then added to this initial feature to form the final feature representation for each cCRE and gene.\u003c/p\u003e \u003cp\u003eIn subsequent modeling, the target gene and its 180 corresponding cCREs serve as input sequences to the second Transformer encoder. 
Multi-scale relative positional biases are constructed based on the genomic coordinates of the cCREs and target gene, enabling the model to effectively incorporate the impact of three-dimensional spatial constraints on gene regulation at different resolutions. Finally, the output of the second encoder is layer-normalized, and the feature representation corresponding to the target gene token is extracted and mapped to a one-dimensional output via a fully connected layer to predict the target gene\u0026rsquo;s expression.\u003c/p\u003e \u003cp\u003eWe extracted the attention scores from the final layer of the second encoder and aggregated the attention values across all positions by averaging. The resulting score, termed the Regulatory Contribution Score (RCS), quantifies the degree of attention a gene gives to cCREs or the attention one cCRE gives to another, thereby quantifying the contribution of each cCRE to gene expression regulation.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec27\" class=\"Section3\"\u003e \u003ch2\u003e4.3.2 Dual masks design.\u003c/h2\u003e \u003cp\u003eTo reflect the biological directionality of regulation, a base mask is introduced in the encoder to restrict information flow, allowing cCREs to focus on genes while limiting the reverse flow from cCREs to genes, modeling the directionality of regulation. Another mask in DeepCis allows for the selective masking of one or more cCREs within the input, enabling precise investigation of the role of cCREs.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec28\" class=\"Section3\"\u003e \u003ch2\u003e4.3.3 Model Training.\u003c/h2\u003e \u003cp\u003eThe model was trained using the AdamW optimizer with a ReduceLROnPlateau learning rate scheduler, which dynamically adjusts the learning rate based on the validation loss. The loss function employed was mean squared error (MSE). 
To prevent overfitting, early stopping was applied, halting training if the validation loss did not improve for 15 consecutive epochs. Dropout with a rate of 0.1 was used during training. In the initialization phase, the model was pre-initialized with the CREaTor\u003csup\u003e\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e\u003c/sup\u003e pretrained model, with incompatible parameters initialized using Xavier and Kaiming random initialization. The DNA embedding layer was frozen, while all other layers were updated during training.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec29\" class=\"Section3\"\u003e \u003ch2\u003e4.3.4. Model Comparison.\u003c/h2\u003e \u003cp\u003eTo further evaluate the performance of our model, we referred to the K562 functional enhancer dataset, validated through CRISPR/Cas9 gene editing, which is provided in CREaTor\u003csup\u003e\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u003c/sup\u003e. Using the results from CREaTor, we compared the performance of DeepCis, ABC, Enformer, GraphReg, and CREaTor in predicting enhancers in the K562 cell line.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec30\" class=\"Section2\"\u003e \u003ch2\u003e4.4. Definition of Regulatory Contribution Element Grouping\u003c/h2\u003e \u003cp\u003eFor each gene, the corresponding cCREs are ranked by their regulatory contribution scores (RCSs) from high to low. The top 10% of cCREs are defined as the \"high regulatory contribution group,\" while the bottom 10% are defined as the \"low regulatory contribution group.\"\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec31\" class=\"Section2\"\u003e \u003ch2\u003e4.5. 
Identification of Optimal Epigenomic Feature Combinations\u003c/h2\u003e \u003cp\u003eWe performed an exhaustive search to identify the epigenomic feature combinations that maximize the Pearson correlation coefficient on the test set for chromosome 6, retaining 2 to 6 features for each cell line.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec32\" class=\"Section2\"\u003e \u003ch2\u003e4.6. TAD Identification\u003c/h2\u003e \u003cp\u003eIn this study, we used 40 kb resolution Hi-C data and the hicFindTADs tool to identify TADs in various cell lines.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec33\" class=\"Section2\"\u003e \u003ch2\u003e4.7. Validation of regulatory direction using TCGA data\u003c/h2\u003e \u003cp\u003eTo validate the regulatory direction predicted by DeepCis, we acquired TCGA datasets from the Genomic Data Commons (GDC) Data Portal (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://portal.gdc.cancer.gov/\u003c/span\u003e\u003cspan address=\"https://portal.gdc.cancer.gov/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e), encompassing multiple cancer types: diffuse large B-cell lymphoma (TCGA-DLBC), liver hepatocellular carcinoma (TCGA-LIHC), acute myeloid leukemia (TCGA-LAML), cervical squamous cell carcinoma and endocervical adenocarcinoma (TCGA-CESC), and colon adenocarcinoma (TCGA-COAD). For samples in these datasets, we calculated the correlation between the expression levels of TFs and their corresponding target genes. We then compared these correlations with the relationships between the predicted RCSs and the ChIP-seq signal intensities of the corresponding TFs. By examining whether the signs of the correlations were consistent, we validated the consistency of the regulatory direction. 
A match in sign between the TF-target gene expression correlation and the correlation between the model's predicted RCSs and TF ChIP-seq signal indicates that the regulatory direction is aligned, supporting the accuracy of our model.\u003c/p\u003e \u003c/div\u003e"},{"header":"Declarations","content":"\u003ch2\u003eConflicts of Interest\u003c/h2\u003e\n\u003cp\u003eThe authors declare no conflicts of interest.\u003c/p\u003e\n\u003ch2\u003eFunding:\u003c/h2\u003e\n\u003cp\u003eThis work was supported by the National Natural Science Foundation of China (32170674).\u003c/p\u003e\n\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\n\u003cp\u003eH.Z., S.W.N. and P.W. conceived the research direction and guided the research design. X.R.L. designed the study, drafted the initial manuscript and performed data analysis. W.B.D., J.J.D., M.M.L. and S.S.W. conducted data collection. M.Y.X., Y.L.X., H.B.Z. and Y.S.Z. verified the research results. W.F. and Y.J.L. were responsible for data visualization. All authors critically revised the manuscript for important intellectual content and approved the final version for publication.\u003c/p\u003e\n\u003ch2\u003eAcknowledgements\u003c/h2\u003e\n\u003cp\u003eThe authors thank the National Natural Science Foundation of China (32170674).\u003c/p\u003e\n\u003ch2\u003eData Availability\u003c/h2\u003e\n\u003cp\u003eThe data that support the findings of this study are available upon request. 
The following databases were used in this research: SCREEN (http://screen.encodeproject.org/), ENCODE (https://www.encodeproject.org/), 4D Nucleome (https://4dnucleome.org/), FANTOM5 (https://fantom.gsc.riken.jp/5/), TRRUST (https://www.grnpedia.org/trrust/), hTFtarget (https://guolab.wchscu.cn/hTFtarget/#!/), Genomic Data Commons (GDC) (https://portal.gdc.cancer.gov/), and dbSUPER (https://asntech.org/dbsuper/).\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eWittkopp, P. J. \u0026amp; Kalay, G. Cis-regulatory elements: molecular mechanisms and evolutionary processes underlying divergence. \u003cem\u003eNat. Rev. Genet.\u003c/em\u003e \u003cstrong\u003e13\u003c/strong\u003e, 59\u0026ndash;69 (2012).\u003c/li\u003e\n\u003cli\u003eOng, C.-T. \u0026amp; Corces, V. G. Enhancer function: new insights into the regulation of tissue-specific gene expression. \u003cem\u003eNat. Rev. Genet.\u003c/em\u003e \u003cstrong\u003e12\u003c/strong\u003e, 283\u0026ndash;293 (2011).\u003c/li\u003e\n\u003cli\u003eLee, T. I. \u0026amp; Young, R. A. Transcriptional Regulation and Its Misregulation in Disease. \u003cem\u003eCell\u003c/em\u003e \u003cstrong\u003e152\u003c/strong\u003e, 1237\u0026ndash;1251 (2013).\u003c/li\u003e\n\u003cli\u003eCis-Regulatory Alterations in FOXP4 Modulate Esophageal Cancer Susceptibility Induced by Chronic Alcohol Exposure - PubMed. https://pubmed.ncbi.nlm.nih.gov/40228145/.\u003c/li\u003e\n\u003cli\u003eCastro-Mondragon, J. A. \u003cem\u003eet al.\u003c/em\u003e Cis-regulatory mutations associate with transcriptional and post-transcriptional deregulation of gene regulatory programs in cancers. 
\u003cem\u003eNucleic Acids Res.\u003c/em\u003e \u003cstrong\u003e50\u003c/strong\u003e, 12131\u0026ndash;12148 (2022).\u003c/li\u003e\n\u003cli\u003eKim, K. L. \u003cem\u003eet al.\u003c/em\u003e Dissection of a CTCF topological boundary uncovers principles of enhancer-oncogene regulation. \u003cem\u003eMol. Cell\u003c/em\u003e \u003cstrong\u003e84\u003c/strong\u003e, 1365-1376.e7 (2024).\u003c/li\u003e\n\u003cli\u003eZhao, S. G. \u003cem\u003eet al.\u003c/em\u003e Integrated analyses highlight interactions between the three-dimensional genome and DNA, RNA and epigenomic alterations in metastatic prostate cancer. \u003cem\u003eNat. Genet.\u003c/em\u003e \u003cstrong\u003e56\u003c/strong\u003e, 1689\u0026ndash;1700 (2024).\u003c/li\u003e\n\u003cli\u003eENCODE data at the ENCODE portal - PubMed. https://pubmed.ncbi.nlm.nih.gov/26527727/.\u003c/li\u003e\n\u003cli\u003eThomas, H. F. \u003cem\u003eet al.\u003c/em\u003e Enhancer cooperativity can compensate for loss of activity over large genomic distances. \u003cem\u003eMol. Cell\u003c/em\u003e \u003cstrong\u003e85\u003c/strong\u003e, 362-375.e9 (2025).\u003c/li\u003e\n\u003cli\u003eCosta, J. R. \u003cem\u003eet al.\u003c/em\u003e Transcription factor cooperativity at a GATA3 tandem DNA sequence determines oncogenic enhancer-mediated activation. \u003cem\u003eCell Rep.\u003c/em\u003e \u003cstrong\u003e44\u003c/strong\u003e, 115705 (2025).\u003c/li\u003e\n\u003cli\u003eEnhancing CRISPR-Cas9 gRNA efficiency prediction by data integration and deep learning | Nature Communications. https://www.nature.com/articles/s41467-021-23576-0#ethics.\u003c/li\u003e\n\u003cli\u003eHi\u0026ndash;C: A comprehensive technique to capture the conformation of genomes. \u003cem\u003eMethods\u003c/em\u003e \u003cstrong\u003e58\u003c/strong\u003e, 268\u0026ndash;276 (2012).\u003c/li\u003e\n\u003cli\u003eChIA-PET tool for comprehensive chromatin interaction analysis with paired-end tag sequencing | Genome Biology. 
https://link.springer.com/article/10.1186/gb-2010-11-2-r22.\u003c/li\u003e\n\u003cli\u003eHiChIP: efficient and sensitive analysis of protein-directed genome architecture | Nature Methods. https://www.nature.com/articles/nmeth.3999.\u003c/li\u003e\n\u003cli\u003eProximity Ligation-Assisted ChIP-Seq (PLAC-Seq) | SpringerLink. https://link.springer.com/protocol/10.1007/978-1-0716-1597-3_10.\u003c/li\u003e\n\u003cli\u003eFulco, C. P. \u003cem\u003eet al.\u003c/em\u003e Activity-by-Contact model of enhancer-promoter regulation from thousands of CRISPR perturbations. \u003cem\u003eNat. Genet.\u003c/em\u003e \u003cstrong\u003e51\u003c/strong\u003e, 1664\u0026ndash;1669 (2019).\u003c/li\u003e\n\u003cli\u003eAhmed, F. S., Aly, S. \u0026amp; Liu, X. EPI-Trans: an effective transformer-based deep learning model for enhancer promoter interaction prediction. \u003cem\u003eBMC Bioinformatics\u003c/em\u003e \u003cstrong\u003e25\u003c/strong\u003e, 216 (2024).\u003c/li\u003e\n\u003cli\u003eZhang, T., Zhao, X., Sun, H., Gao, B. \u0026amp; Liu, X. GATv2EPI: Predicting Enhancer-Promoter Interactions with a Dynamic Graph Attention Network. \u003cem\u003eGenes\u003c/em\u003e \u003cstrong\u003e15\u003c/strong\u003e, 1511 (2024).\u003c/li\u003e\n\u003cli\u003eKansformerEPI: a deep learning framework integrating KAN and transformer for predicting enhancer-promoter interactions - PubMed. https://pubmed.ncbi.nlm.nih.gov/40515390/.\u003c/li\u003e\n\u003cli\u003eKelley, D. R. Cross-species regulatory sequence activity prediction. \u003cem\u003ePLoS Comput. Biol.\u003c/em\u003e \u003cstrong\u003e16\u003c/strong\u003e, e1008050 (2020).\u003c/li\u003e\n\u003cli\u003eAvsec, Ž. \u003cem\u003eet al.\u003c/em\u003e Effective gene expression prediction from sequence by integrating long-range interactions. \u003cem\u003eNat. 
Methods\u003c/em\u003e \u003cstrong\u003e18\u003c/strong\u003e, 1196\u0026ndash;1203 (2021).\u003c/li\u003e\n\u003cli\u003eChromatin interaction-aware gene regulatory modeling with graph attention networks - PubMed. https://pubmed.ncbi.nlm.nih.gov/35396274/.\u003c/li\u003e\n\u003cli\u003eLi, Y. \u003cem\u003eet al.\u003c/em\u003e CREaTor: zero-shot cis-regulatory pattern modeling with attention mechanisms. \u003cem\u003eGenome Biol.\u003c/em\u003e \u003cstrong\u003e24\u003c/strong\u003e, 266 (2023).\u003c/li\u003e\n\u003cli\u003eIntegrative analysis reveals functional and regulatory roles of H3K79me2 in mediating alternative splicing | Genome Medicine. https://link.springer.com/article/10.1186/s13073-018-0538-1.\u003c/li\u003e\n\u003cli\u003ePrecise targeting of POLR2A as a therapeutic strategy for human triple negative breast cancer | Nature Nanotechnology. https://www.nature.com/articles/s41565-019-0381-6.\u003c/li\u003e\n\u003cli\u003eBclaf1 mediates super-enhancer-driven activation of POLR2A to enhance chromatin accessibility in nitrosamine-induced esophageal carcinogenesis. \u003cem\u003eJ. Hazard. Mater.\u003c/em\u003e \u003cstrong\u003e492\u003c/strong\u003e, 138218 (2025).\u003c/li\u003e\n\u003cli\u003eWhy the activity of a gene depends on its neighbors: Trends in Genetics. https://www.cell.com/trends/genetics/abstract/S0168-9525(15)00129-8.\u003c/li\u003e\n\u003cli\u003eSchindler, M. \u003cem\u003eet al.\u003c/em\u003e Induction of kidney-related gene programs through co-option of SALL1 in mole ovotestes. \u003cem\u003eDevelopment\u003c/em\u003e \u003cstrong\u003e150\u003c/strong\u003e, dev201562 (2023).\u003c/li\u003e\n\u003cli\u003eYoung LINE-1 transposon 5\u0026prime; UTRs marked by elongation factor ELL3 function as enhancers to regulate na\u0026iuml;ve pluripotency in embryonic stem cells | Nature Cell Biology. 
https://www.nature.com/articles/s41556-023-01211-y.\u003c/li\u003e\n\u003cli\u003ePromoter-proximal CTCF binding promotes distal enhancer-dependent gene activation | Nature Structural \u0026amp; Molecular Biology. https://www.nature.com/articles/s41594-020-00539-5.\u003c/li\u003e\n\u003cli\u003eSMARCB1 loss activates patient-specific distal oncogenic enhancers in malignant rhabdoid tumors | Nature Communications. https://www.nature.com/articles/s41467-023-43498-3.\u003c/li\u003e\n\u003cli\u003eTAD boundary deletion causes PITX2-related cardiac electrical and structural defects | Nature Communications. https://www.nature.com/articles/s41467-024-47739-x.\u003c/li\u003e\n\u003cli\u003eCUX1 restrains latent hematopoietic stem cell plasticity by suppressing stem cell-intrinsic inflammatory pathways. \u003cem\u003eBlood\u003c/em\u003e https://doi.org/10.1182/blood.2024026815 (2025) doi:10.1182/blood.2024026815.\u003c/li\u003e\n\u003cli\u003eFOXP1 directly represses transcription of proapoptotic genes and cooperates with NF-\u0026kappa;B to promote survival of human B cells | Blood | American Society of Hematology. https://ashpublications.org/blood/article/124/23/3431/33456/FOXP1-directly-represses-transcription-of.\u003c/li\u003e\n\u003cli\u003eMechanism of ERBB2 gene overexpression by the formation of super-enhancer with genomic structural abnormalities in lung adenocarcinoma without clinically actionable genetic alterations | Molecular Cancer. https://link.springer.com/article/10.1186/s12943-024-02035-6.\u003c/li\u003e\n\u003cli\u003eSuper-enhancer-driven SLCO4A1-AS1 is a new biomarker and a promising therapeutic target in glioblastoma | Scientific Reports. https://www.nature.com/articles/s41598-024-82109-z.\u003c/li\u003e\n\u003cli\u003eDysregulated CREB3 cleavage at the nuclear membrane induces karyoptosis-mediated cell death | Experimental \u0026amp; Molecular Medicine. 
https://www.nature.com/articles/s12276-024-01195-1.\u003c/li\u003e\n\u003cli\u003eTLN1: an oncogene associated with tumorigenesis and progression | Discover Oncology. https://link.springer.com/article/10.1007/s12672-024-01593-x.\u003c/li\u003e\n\u003cli\u003eTargeting CA9 restricts pancreatic cancer progression through pH regulation and ROS production | Cellular Oncology. https://link.springer.com/article/10.1007/s13402-024-01022-9.\u003c/li\u003e\n\u003cli\u003eThe diverse repertoire of ISG15: more intricate than initially thought | Experimental \u0026amp; Molecular Medicine. https://www.nature.com/articles/s12276-022-00872-3.\u003c/li\u003e\n\u003cli\u003eATAD3A: A Key Regulator of Mitochondria-Associated Diseases. https://www.mdpi.com/1422-0067/24/15/12511.\u003c/li\u003e\n\u003cli\u003eA Novel HCC Prognosis Predictor EEF1E1 Is Related to Immune Infiltration and May Be Involved in EEF1E1/ATM/p53 Signaling - PubMed. https://pubmed.ncbi.nlm.nih.gov/34282404/.\u003c/li\u003e\n\u003cli\u003eClinical Implications of Chromosome 16 Copy Number Variation | Molecular Syndromology | Karger Publishers. https://karger.com/msy/article-abstract/13/3/184/825184/Clinical-Implications-of-Chromosome-16-Copy-Number?redirectedFrom=fulltext.\u003c/li\u003e\n\u003cli\u003eMartin, J. \u003cem\u003eet al.\u003c/em\u003e The sequence and analysis of duplication-rich human chromosome 16. \u003cem\u003eNature\u003c/em\u003e \u003cstrong\u003e432\u003c/strong\u003e, 988\u0026ndash;994 (2004).\u003c/li\u003e\n\u003cli\u003eCopy Neutral LOH Affecting the Entire Chromosome 6 Is a Frequent Mechanism of HLA Class I Alterations in Cancer. https://www.mdpi.com/2072-6694/13/20/5046.\u003c/li\u003e\n\u003cli\u003eDNA Methylation and Transcription of HLA-F and Serum Cytokines Relate to Chinese Medicine Syndrome Classification in Patients with Chronic Hepatitis B | Chinese Journal of Integrative Medicine. 
https://link.springer.com/article/10.1007/s11655-021-3279-8.\u003c/li\u003e\n\u003cli\u003eThomas, C. \u003cem\u003eet al.\u003c/em\u003e TERT promoter mutation and chromosome 6 loss define a high-risk subtype of ependymoma evolving from posterior fossa subependymoma. \u003cem\u003eActa Neuropathol. (Berl.)\u003c/em\u003e \u003cstrong\u003e141\u003c/strong\u003e, 959\u0026ndash;970 (2021).\u003c/li\u003e\n\u003cli\u003eVaswani, A. \u003cem\u003eet al.\u003c/em\u003e Attention is All you Need. in \u003cem\u003eAdvances in Neural Information Processing Systems\u003c/em\u003e vol. 30 (Curran Associates, Inc., 2017).\u003c/li\u003e\n\u003cli\u003eLi, Y. \u003cem\u003eet al.\u003c/em\u003e CREaTor: zero-shot cis-regulatory pattern modeling with attention mechanisms. \u003cem\u003eGenome Biol.\u003c/em\u003e \u003cstrong\u003e24\u003c/strong\u003e, 266 (2023).\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"DeepCis, Cis-regulatory elements, Multi-omics, Gene 
regulation","lastPublishedDoi":"10.21203/rs.3.rs-8988269/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8988269/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eDysregulation of gene expression is a hallmark of cancer, wherein cis-regulatory elements (CREs) in non-coding genomic regions play pivotal roles in precise transcriptional control. Although numerous candidate CREs (cCREs) have been identified, the complex regulatory relationships between cCREs and their target genes remain poorly characterized, and current computational approaches are inadequate for large-scale prediction of functional cCREs. Here, we introduce DeepCis, a Transformer-based multi-omics framework that integrates base-level genomic sequence and epigenomic features, as well as region-level 3D chromatin organization, to capture regulatory information across multiple scales. DeepCis quantitatively assesses the contribution of individual cCREs to gene regulation and reveals the cooperative interactions among cCREs. Our framework demonstrates robust performance in gene expression prediction, effectively identifying key cCREs and their specific contributions to transcriptional output. The model captures both proximal dominance and distal activity of cCREs and elucidates how 3D chromatin organization constrains gene regulation. Furthermore, DeepCis successfully identifies transcription factor binding sites associated with active regulation. 
Through comprehensive multi-level interpretability analyses, we validate the reliability and effectiveness of DeepCis in deciphering gene regulation mechanisms, providing a powerful foundation for advancing our understanding of transcriptional regulation and enabling applications in precision medicine.\u003c/p\u003e","manuscriptTitle":"DeepCis: A Transformer-based Multi-Omics Framework for Cross-Scale Explainability in Gene Regulatory Element Identification","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-03-11 16:30:56","doi":"10.21203/rs.3.rs-8988269/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"reviewerAgreed","content":"291124535812391456626523260606215091425","date":"2026-04-27T16:41:07+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2026-03-06T01:44:06+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2026-03-03T13:23:46+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2026-02-28T12:19:48+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2026-02-28T12:17:43+00:00","index":"","fulltext":""},{"type":"submitted","content":"Scientific Reports","date":"2026-02-27T12:43:54+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"bbc68a43-4f2b-4a45-9df1-c2d3d7c36129","owner":[],"postedDate":"March 11th, 
2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[{"id":64111919,"name":"Biological sciences/Computational biology and bioinformatics"},{"id":64111920,"name":"Biological sciences/Genetics"}],"tags":[],"updatedAt":"2026-03-11T16:30:56+00:00","versionOfRecord":[],"versionCreatedAt":"2026-03-11 16:30:56","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8988269","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8988269","identity":"rs-8988269","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
