Modeling nascent transcription from chromatin landscape and structure

doi:10.1101/2024.06.04.597340

Modeling nascent transcription from chromatin landscape and structure

2024 · doi:10.1101/2024.06.04.597340

preprint OA: closed CC-BY-NC-4.0

📄 Open PDF Full text JSON View at publisher

Full text 82,951 characters · extracted from oa-pdf · 14 sections · click to expand

Abstract

29

Background

30 Different cell types and their associated functionalities emerge from a single genomic sequence 31 when certain regions are expressed while others remain silenced. Modeling gene expression and 32 its potential malfunctioning in different cellular contexts is hence pivotal to understand both 33 development and disease. 34 35

Results

36 We present the Chromatin Landscape and Structure to Expression Regressor (CLASTER), an 37 epigenetic-based deep neural network that can integrate different data modalities describing the 38 chromatin landscape and its 3D structure. CLASTER effectively translates them into nascent 39 transcription levels measured by EU-seq at a kilobasepair resolution. Our predictions reached a 40 Pearson correlation with targets above r=0.86 at both bin and gene levels, without relying on 41 DNA sequence nor explicitly extracted chromatin features. The model mostly used the 42 information found within 10 kbp of the predicted locus, even when a wide genomic region of 1 43 Mbp was available. Explicit modeling of long-range interactions using multi-headed attention 44 and high-resolution chromatin contact maps had little impact on model performance, despite the 45 model correctly identifying elements in these inputs influencing nascent transcription. The 46 trained model served then as a platform to predict the transcriptional impact of simulated 47 epigenetic silencing perturbations. 48 49

Conclusions

50 Our results point towards a rather local, integrative and combinatorial paradigm of gene 51 regulation, where changes in the chromatin environment surrounding a gene shape its context-52 specific transcription. We conclude that the predominant locality and limitations of current 53 machine learning approaches might emerge as a genuine signature of genomic organization, 54 having broad implications for future modeling approaches. 55 56

Keywords

57 EU-seq, nascent transcription, chromatin landscape, Micro-C, deep neural networks 58 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 3

Background

59 A single genome gives rise to a wide variety of cells, with diverse morphological properties and 60 functions. These features of the cells emerge when certain regions of the DNA are expressed 61 while others are kept silent, a process known as gene regulation. The rules governing such 62 regulatory processes in different cellular contexts still constitute a highly debated topic, and their 63 understanding is crucial to study development and disease. Many models of gene regulation have 64 already been proposed approaching the problem at different biological layers [1–5]. At a high 65 level, gene expression is modulated by the interplay of a set of regulatory elements, e.g. 66 enhancers and promoters, which have been shown to be broadly compatible in most cases [6,7]. 67 Their activities are usually described using a number of epigenetic marks and can be combined 68 multiplicatively to shape RNA transcription [7–9]. The establishment of functional pairings 69 between these regulatory elements is then conditioned by their contact frequency, given by the 70 three-dimensional structure of chromatin. Models such as the Activity by Contact (ABC) [10] 71 and the more recent ENCODE-rE2G [11] exploit these features. 72 73 Regulatory elements and their associated genes are ultimately encoded in the DNA. Many 74 computational approaches have therefore put a special emphasis on the genomic sequence, 75 identifying patterns and motifs, integrating them and predicting a set of output tracks describing 76 various downstream genomic and epigenomic processes [1–5]. The advent of large language 77 models (LLMs) is also powering a new wave of sequence-based machine learning models such 78 as Nucleotide Transformer [12] , Borzoi [13] , DNABERT [14] , HyenaDNA [15] or Evo [16]. 79 These present improvements in several fronts, e.g., being more robust to genomic variability, 80 extending the studied genomic regions or achieving higher resolutions while bringing down the 81 computational costs. Finally, new advances in the 3D organization field and chromatin 82 conformation capture techniques have brought to light finer structural details (Micro-C) [17,18] 83 and also show promise to improve predictive modeling [19,20]. 84 85 However, this explosion in the genomics analysis toolbox is mostly inherited from the computer 86 vision and natural language processing fields, which pursue different objectives and describe 87 data modalities that attend to highly different sets of rules. This is best illustrated by the fact that 88 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 4 even some of the most advanced machine learning models rely primarily on the close 89 neighborhood of the predicted points, without paying attention to most of the sequence provided 90 [9]. This is of particular interest in the case of attention-based deep learning models, which 91 handle the complete sequences at once and therefore do not impose any spatial constraints. This 92 raises the question: are these machine learning models predominantly local because of current 93 technical limitations and training schemes, or is this a property of the underlying data 94 distribution, i.e., a signature of genomic organization induced by elements such as evolutionary 95 pressure? 96 97 Here we present Chromatin Landscape and Structure to Expression Regressor (CLASTER), a 98 deep convolutional neural network aimed to learn the mapping from a given chromatin landscape 99 and its three-dimensional structure to the corresponding nascent RNA transcription profiles. 100 CLASTER constitutes a cell-type agnostic model that integrates properties of both chromatin 101 state and structure without relying on the genomic sequence. The model can identify patterns 102 between chromatin marks and across the genome at several levels of abstraction and combine 103 them in a non-linear fashion. Of note, the multiplicative nature of enhancer and promoter 104 activities deeply motivates the use of non-linearities, which would not be captured by linear 105 models [1]. Similar principles have already been successfully applied in gene expression 106 classification [21,22] and regression [19], and contact map prediction [23,24]. Motivated by 107 recent findings of broad compatibility between regulatory elements, we identified the core set of 108 elements required to predict the nascent transcription landscape from an epigenetic perspective. 109 The trained models were then used to explore the power of new perturbational approaches on 110 such data to simulate the transcriptional impact of epigenetic remodeling, which has already 111 shown great therapeutic potential [25]. 112 113

Results

114 CLASTER integrated multi-modal chromatin data to predict nascent RNA levels in a 115 sequence-agnostic manner 116 We developed CLASTER, a deep neural network designed using the EIR framework [26] and 117 trained it to translate a given chromatin landscape and its corresponding high-resolution 3D 118 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 5 contact frequency map to the matching target nascent RNA profiles (Figure 1a, Supplementary 119 Figure 1, Supplementary Table 1 ). Nascent transcription levels were obtained for mouse 120 Embryonic Stem Cells (mESCs) from Narita et al. 2021 [27], where they had been measured by 121 RNA labeling with 5-ethynyl uridine and sequencing (EU-seq). CLASTER was trained to predict 122 kilobasepair resolution EU-seq profiles in a wide region, spanning from -200 kbp to 200 kbp 123 from the TSS of a gene of interest that was located at the center of the predicted window (Figure 124 1b, Supplementary Figure 2, Methods ). C hromosomes 17 and 4 were used to obtain our 125 validation and test set regions, respectively (Supplementary Table 2). 126 127 To perform the predictions, the network used two matrices representing the different data 128 modalities, which were processed in separate branches of the model. The first branch processed 129 four complementary genomic tracks that characterized the epigenetic landscape in a chromatin 130 region of 1Mbp at a 100bp resolution. These four core tracks describe chromatin accessibility 131 (ATAC-seq), promoter activity (H3K4me3), promoter and enhancer activity (H3K27ac) and 132 heterochromatin (H3K27me3), respectively. An optional second branch of the model was aimed 133 at processing high resolution structural data. This was given in the form of Micro-C contact 134 frequency maps exploring also 1Mbp of genomic context at a 1.6 kbp resolution. These maps 135 described contact frequencies between different parts of the genomic region explored in each 136 sample and were obtained for the same cell line from [28]. The model then extracted features in 137 both data modalities separately using a set of dilated convolutions and residual connections 138 (Figure 1a, Supplementary Figure 1 ). Optional multi-headed attention layers were then added 139 to the landscape branch to ease the processing of the sequential appearance of such features and 140 long-range interactions between them. The resulting high-level features were then combined and 141 integrated using a set of fully connected layers, and finally a dense layer connected the last 142 hidden layer to each output separately. 143 144 CLASTER displayed a high test prediction accuracy, with Spearman and Pearson correlations 145 between log transformed predicted values and ground truth signal intensities of r s=0.7717 and 146 rp=0.8690 across all predicted bins ( Figure 1c). Test performances increased to r s=0.9275 and 147 rp=0.8859 when comparing the expected nascent transcription levels inside target gene 148 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 6 boundaries (Figure 1d, Methods), where a lower number of near zero values would be expected.149 In summary, CLASTER predicted nascent RNA levels given the corresponding chromatin150 landscape and its matching high- resolution contact frequency map. This was achieved by151 extracting features at different levels of abstraction from both data modalities without relying on152 the DNA sequence. Of note, these predictions comprised most protein coding genes in153 chromosome 4, with no filters on specific ranges of expression levels nor their enhancer -154 dependence status. 155 156 157 6 d. tin by on in - .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 7 Fig 1. CLASTER can translate the chromatin landscape and structure to the 158 corresponding nascent RNA profiles in mESCs. a) Model schema. CLASTER maps the 159 epigenetic landscape of a chromatin region of 1Mbp and its matching contact frequency map to 160 the corresponding nascent expression levels in a window of 401 kbp. b) Targets and predictions 161 around the gene Tmem222 (chr4) for the different models. Prediction ranges were adapted to the 162 context length of each model. c) Density plot showing log-transformed, binwise predictions vs. 163 target values in all test samples, i.e. across protein coding genes in chromosome 4. d) Length 164 normalized nascent transcription levels for the central gene in each predicted sample. 165 Normalized transcription levels were obtained by integrating the area under the EU-seq curve 166 inside the gene boundaries and normalizing by gene length. e) Pearson and Spearman 167 correlations between test predictions and targets at binwise and gene levels for the different 168 models. 169 170 Nascent transcription can be predicted from DNA sequence 171 The enrichment levels of different chromatin marks in a given locus, which define the chromatin 172 states [29], may be interpreted as epigenetic embeddings containing relevant functional 173 information of that genomic region. This makes them somewhat similar in role to the inner 174 sequence representations of DNA-based models like the Enformer [2] or HyenaDNA [15], and 175 therefore we wanted to compare their respective capabilities to predict the nascent RNA 176 landscape described by EU-seq. 177 178 Sequence embeddings of the central 160kbp regions matching our samples were obtained for the 179 Enformer model with pretrained, frozen weights (Supplementary Figure 3, Methods). We then 180 trained a simple head mapping the embeddings to the target EU-seq outputs ( Supplementary 181 Figures 4-5, Methods ). The performance of the model head on Enformer’s pretrained 182 embeddings, i.e., with no fine-tuning nor further training required, achieved Pearson and 183 Spearman correlations of r s=0.8067 and r p=0.8739 for the test predictions ( Figure 1e, 184 Supplementary Figures 6-8 ). However, the target window had to be shortened given 185 Enformer’s narrower context length and cropping of the latent sequence representations (Figure 186 1b, Supplementary Figure 8 ). Sequence embeddings obtained by pretraining the DNA 187 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 8 language models HyenaDNA-160-kbp and 32-kbp in a next nucleotide prediction task could not 188 be used in the same manner to directly predict nascent expression levels in a regression fashion 189 (Methods). The embeddings could however be tailored to our task by fine-tuning the pretrained 190 backbone and the added head together. Predictions using a fine-tuned HyenaDNA-32k model 191 displayed Pearson and Spearman correlation coefficients of rs=0.7514 and rp=0.7403 (Figure 1e, 192 Supplementary Figure 6). 193 In conclusion, DNA-sequence based models trained in a supervised manner like the Enformer 194 provide embeddings encoding valuable functional information, which can be directly probed to 195 predict nascent transcription levels. These results were to be expected since the Enformer was 196 originally trained to predict 1,643 mouse genomic tracks, including RNAPII binding and a 197 number of CAGE experiments that were highly related to our outputs. HyenaDNA, a genomic 198 sequence foundation model, could also be used to predict the nascent transcription landscape. 199 However, sequence embeddings obtained from pretraining HyenaDNA in a next nucleotide 200 prediction task did not encode the information necessary to predict our targets, requiring further 201 fine-tuning of both HyenaDNA backbone and the added model head. 202 203 CLASTER learned the functional roles of chromatin states rather than those of specific 204 marks 205 In silico perturbations of the inputs can allow us to retrieve some of the information learned by 206 the models, while enabling the simulation of arbitrary epigenetic variability. They might be used, 207 for instance, to identify potential transcriptional changes induced by an epigenetic silencing of 208 chromatin, a valuable alternative to genetic knock-out [25]. To this end, we performed virtual 209 perturbational experiments aimed to simulate a loss of the activity of single active enhancers 210 using only the chromatin landscape branch of CLASTER ( Methods). Proximal and distal 211 Enhancer Like Signatures (pELS, dELS), were obtained from [30] (Supplementary Table 3). 212 213 We found that the enrichment levels of different chromatin marks in native chromatin were 214 entangled, i.e., they did not explore all possible values but were only found in a set of allowed 215 combinations encoding chromatin states with diverse functional roles [29] ( Figure 2a ). 216 Perturbing the enhancer mark H3K27ac alone (P1) had a modest impact on the predicted 217 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 9 profiles. An isolated in silico ablation of H3K27ac levels in active enhancer loci yielded still218 active chromatin states with unaltered accessibility and background H3K4me3 levels, unlike219 what would be expected in vivo [27] (Figure 2b, top). The integrative nature of the convolutions220 across tracks in the inputs led the model to map chromatin mark combinations to the221 corresponding RNA landscape instead of assigning independent roles to different input tracks.222 This limited the ability of single mark perturbations to infer biologically meaningful phenot ypes,223 but illustrated that the ratios between marks and their combined contributions rather than isolated224 mark enrichments define the regulatory role of a genomic region [31,32]. 225 226 227 Figure 2. Local silencing of enhancer loci mainly affects the neighboring genes. a)228 Combinations of chromatin mark enrichment values at a given bin sampled from chr 4. Example229 chromatin states and the resulting changes induced by perturbations P1 and P2 are illustrated. b)230 9 till ke ns he ks. es, ed a) le b) .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 10 Predicted output EU-seq profiles from inputs where only the enhancer mark H3K27ac has been 231 ablated at a given enhancer locus (P1) and inputs with silenced chromatin state substitution at the 232 enhancer locus (P2). The ablated enhancer loci, shown as small rectangles, and their 233 corresponding output profiles are depicted using the same color. Shown predictions are centered 234 at the protein coding gene Tmeff1 (ENSMUSG00000028347.14, chr 4). c) Relative change in 235 length normalized area under the EU-seq curve for each gene when silencing an enhancer (P2) at 236 a given distance of the gene’s TSS. Shown are all enhancer-promoter pairs left after 237 preprocessing of test samples. d) Number of cases in which the enhancer ablation affected the 238 most a gene located on either side (adjacent) vs the enhancer skipped a number of genes. Results 239 are shown for P2. 240 241 In silico silencing of enhancer loci induces mainly the downregulation of close genes 242 To better simulate the effects of an epigenetically silenced enhancer, the whole chromatin state at 243 each target enhancer locus was substituted for a heterochromatic-like state (P2), enhancer by 244 enhancer. An arbitrary silenced state was obtained by setting all marks except the silencing 245 H3K27me3 to zero reads ( Methods). The introduced chromatin state substitutions mainly 246 modified the predicted expression profiles around the perturbed point in a distance-dependent 247 fashion, having an influence on a varying number of genes ( Figure 2b, bottom ). Interestingly, 248 the perturbations led to a sustained decrease in EU-seq across the body of the affected domains 249 while keeping the overall structure characterizing the profiles. Clusters of neighboring ELS 250 affected the same region, pointing towards a coordinated action of several enhancers [11] . 251 Furthermore, the degree of downregulation that could be exerted by in silico silencing an active 252 enhancer on any given gene presented a clear inverse relation with the distance separating them, 253 decaying rapidly after a few kilobasepairs ( Figure 2c, Supplementary Table 4 ). From a 254 complementary point of view, the enhancers that had the highest impact on a gene’s nascent 255 transcription levels were in a high fraction adjacent to the gene (ca. 40%), with less and less 256 significant cases skipping increasing numbers of genes (Figure 2d, Supplementary Figure 9). 257 258 To summarize, predicting the EU-seq levels in a genomic region of 401 kbp per sample allowed 259 us to assess the range and degree of influence of a set of tailored in silico perturbations in 260 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 11 enhancer loci. The effects of such perturbations were not propagated towards regions far from 261 the perturbed locus, indicating a predominantly local behavior of the network. CLASTER linked 262 the perturbations in those loci to the adjacent domains with significant EU-seq enrichment, 263 modifying in a distance-dependent manner the transcription profiles and therefore affecting a 264 varying number of genes found around each perturbed locus. 265 266 Most of the provided context was not used for the predictions 267 To disentangle the contributions of all marks, we used the integrated gradients attribution method 268 [33]. This axiomatic attribution mechanism enabled us to quantify the relative importance of the 269 different chromatin marks at all measured loci towards the predicted EU-seq levels at each target 270 locus (Figure 3a). Positive and negative attribution scores indicate direct or inverse associations 271 between input and output enrichment levels, respectively. The resulting maps portrayed the 272 expected relative contributions of all marks at a given distance to an arbitrary predicted point 273 (Methods). 274 275 The promoter mark H3K4me3 was found to be the most important predictor of nascent 276 expression levels when using only the chromatin landscape branch of the model, as it was 277 assigned the highest attribution score (i.e. feature importance) around the predicted point (Figure 278 3b). The averaged attribution scores of the silencing mark H3K27me3 were comparable to those 279 of H3K4me3, while accessibility and the enhancer and promoter mark H3K27ac were assigned 280 lower scores, peaking at around 60% and 40% of that of H3K4me3, respectively. Strikingly, the 281 integrated gradients analysis presented a sharp, power-law decay of the attributions as a function 282 of the distance to the predicted locus. Attribution scores dropped to less than 20% of their peak 283 value after 5 to 10 kbp from the predicted locus, depending on the mark ( Figure 3b). Here, the 284 curves presented an elbow, after which the decay followed at a slower rate. Further than 50kbp 285 away, the attribution scores were almost negligible. Thus, CLASTER relied mainly on a window 286 of 20 kbp around the target locus to perform the predictions, while leaving most of the sequence 287 unattended. This rather local behavior, observed also in the perturbational scenario, contrasts 288 with the fact that 1Mbp of context was available to the network and provided a complementary 289 validation to the results presented in [9]. Of note, a significantly wider field of view than 20kbp 290 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 12 was covered by the dilated convolutions rather early in the network ( Supplementary Table 1),291 and therefore it is improbable that this locality appeared as a limitation of the receptive field of292 the convolutions. 293 294 295 Figure 3. Averaged attributions provide information about chromatin mark significance296 and scale of influence in the prediction task. a) Attribution method schema. Micro- C arrays297 are rotated 45 degrees and corners are cropped to yield rectangular maps (Supplementary298 Figure 8) . Each of the positions in both chromatin arrays and structural arrays are given an299 12 , of ce ys ry an .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 13 attribution score towards the prediction of each single output node. Attribution arrays are 300 centered, stacked and averaged to obtain an attribution map as in d). b) Averaged attributions of 301 each genomic track in the chromatin landscape branch. Attributions are centered at the prediction 302 site and rescaled between zero and the maximum attribution score. c) Chromatin landscape 303 attributions when the model is also provided contact maps as input. d) Absolute attribution map 304 for the chromatin structure branch. When overlaid to a contact map region like the blue or black 305 rectangles in a), it conveys the importance that the model attributes on average to each part of the 306 map when predicting nascent transcription levels at the genomic locus corresponding to the top 307 center of the map (the pointer). Color scale ranges from no attribution, i.e. Score=0, to maximum 308 attribution, with score 1. Map smoothed with gaussian kernel of σ =1. e) Average of signed 309 attribution score maps for the structure branch. Scale ranges from -1 to 1, where 0 corresponds to 310 no contribution. Positive values (red) are assigned to regions that contribute to raise output EU-311 seq levels. Negative values (blue) indicate an inverse relationship between input contact signals 312 and output EU-seq levels. Map smoothed with a gaussian kernel of σ =2 to ease the identification 313 of regions. f) Model performance comparison. The addition of contact information or attention to 314 handle long range interactions did not improve the accuracy of the predictions. 315 316 Micro-C maps provided an added layer of information and interpretability to CLASTER 317 The fact that the model was not attending to potential distal interactions motivated us to add 318 high-resolution chromatin contact maps explicitly as inputs to the model. Convolutional neural 319 networks could in principle be sensitive to various structural patterns in such contact matrices, 320 e.g. point-like interactions between separated elements , chromatin compartments and 321 topologically associated domains (TADs), which would not be explicitly accounted for in more 322 classical approaches [10]. 323 324 The use of contact maps in their usual, symmetrical format, induced in fact the creation of 325 several convolutional filters capturing the aforementioned features ( Supplementary Figure 10, 326 left). We found that the symmetry of the original matrices limited the power of image 327 processing-based approaches when studying Micro-C data, as the nuanced features extracted in 328 early stages of the network were lost in deeper layers, where primarily a broad sense of regional 329 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 14 contact was propagated forward (Supplementary Figure 10, right). In addition, complementing 330 the available information with contact maps induced a decrease in the attributions of the model 331 to the promoter mark ( Figure 3c ), suggesting a certain degree of overlap in the information 332 coming from the two data modalities. To introduce contact information in a more appropriate 333 manner, Micro-C matrices were rotated and cropped, and explored by scrolling the convolutional 334 filters on the sequence axis only ( Figure 3a, Supplementary Figure 11, Methods ). This 335 procedure reduced the redundancy of the information and skipped the contact-poor corner areas 336 of the contact maps. 337 338 Following the same steps as before, we proceeded to evaluate the contributions of every bin in a 339 Micro-C matrix towards the prediction of EU-seq levels at the central position ( Figure 3a). The 340 attribution map obtained from absolute scores brought to light two different behaviors of the 341 network depending on the distance in 1D sequence between points in physical contact ( Figure 342 3d). Attributions of bins connecting loci closer than ca. 30 kbp showed a clear peak around the 343 predicted point, i.e., the main contribution to the nascent transcription levels in the predicted 344 locus came from the inputs describing the local environment and interactions of the predicted 345 locus with its close neighborhood at a TAD scale. On the contrary, the network did not use the 346 local neighborhood of loci located more than 30 kbp upstream or downstream of the prediction 347 point. These results were in line with our previous findings based on chromatin state attributions. 348 Interestingly, the network did now attend to regions of the map linking more distal regions. The 349 network’s attributions for all bins connecting loci found at the same distance in 1D sequence 350 were quite similar from 30 kbp on, yielding a certain positional invariance ( Figure 3d ). The 351 main variation in attributions therefore occurred across different distances and was characterized 352 by a sharp increase at around 30kbp. 353 354 To assess whether a high level of contact in a certain region of the map contributed to raising or 355 lowering the expression levels at the predicted locus, we then computed averages of the maps 356 with signed attributions ( Figure 3e, Supplementary Figures 12-13 ). The neighborhood of the 357 predicted point had once more the highest impact on the output, with a higher number of counts 358 or contact in that region contributing the most to raising expression levels. Of note, positive 359 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 15 attribution scores were assigned to loci in physical contact with the predicted locus, indicating 360 also a positive association between input and output signals. This feature was quite conserved 361 across contact distances from 30kbp onwards and suggested a potential boost of expression in the 362 target profiles when a contact in such a region was to be found ( Figure 3e, blue lines) . These 363

Results

indicated that CLASTER could integrate contact information to that already provided in 364 the chromatin landscape, understanding the impact of features in different regions of the contact 365 maps towards the corresponding expression levels. The network now attended to distal 366 interacting elements, but the main contributor to the predictions was still the local environment 367 around the target locus, which displayed a clearly distinct behavior. 368 369 Explicit modeling of distal interactions did not significantly impact the performance 370 The introduction of multi-headed attention has been shown to improve the prediction 371 performance of DNA-based models [2,13,16]. Unlike convolutions, which process information at 372 different scales hierarchically, the attention mechanism allows the models to process all parts of 373 the sequence simultaneously. Chromatin-based models could also benefit from such a 374 mechanism if the sequential appearance of patterns in the input or the interactions between distal 375 regions were key factors to understand the data. We therefore introduced a Transformer Encoder 376 block with two multiheaded attention layers, aiming to process the sequential appearance of the 377 most abstract features in the chromatin landscape only setting ( Figure 1a, Supplementary 378 Figure 1, Supplementary Table 1) . However, a baseline model employing only convolutions 379 on the chromatin landscape branch alone reached rs=0.7814 and rp=0.8852, performing as well as 380 the one incorporating attention layers, which achieved r s=0.7815 and r p=0.8808. ( Figure 3f ). 381 This was still the case when coupling high resolution structural information from Micro-C maps 382 to CLASTER, with r s=0.7717 and r p=0.8690. Despite the model capturing rich features from 383 structural data and understanding their relation to the output as shown previously, the added 384 contact information did not boost the overall ability of CLASTER to fit the output EU-seq 385 profiles. 386 387 Overall, our findings show that predictions could mostly be performed without relying on 388 interactions between highly distal regulatory elements. The fact that dilated convolutions on 389 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 16 chromatin states provided most of the required information to predict EU-seq profiles suggests 390 that the chromatin landscape to RNA mapping problem might be better framed as a pattern 391 recognition and matching task than a sequence processing one when predicting expression at a 392 coarse, kilobasepair resolution. Finally, the final dense layers of CLASTER could already handle 393 long-range interactions of abstract features, and hence the attention mechanism might need to be 394 introduced earlier in the architecture to process lower-level features. 395 396

Discussion

397 Our results support a rather confined gene regulation scenario. A predominantly local regulatory 398 paradigm would seem plausible from the premises that most enhancers can activate most 399 promoters [6,7] and that both frequency and intensity of the interactions between different 400 genomic elements rapidly decay as a function of their separation in 1D sequence [17] and are 401 further constrained by the 3D compartmentalization of chromatin. An increased body of work 402 from different labs would in fact point in this direction from complementary perspectives. First, 403 recent CRISPR-based studies have found variants in noncoding regions to alter mostly their 404 nearmost genes [11,34–36] . Although predictive models have achieved high precisions in their 405 predictions, the naïve baseline model associating a given genomic variant to the closest gene still 406 provides a high sensitivity, i.e. a remarkably high number of true cases are captured under this 407 assumption [34,37]. Global enhancer ablation assays in a native chromatin environment 408 illustrated biases in the distribution of enhancers in the genome, showing a preference for cell 409 type specific genes to be frequently located close to strong enhancers [6]. Finally, recent high 410 resolution chromatin conformation capture assays proposed microcompartmentalization rather 411 than loop extrusion mechanisms to be driving enhancer-promoter interactions [18], linking 412 diseased phenotypes to the disruption of compartment boundaries [38]. 413 414 Despite these findings, the difficulty of state-of-the-art machine learning approaches to capture 415 the impact of distal enhancer variants in gene expression is still the focus of improvement for the 416 next generation of models, which are rapidly benefiting from the latest advances in NLP and 417 have a strong focus on the exploration of larger architectures [12] and the processing of longer 418 context lengths [13,15,19,20]. These efforts contrast with the fact that early transformer-based 419 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 17 models like the Enformer [2] already achieved remarkable results while relying on reduced 420 genomic windows for the predictions [9]. Such behavior, which both the Enformer and 421 CLASTER display, emerges as a consequence of the overall data distribution characterizing the 422 studied system [6]. In other words, the spatial arrangement and organization of different 423 regulatory elements in the genome does not reward the models enough to capture the expression 424 changes induced by highly distal functional interactions, which become weaker [39] and 425 increasingly rare with distance [9,40]. Despite long range interactions being reported to play 426 crucial roles in specific regulatory processes [41–43], the results presented here support the idea 427 that the information required to perform gene expression prediction is in most cases to be found 428 in the close neighborhood of the predicted locus. Neither the addition of contact information 429 matrices nor the introduction of the multi headed attention mechanism, both aimed to ease the 430 processing of long-range interactions, yielded significant changes in overall performance. These 431 findings, which build on top of previous work [6], have wide implications in the way we 432 understand genomic organization and the extent to which it influences the transcriptional 433 landscape, and invite us to rethink how new computational models can be designed to more 434 suitably exploit such intrinsic properties of the genome. 435 436 We show how in silico perturbations of a given chromatin landscape can provide an added layer 437 of explainability to the model’s predictions. These perturbative approaches exploit the rules in 438 the input that the models have learned to explore the effects of unseen epigenetic variability but 439 come with a number of challenges. On the one hand, perturbations of isolated input tracks can 440 fail to mimic the expected biological behavior given the correlations established between the 441 different marks, which define the chromatin states [29]. Chromatin-mark based models aimed to 442 perform new synthetic/in-silico perturbations should either aim to replicate a complete desired 443 chromatin state in a given locus of interest or successfully disentangle the roles of different 444 marks. Sequence-based models might be less prone to display such correlational effects given 445 that the one-hot encoding of nucleotides makes in-silico SNPs, insertions and deletions in-446 distribution alternative inputs for a given locus. However, the same type of correlational 447 dynamics and co-appearance patterns might be learned between different parts of the sequence. 448 In fact, Enformer has been shown to incorrectly predict the direction of the effects of certain 449 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 18 genomic variants, in particular for non-monogenic traits that would imply the combined action of 450 several loci [44,45]. These dynamics could constitute a hint towards the observation of a similar 451 correlational learning framework as the one we reported in our first perturbational simulations, in 452 this case across the sequence. 453 454 Left to be explored are better ways to couple high resolution contact matrices with chromatin 455 mark profiles, with graph attention networks already showing great promise [19]. Nevertheless, 456 the high sequencing depth requirement on higher resolution chromatin conformation maps and 457 the related costs compromise scalability for machine-learning tasks. Our findings would motivate 458 the exploration of region capture techniques to map shorter regions at a more fine-grained level 459 [18]. Further work should aim to move from a mostly correlational learning framework to a more 460 causal one, which is of utmost importance to determine the directionality of the association 461 between transcription and the presence of certain chromatin marks. Training chromatin-based 462 models in a GPT [46] or BERT [47] inspired masking fashion would perhaps provide the 463 opportunity to do so, disentangling the roles of different epigenetic marks while studying 464 sequences from varied cell type backgrounds. Finally, transcription factors and their binding 465 profiles are a key missing piece of our models, with their addition as extra input tracks being the 466 next logical step. 467 468

Conclusions

469 We present CLASTER, a deep convolutional neural network that can be used to build an implicit 470 gene regulation model, agnostic to DNA sequence and without relying on the manual extraction 471 of a number of curated features. CLASTER is capable of understanding the relations established 472 between different epigenetic marks describing a given chromatin state, how these states are 473 combined to create a chromatin landscape, and how a given landscape and its matching three-474 dimensional arrangement can be translated into nascent transcription levels. Given its 475 implementation using the EIR framework, CLASTER can easily be extended to new datasets, 476 data modalities and tasks. Therefore, we expect our work to serve as an inspiration and an easy 477 entry-point to epigenomics modeling for a broad range of researchers with diverse computational 478 resources and scientific backgrounds. 479 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 19 480

Methods

481 Processing of chromatin mark array data 482 Normalized enrichment levels for ATAC-seq, H3K4me3, H3K27ac and H3K27me3 in mESCs 483 were obtained at a 100bp resolution from publicly available data in the NCBI GEO under 484 accession numbers GSE146328 and GSE186349 . Each sample consisted of an array of shape 485 (Nchrom marks , Nbins) = (4, 10001) where each row was a different mark and each column a 100bp 486 bin with the mark’s averaged enrichment level. Samples were centered at the TSS of a gene of 487 interest. Note that a given gene could appear in different samples but at different locations, if it 488 was close enough to the central predicted gene. A detailed description of the genomic regions 489 constituting our samples and matching targets can be found in Supplementary Table 2, and a 490 guide through data obtention and preprocessing steps can be found at the Github repository of 491 the project https://github.com/RasmussenLab/CLASTER. 492 493 Processing of nascent RNA array data 494 Publicly available nascent transcription profiles matching the input samples were obtained from 495 GSE146328 [27]. Since our aim was to estimate integrated nascent gene expression levels and to 496 study the propagation and impact of in silico perturbations in as many genes as possible per 497 sample, we prioritized the prediction of a wide genomic region over output resolution. 1Mbp 498 regions were originally obtained at a resolution of 20 bp. A gaussian kernel of sigma = 50 was 499 then applied to smooth the signals before averaging them in bins of 1 kbp ( Supplementary 500 Figure 2). This procedure reduced drastically the number of targets while preserving the main 501 source of variation at the kbp scale, speeding up the computations of input-output attribution 502 scores. The resulting profiles were symmetrically cropped to allow all predicted genes to have 503 access to a wide chromatin range. Each target sample was finally chosen to be a region of 401 504 kbp centered at the TSS of the gene of interest. Target profiles were further cropped to fit the 505 target windows of the Enformer, Hyenadna-160k and Hyenadna-32k, detailed below. 506 507 Processing of micro-C contact maps 508 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 20 Micro-C matrices describing chromatin contact frequencies in mESCs were obtained at a 509 resolution of 1600bp per bin from NCBI GEO under accession number GSE130275 [28]. Lower 510 resolution mcool representations started to lose the nuances in structural detail provided by 511 MicroC, but higher resolution matrices became too sparse. Bin values had a range of 7 orders of 512 magnitude after balancing, with a minimum of 6,807 · 10-7 and a maximum of 2,001 in our data. 513 NaN values due to balancing and 0 count bins were then set to 1 ·10-14 → log 10=-14. This 514 arbitrarily small value was chosen to be the same distance apart from the minimum actual 515 contact value as the minimum to maximum contact range. Imputing by the minimum 10 -7 could 516 be understood as an averaged homogeneous background contact, which we tried to avoid by 517 giving those bins a clearly distinct behavior. Micro-C matrices were originally treated as 2D 518 arrays of shape (625, 625) containing log 10 transformed bin counts. A list of regions for which 519 Micro-C matching maps could not be obtained is found in Supplementary Table 5. 520 521 Rotations and cropping of Micro-C maps 522 Contact matrices displayed two properties: 1) they were symmetrical and 2) the number of reads 523 decreased when exploring regions far from the diagonal. Treating them as images, i.e. allowing 524 for the convolutions to scroll in both x and y directions, led to an under usage of the model’s 525 capabilities given the redundancy found in both triangles and the lack of information provided by 526 the regions close to the corners far from the diagonal. In order to provide the contact information 527 in a more appropriate manner for a model such as CLASTER, Micro-C matrices were rotated 45 528 degrees and cropped. This procedure allowed us to 1) reduce the number of inputs, accelerating 529 the computations and 2) define deeper 1D kernels exploring all distances at once. These kernels 530 were then only allowed to scroll sideways, in the original diagonal direction. Given the 531 symmetry of the matrices, we kept only the upper triangle. The rotated triangle was cropped to 532 ca. 0.207 times the original length to create a rectangular matrix of shape (129, 626). Distances 533 and scales in the new matrix are illustrated in Supplementary Figure 11 and can be expressed 534 as follows: 535 Expressing a and a/gFαFv in terms of L , we have: 536 2a /gααB4 L /gαv∂v √ L /g2870 /gααB4 L /g2870 /gαv∂v √ 2 L 537 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 21 We get a /gαv∂v /gvu4´ √ /g2870 /g2879/g2869 /g2870 /gvu4α L /gFlv∂ 0.2071067 812 L and a/gFαFv /gαv∂v a/ √ 2 . After rotating and cropping the 538 matrices, we have 626 bins for a smaller window of observation. The width of the new window 539 was then L/ √ 2 /gFlv∂ 0.70710678118 L where L /gαv∂v 1 Mb p . Linear bin densities in the original and 540 transformed Micro-C matrices were: 541 542 λ /g2925/g2928/g2919/g2917/g2919/g2924/g2911/g2922 /gαv∂v /g2874/g2870/g2873 /g2912/g2919/g2924/g2929 /g2869/g2868/g2868/g2868 /g2921/g2912/g2926 /gαv∂v 0.625 bi n/k bp , λ /g2930/g2928/g2911/g2924/g2929/g2916/g2925/g2928/g2923/g2915/g2914 /gαv∂v /g2874/g2870/g2874 /g2912/g2919/g2924/g2929 /g2875/g2868/g2875./g2869/g2868/g2874 /g2921/g2912/g2926 /gFlv∂ 0.8853 b in/k b p 543 544 The interaction distance d /g2869/g2888 , i.e. the real separation in 1D sequence between two interacting 545 elements, was implicitly encoded in the new y coordinate y /g2924/g2915/g2933 , which was the distance in the 546 perpendicular direction to the diagonal. A point at a distance y /g2924/g2915/g2933 from the diagonal, found at a 547 column bin y /g2912/g2919/g2924/g2929 in the rotated and cropped matrices, connected the points in the diagonal at 548 /gααB8x /g2912/g2919/g2924/g2929 /gαv∂v /gααB8|y /g2912/g2919/g2924/g2929 | and x /g2912/g2919/g2924/g2929 /gαv∂v| y /g2912/g2919/g2924/g2929 | given that the scale was the same in both axes. This 549 distance can be converted from new bins to kbp in 1D sequence using the diagonal bin density 550 found before as d /g2869/g2888 /gαv∂v /g2870/g1668/g2935 /g3160/g3167/g3172/g3177 /g2971 /g3178/g3176/g3159/g3172/g3177/g3164/g3173/g3176/g3171/g3163/g3162 . 551 552

Reference

genome, gene annotations and enhancer annotations 553 The mouse reference genome and protein coding gene annotations were downloaded from the 554 GENCODE (mouse; GRCm38 release 25) [48]. Proximal and distal Enhancer-Like Signatures 555 (pELS, dELS) in mice were obtained from Supplementary Table 11 in [30]. 556 557 In silico perturbations 558 In the first perturbational scenario (P1), the loss of enhancer activity was simulated by setting to 559 zero reads the levels of the enhancer mark H3K27ac, for all enhancers i /gαv∂v 1 , . . . , N /g2915/g2924/g2918 inside the 560 target window, enhancer by enhancer. In the second perturbational scenario (P2), a 561 heterochromatin state substitution was performed for all bins inside the enhancer boundaries by 562 setting the levels of all marks to 0 reads except from H3K27me3. This arbitrary silenced-like 563 state was aimed to remove the contributions of accessibility, background promoter and enhancer 564 activities while keeping a plausible level of H3K27me3 in the corresponding chromatin 565 environment. We then measured the relative change in averaged nascent transcription that the 566 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 22

Reference

gene in each sample experienced when ablating every single active enhancer falling 567 inside the predicted region of the genome. The relative change between baseline and perturbed 568 input predictions was computed as: 569 570 Δ /g2928/g2915/g2922 /g2919/g2920 /gαv∂v | A /g2920 /g3160/g3159/g3177/g3163/g3170/g3167/g3172/g3163 /gααB8 A /g2920 /g3167 /g3174/g3163/g3176/g3178/g3179/g3176/g3160/g3163/g3162 | A /g2920 /g3160/g3159/g3177/g3163/g3170/g3167/g3172/g3163 571 Here, A /g2920 /g3160/g3159/g3177/g3163/g3170/g3167/g3172/g3163 and A /g2920 /g3167 /g3174/g3163/g3176/g3178/g3179/g3176/g3160/g3163/g3162 described the area under the EU-seq curves inside the boundaries 572 of reference gene j normalized by that gene’s length, when predicting the output with the control 573 chromatin landscape and the output from the same chromatin landscape with the i /g2930/g2918 enhancer 574 ablated. To integrate the area under reference genes shorter than 1 kbp or spanning more than the 575 prediction window’s boundaries we used widths of 1 bin and half of the prediction window’s 576 length, since the TSS was located in the center. Only active enhancers were kept for the analyses, 577 where an active enhancer was defined as a pELS or dELS region with an average enrichment of 578 H3K27ac above a predefined threshold, set to be 10 reads. Enhancer gene pairs for reference 579 genes with zero reads were discarded from the analyses. Since the background distribution of 580 number of genes between enhancers and target genes was not uniform, histograms of the most 581 significant enhancer-gene pairs per sample were shown unnormalized (in the main manuscript) 582 and normalized by the background distribution of pairs present in the test dataset (in the 583 supplementary, Supplementary Figure 9). 584 585 Scoring the contributions of each relative input location to the output 586 To score the contributions of a given input position towards the expression levels in a target 587 locus, we used the widespread integrated gradients axiomatic attribution method [33] as 588 implemented in the Captum python library. Integrated gradients were computed as the path 589 integral of the gradients along the straight line path connecting a predefined baseline signal x ’ 590 and a given input x . The baseline background signal was obtained by averaging 256 samples in 591 each modality. The authors define the integrated gradient along the i /g2930/g2918 dimension as: 592 593 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 23 Int egr a t edGr a ds /g2919 /gvuuux/gvuu4 /gFl´4 : /gαv∂v /gvuuux /g2919 /gααB8x /g2919 /gFαFv /gvuu4 /gαv∂∂ /gαl∂l ∂F/gvuuux´ /gααB4 α /gαv∂∂ /gvuuux /gααB8 x/gFαFv/gvuu4/gvuu4 ∂x /g2919 dα /g2869 /g2961/g2880/g2868 594 As stated before, the chromatin landscape inputs were described by a matrix of shape (N chrom 595 marks, Ninput bins) = (4, 10001) exploring 1Mbp in bins of 100bp. Chromatin structure was given by 596 a matrix of shape (129, 626) describing the same region at a resolution of 1.6 kbp. EU-seq targets 597 were described by a vector of length N output bins = 401 exploring from -200kbp to 200kbp relative 598 to the TSS of a gene of interest in bins of 1 kbp. Signed and absolute attribution scores were then 599 obtained for every bin position in the input arrays towards the prediction of every target position 600 (Supplementary Figure 12). To obtain averages over predicted points, attribution maps for each 601 target node were then rolled in steps of 10 bins = 1kbp to locate the predicted target to the center 602 of the map. High frequency noise arising from the translations of the maps was removed by 603 adding a small uncertainty in the number of bins to roll for the chromatin landscape, adding eps 604 number of bins with -10 bins < eps < +10 bins. A similar smoothing effect was achieved for 605 structural attributions by applying a gaussian filter ( Supplementary Figure 13). The maps were 606 then symmetrically cropped to keep only the relative positions to the predicted point that had 607 been measured in all predictions, with measurements centered at -200 kbp and 200 kbp setting 608 the limits (Supplementary Figure 9). 609 610 CLASTER 611 The architecture of CLASTER is illustrated in Supplementary Figure 1 and further detailed in 612 Supplementary Table 1 . In order to translate the input to the corresponding output, features 613 were extracted separately from both branches of the model, namely the chromatin landscape 614 branch and the chromatin structure branch but using the same principles. These include a set of 615 layers implementing dilated convolutions, residual connections and optional attention heads. 616 Input dimensionality reduction was achieved by scrolling the convolutional kernels in strides of 617 2. The extracted features at different scales were integrated at a later stage in the network, the 618 fusion module. Further dimensionality reduction and mapping to the output nascent RNA levels 619 was performed with a final set of densely connected layers (MLP). The MLP ended in a hidden 620 vector representation, and finally each node was linearly connected to every single prediction 621 point, i.e 401 bins at 1kbp resolution. A final non-trainable ReLU was applied to cut negative 622 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 24 values. Configuration files to build and run the models using the EIR framework [26] in both 623 training and test settings are available at https://github.com/RasmussenLab/CLASTER. 624 625 Training 626 The network was trained in batches of 64 samples and a learning rate of 1e-4. We used the 627 AdamW optimizer with default parameters [49]. Regularization measures included the use of 628 high dropout values in the MLP layers, setting a given stochastic depth (probability to remove 629 residual block) and the use of SmoothL1 as the loss function. Flipped samples were also added 630 to the training set for data augmentation purposes but were not used for the test set analyses. 631 Chr17 samples were used as a hold out validation, and Chr4 was used as a hold out test. A 632 detailed description of the genes of interest that were used to create our samples can be found in 633 Supplementary Table 2. Samples for which we could not obtain Micro-C matching maps were 634 removed from the training, validation or test sets in all settings, and can be found in 635 Supplementary Table 4. 636 637 Benchmarks 638 Sequence embeddings corresponding to the central 160 kbp of the genomic regions of study were 639 obtained from the backbone of HyenaDNA-160k [15] with pretrained, frozen weights. Code to 640 build HyenaDNA-160k was obtained from the publicly accessible google colab [50] and 641 pretrained weights were obtained from [51]. The input sequences when using HyenaDNA-642 160kbp ranged in this case from -80kbp to 80kbp after the TSS of the central gene in the 643 observation window. We then trained a simple model or head on top of the frozen embeddings to 644 predict our EU-seq signals at the same resolution as we did with CLASTER ( Supplementary 645 Figures 3-4). The head performed an average pooling over the sequence axis to reduce sequence 646 length at a 128 bp resolution. To ease the predictions, a first set of convolutions of kernel 9 and 647 dilation 2 were applied to the resulting embeddings. This enabled the head to identify patterns at 648 a scale of ca. 2 kbp. These patterns were then mapped to the targets using a dense layer. 649 650 We tried to perform the regression of the target EU-Seq profiles using the pretrained embeddings 651 from the HyenaDNA-160k, but we encountered a number of challenges. First, in the original 652 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 25 manuscript the authors performed mainly classification tasks at a sequence level, but 653 simultaneous regressions of multiple outputs were not explored. Second, HyenaDNA was 654 pretrained on human and not mouse genomic sequences, although we would not expect this to be 655 a core issue. We believe that the pretrained embeddings obtained by next token prediction did 656 not encode enough functional information to perform the EU-Seq predictions, since further fine-657 tunning of the model was required for our downstream task. This was importantly not the case 658 with Enformer-based embeddings. This fine-tuning requirement might not be unique to 659 HyenaDNA. Finally, Hyena’s design was aimed to keep nucleotide resolution, but this would not 660 be necessary to predict averages of read counts at a coarse, kbp resolution. Despite the low 661 parameter count and high efficiency of the Hyena operator, which reduced drastically the number 662 of trainable parameters, the network’s design yielded a high number of inner sequence 663 representations, the activations at each filter in each layer, that kept sequence length (160,000 664 columns) while adding an embedding dimension (256 rows, e.g.). Each single sequence 665 representation as a matrix of float32 values required: 666 667 160 .000 /gαv∂∂ 256 val u e s /gαv∂∂ 32 bits 1 val u e /gαv∂∂ 1 b yt e 8 bi ts /gαv∂∂ 1 Gb 1024 /g2871 byte s /gFlαα0 .1526 G B 668 with many of them being stored already in the forward pass. This imposed a significant 669 challenge, rapidly increasing memory and computing time requirements, which constrained us to 670 fine-tune Hyena-160k in batches of only 2 sequences in an NVIDIA A100 GPU with 80GB of 671 RAM. This motivated us to use the smaller Hyena-32k model, for which we could load the 672 pretrained weights and fine-tune the pretrained backbone together with the added head in batches 673 of 16 samples in a reasonable period of time. Targets were cropped to match the new context 674 length to the range -15kbp to +15kbp. 675 676 The same procedure was repeated for the Enformer architecture [2]. Sequence embeddings 677 corresponding to the central 160kbp of the genomic regions of study were obtained from the 678 backbone of Enformer [2] with pretrained, frozen weights. Code to build the Enformer in 679 Pytorch and matching pretrained weights were obtained from [52]. To match the previous 680 scenario when using the Enformer, we used the same 160kbp sequences as in Hyena and 681 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 26 symmetrically prepended and appended N nucleotides as proposed in [9], which acted as masks, 682 to provide the same available information in both settings. The embeddings of the Enformer, of 683 shape (1, 896, 3072) corresponded to 896 /gαv∂∂ 128 bp /gαv∂v 114688 bp /gFlαα 115 k bp . The matching 684 targets were therefore cropped to keep the range between -57kbp to +57kbp from the central 685 gene’s TSS to match the Enformer’s prediction window. 686 687 Availability of data and materials 688 The raw data used in this project was obtained from publicly available sources [27,28,32]. The 689 original bigwig files for chromatin mark enrichment, ATAC-seq and EU-seq can be found at 690 NCBI’s GEO under accession numbers GSE146328 and GSE186349 . Mcool files storing high 691 resolution chromatin contact frequency maps can be found under the accession number 692 GSE130275. Processed files, input and target arrays and configuration files to build CLASTER 693 and reproduce our analyses can be obtained following the python notebooks found at the 694 project’s Github repository https://github.com/RasmussenLab/CLASTER. 695 All code is available at the github repository of the project. EIR can be found at 696 https://github.com/arnor-sigurdsson/EIR. The code of the project was organized in a number of 697 python notebooks covering 0) a tutorial, 1) how to download the raw data from NCBI and 698 preprocess it to create inputs and targets for EIR-based models, 2) how to create, train and test 699 CLASTER using EIR, 2b) how to add custom heads on top of Enformer or HyenaDNA 700 embeddings and fine-tune HyenaDNA-32k with an added custom regression head and finally 3) 701 how to perform the data analysis and create the figures presented in this manuscript. 702 703 Acknowledgments 704 We would like to thank Nils Krietenstein for his valuable input and feedback on the project. We 705 would also like to acknowledge the members of the Rasmussen lab, in particular Justus Gräf and 706 Henry Webel, for our fruitful discussions. We would finally like to thank the Novo Nordisk 707 Foundation Center for Protein Research, the N ovo Nordisk Foundation Center for Basic 708 Metabolic Research and the Novo Nordisk Foundation Center for Genomic Mechanisms of 709 Disease for their support. 710 711 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 27 Competing interests 712 S.R. is the founder and owner of BioAI. The remaining authors declare no conflicts of interest. 713 714 Funding 715 M.P.A was funded by the Novo Nordisk Foundation Copenhagen Bioscience Ph.D. Program 716 grant No. NNF22SA0078231. M.P.A, A.S., T.N., N.K, C.C, and S.R. were supported by the 717 Novo Nordisk Foundation (NNF14CC0001). M.P.A., A.S. and S.R. were supported by the Novo 718 Nordisk Foundation (NNF23SA0084103 and NNF21SA0072102). C.C. was supported by the 719 Novo Nordisk Fo undation (nos. NNF20OC0065482 and NNF22OC0074677). The Novo 720 Nordisk Foundation Center for Protein Research is financially supported by the N ovo Nordisk 721 Foundation (no. NNF14CC0001). 722 723 Author Contributions 724 S.R designed and supervised the project and wrote the manuscript. C.C designed and supervised 725 the project. M.P.A wrote the code, carried out the analyses, interpreted the results, and wrote the 726 manuscript. A.S wrote the code and assisted in the analyses. T.N acquired and preprocessed the 727 data and assisted in the analyses. All authors reviewed and edited the final manuscript. 728 729 Supplementary information 730 Additional file 1. Supplementary Figures 1 to 13. 731 Additional file 2. Supplementary Tables 1 to 5. 732 733

References

734 1. Zhou J, Theesfeld CL, Yao K, Chen KM, Wong AK, Troyanskaya OG. Deep learning 735 sequence-based ab initio prediction of variant effects on expression and disease risk. Nat Genet. 736 2018;50:1171–9. 737 2. Avsec Ž, Agarwal V, Visentin D, Ledsam JR, Grabska-Barwinska A, Taylor KR, et al. 738 Effective gene expression prediction from sequence by integrating long-range interactions. Nat 739 Methods. 2021;18:1196–203. 740 3. Avsec Ž, Weilert M, Shrikumar A, Krueger S, Alexandari A, Dalal K, et al. Base-resolution 741 models of transcription-factor binding reveal soft motif syntax. Nat Genet. 2021;53:354–66. 742 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 28 4. Kelley DR, Reshef YA, Bileschi M, Belanger D, McLean CY, Snoek J. Sequential regulatory 743 activity prediction across chromosomes with convolutional neural networks. Genome Res. 744 2018;28:739–50. 745 5. Chen KM, Wong AK, Troyanskaya OG, Zhou J. A sequence-based global map of regulatory 746 activity for deciphering human genetics. Nat Genet. 2022;54:940–9. 747 6. Narita T, Higashijima Y, Kilic S, Maskey E, Neumann K, Choudhary C. The logic of native 748 enhancer-promoter compatibility and cell-type-specific gene expression variation [Internet]. 749 bioRxiv. 2022 [cited 2023 Sep 1]. p. 2022.07.18.500456. Available from: 750 https://www.biorxiv.org/content/biorxiv/early/2022/07/19/2022.07.18.500456 751 7. Bergman DT, Jones TR, Liu V, Ray J, Jagoda E, Siraj L, et al. Compatibility rules of human 752 enhancer and promoter sequences. Nature. 2022;607:176–84. 753 8. Hong CKY, Cohen BA. Genomic environments scale the activities of diverse core promoters. 754 Genome Res. 2022;32:85–96. 755 9. Karollus A, Mauermeier T, Gagneur J. Current sequence-based models capture gene 756 expression determinants in promoters but mostly ignore distal enhancers. Genome Biol. 757 2023;24:56. 758 10. Fulco CP, Nasser J, Jones TR, Munson G, Bergman DT, Subramanian V, et al. Activity-by-759 contact model of enhancer–promoter regulation from thousands of CRISPR perturbations. Nat 760 Genet. 2019;51:1664–9. 761 11. Gschwind AR, Mualim KS, Karbalayghareh A, Sheth MU, Dey KK, Jagoda E, et al. An 762 encyclopedia of enhancer-gene regulatory interactions in the human genome [Internet]. bioRxiv. 763 2023 [cited 2023 Nov 21]. p. 2023.11.09.563812. Available from: 764 https://www.biorxiv.org/content/biorxiv/early/2023/11/13/2023.11.09.563812 765 12. Dalla-Torre H, Gonzalez L, Mendoza-Revilla J, Carranza NL, Grzywaczewski AH, Oteri F, 766 et al. The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for 767 Human Genomics [Internet]. bioRxiv. 2023 [cited 2023 Nov 2]. p. 2023.01.11.523679. Available 768 from: https://www.biorxiv.org/content/biorxiv/early/2023/09/19/2023.01.11.523679 769 13. Linder J, Srivastava D, Yuan H, Agarwal V, Kelley DR. Predicting RNA-seq coverage from 770 DNA sequence as a unifying model of gene regulation [Internet]. bioRxiv. 2023 [cited 2023 Nov 771 2]. p. 2023.08.30.555582. Available from: 772 https://www.biorxiv.org/content/biorxiv/early/2023/09/01/2023.08.30.555582 773 14. Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder 774 Representations from Transformers model for DNA-language in genome. Bioinformatics. 775 2021;37:2112–20. 776 15. Nguyen E, Poli M, Faizi M, Thomas A, Birch-Sykes C, Wornow M, et al. HyenaDNA: 777 Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution [Internet]. arXiv 778 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 29 [cs.LG]. 2023. Available from: http://arxiv.org/abs/2306.15794 779 16. Nguyen E, Poli M, Durrant MG, Thomas AW, Kang B, Sullivan J, et al. Sequence modeling 780 and design from molecular to genome scale with Evo [Internet]. bioRxiv. 2024 [cited 2024 Mar 781 11]. p. 2024.02.27.582234. Available from: 782 https://www.biorxiv.org/content/10.1101/2024.02.27.582234v2 783 17. Krietenstein N, Abraham S, Venev SV, Abdennur N, Gibcus J, Hsieh T-HS, et al. 784 Ultrastructural details of mammalian chromosome architecture. Mol Cell. 2020;78:554–65.e7. 785 18. Goel VY, Huseyin MK, Hansen AS. Region Capture Micro-C reveals coalescence of 786 enhancers and promoters into nested microcompartments. Nat Genet. 2023;55:1048–56. 787 19. Karbalayghareh A, Sahin M, Leslie CS. Chromatin interaction-aware gene regulatory 788 modeling with graph attention networks. Genome Res. 2022;32:930–44. 789 20. Bigness J, Loinaz X, Patel S, Larschan E, Singh R. Integrating Long-Range Regulatory 790 Interactions to Predict Gene Expression Using Graph Convolutional Networks. J Comput Biol. 791 2022;29:409–24. 792 21. Singh R, Lanchantin J, Robins G, Qi Y. DeepChrome: deep-learning for predicting gene 793 expression from histone modifications. Bioinformatics. 2016;32:i639–48. 794 22. Chen Y, Xie M, Wen J. Predicting gene expression from histone modifications with self-795 attention based neural networks and transfer learning. Front Genet. 2022;13:1081842. 796 23. Yang R, Das A, Gao VR, Karbalayghareh A, Noble WS, Bilmes JA, et al. Epiphany: 797 predicting Hi-C contact maps from 1D epigenomic signals. Genome Biol. 2023;24:134. 798 24. Feng F, Yao Y, Wang XQD, Zhang X, Liu J. Connecting high-resolution 3D chromatin 799 organization with epigenomics. Nat Commun. 2022;13:2054. 800 25. Cappelluti MA, Mollica Poeta V, Valsoni S, Quarato P, Merlin S, Merelli I, et al. Durable 801 and efficient gene silencing in vivo by hit-and-run epigenome editing. Nature [Internet]. 2024; 802 Available from: http://dx.doi.org/10.1038/s41586-024-07087-8 803 26. Sigurdsson AI, Louloudis I, Banasik K, Westergaard D, Winther O, Lund O, et al. Deep 804 integrative models for large-scale human genomics. Nucleic Acids Res. 2023;51:e67. 805 27. Narita T, Ito S, Higashijima Y, Chu WK, Neumann K, Walter J, et al. Enhancers are 806 activated by p300/CBP activity-dependent PIC assembly, RNAPII recruitment, and pause 807 release. Mol Cell. 2021;81:2166–82.e6. 808 28. Hsieh T-HS, Cattoglio C, Slobodyanyuk E, Hansen AS, Rando OJ, Tjian R, et al. Resolving 809 the 3D Landscape of Transcription-Linked Mammalian Chromatin Folding. Mol Cell. 810 2020;78:539–53.e8. 811 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 30 29. Ernst J, Kellis M. Chromatin-state discovery and genome annotation with ChromHMM. Nat 812 Protoc. 2017;12:2478–92. 813 30. ENCODE Project Consortium, Moore JE, Purcaro MJ, Pratt HE, Epstein CB, Shoresh N, et 814 al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature. 815 2020;583:699–710. 816 31. Huang C, Helin K. Catching active enhancers via H2B N-terminal acetylation. Nat. Genet. 817 2023. p. 525–6. 818 32. Narita T, Higashijima Y, Kilic S, Liebner T, Walter J, Choudhary C. Acetylation of histone 819 H2B marks active enhancers and predicts CBP/p300 target genes. Nat Genet. 2023;55:679–92. 820 33. Sundararajan M, Taly A, Yan Q. Axiomatic Attribution for Deep Networks [Internet]. arXiv 821 [cs.LG]. 2017. Available from: http://arxiv.org/abs/1703.01365 822 34. Nasser J, Bergman DT, Fulco CP, Guckelberger P, Doughty BR, Patwardhan TA, et al. 823 Genome-wide enhancer maps link risk variants to disease genes. Nature. 2021;593:238–43. 824 35. Morris JA, Caragine C, Daniloski Z, Domingo J, Barry T, Lu L, et al. Discovery of target 825 genes and pathways at GWAS loci by pooled single-cell CRISPR screens. Science. 826 2023;380:eadh7699. 827 36. Gasperini M, Hill AJ, McFaline-Figueroa JL, Martin B, Kim S, Zhang MD, et al. A Genome-828 wide Framework for Mapping Gene Regulation via Cellular Genetic Screens. Cell. 829 2019;176:1516. 830 37. Mountjoy E, Schmidt EM, Carmona M, Schwartzentruber J, Peat G, Miranda A, et al. An 831 open approach to systematically prioritize causal variants and genes at all published human 832 GWAS trait-associated loci. Nat Genet. 2021;53:1527–33. 833 38. Kim KL, Rahme GJ, Goel VY, El Farran CA, Hansen AS, Bernstein BE. Dissection of a 834 CTCF topological boundary uncovers principles of enhancer-oncogene regulation. Mol Cell 835 [Internet]. 2024; Available from: http://dx.doi.org/10.1016/j.molcel.2024.02.007 836 39. Zuin J, Roth G, Zhan Y, Cramard J, Redolfi J, Piskadlo E, et al. Nonlinear control of 837 transcription through enhancer-promoter interactions. Nature. 2022;604:571–7. 838 40. Michaud EJ, Liu Z, Girit U, Tegmark M. The Quantization Model of Neural Scaling. Thirty-839 seventh Conference on Neural Information Processing Systems [Internet]. 2023. Available from: 840 https://openreview.net/forum?id=3tbTw2ga8K 841 41. Lettice LA, Heaney SJH, Purdie LA, Li L, de Beer P, Oostra BA, et al. A long-range Shh 842 enhancer regulates expression in the developing limb and fin and is associated with preaxial 843 polydactyly. Hum Mol Genet. 2003;12:1725–35. 844 42. Uslu VV, Petretich M, Ruf S, Langenfeld K, Fonseca NA, Marioni JC, et al. Long-range 845 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint 31 enhancers regulating Myc expression are required for normal facial morphogenesis. Nat Genet. 846 2014;46:753–8. 847 43. Claussnitzer M, Dankel SN, Kim K-H, Quon G, Meuleman W, Haugen C, et al. FTO Obesity 848 Variant Circuitry and Adipocyte Browning in Humans. N Engl J Med. 2015;373:895–907. 849 44. Sasse A, Ng B, Spiro AE, Tasaki S, Bennett DA, Gaiteri C, et al. Benchmarking of deep 850 neural networks for predicting personal gene expression from DNA sequence highlights 851 shortcomings. bioRxiv [Internet]. 2023; Available from: 852 http://dx.doi.org/10.1101/2023.03.16.532969 853 45. Huang C, Shuai RW, Baokar P, Chung R, Rastogi R, Kathail P, et al. Personal transcriptome 854 variation is poorly explained by current genomic deep learning models. Nat Genet. 855 2023;55:2056–9. 856 46. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language Models are 857 Unsupervised Multitask Learners. OpenAI [Internet]. 2019; Available from: 858 https://dcmpx.remotevs.com/net/cloudfront/d4mucfpksywv/SL/better-language-859 models/language_models_are_unsupervised_multitask_learners.pdf 860 47. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional 861 Transformers for Language Understanding [Internet]. arXiv [cs.CL]. 2018. Available from: 862 http://arxiv.org/abs/1810.04805 863 48. Frankish A, Diekhans M, Jungreis I, Lagarde J, Loveland JE, Mudge JM, et al. GENCODE 864 2021. Nucleic Acids Res. 2021;49:D916–23. 865 49. Loshchilov I, Hutter F. Decoupled Weight Decay Regularization. International Conference 866 on Learning Representations [Internet]. 2019. Available from: 867 https://openreview.net/forum?id=Bkg6RiCqY7 868 50. Google colab [Internet]. [cited 2024 May 15]. Available from: 869 https://colab.research.google.com/drive/1wyVEQd4R3HYLTUOXEEQmp_I8aNC_aLhL 870 51. LongSafari [Internet]. [cited 2024 May 15]. Available from: 871 https://huggingface.co/LongSafari 872 52. Wang P. enformer-pytorch: Implementation of Enformer, Deepmind’s attention network for 873 predicting gene expression, in Pytorch [Internet]. Github; [cited 2024 May 15]. Available from: 874 https://github.com/lucidrains/enformer-pytorch 875 876 .CC-BY-NC 4.0 International licenseavailable under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint

Text is read by the "Ask this paper" AI Q&A widget below. Extraction quality varies by source — PMC NXML preserves structure cleanly, OA-HTML may include some navigation residue, and OA-PDF can have broken hyphenation. The publisher copy (via DOI) is the canonical version.

My notes (saved in your browser only)

⚙ Ask this paper AI returns verbatim quotes from the full text · source: oa-pdf ⓘ

Answers must be backed by verbatim quotes from this paper's full text. Hallucinated quotes are dropped automatically; if no verbatim passage answers the question, we say so. How this works

Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2024) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00
unpaywall: last seen: 2026-05-23T02:00:01.238055+00:00

License: CC-BY-NC-4.0