Background
30
Different cell types and their associated functionalities emerge from a single genomic sequence 31
when certain regions are expressed while others remain silenced. Modeling gene expression and 32
its potential malfunctioning in different cellular contexts is hence pivotal to understand both 33
development and disease. 34
35
Results
36
We present the Chromatin Landscape and Structure to Expression Regressor (CLASTER), an 37
epigenetic-based deep neural network that can integrate different data modalities describing the 38
chromatin landscape and its 3D structure. CLASTER effectively translates them into nascent 39
transcription levels measured by EU-seq at a kilobasepair resolution. Our predictions reached a 40
Pearson correlation with targets above r=0.86 at both bin and gene levels, without relying on 41
DNA sequence nor explicitly extracted chromatin features. The model mostly used the 42
information found within 10 kbp of the predicted locus, even when a wide genomic region of 1 43
Mbp was available. Explicit modeling of long-range interactions using multi-headed attention 44
and high-resolution chromatin contact maps had little impact on model performance, despite the 45
model correctly identifying elements in these inputs influencing nascent transcription. The 46
trained model served then as a platform to predict the transcriptional impact of simulated 47
epigenetic silencing perturbations. 48
49
Conclusions
50
Our results point towards a rather local, integrative and combinatorial paradigm of gene 51
regulation, where changes in the chromatin environment surrounding a gene shape its context-52
specific transcription. We conclude that the predominant locality and limitations of current 53
machine learning approaches might emerge as a genuine signature of genomic organization, 54
having broad implications for future modeling approaches. 55
56
Keywords
57
EU-seq, nascent transcription, chromatin landscape, Micro-C, deep neural networks 58
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
3
Background
59
A single genome gives rise to a wide variety of cells, with diverse morphological properties and 60
functions. These features of the cells emerge when certain regions of the DNA are expressed 61
while others are kept silent, a process known as gene regulation. The rules governing such 62
regulatory processes in different cellular contexts still constitute a highly debated topic, and their 63
understanding is crucial to study development and disease. Many models of gene regulation have 64
already been proposed approaching the problem at different biological layers [1–5]. At a high 65
level, gene expression is modulated by the interplay of a set of regulatory elements, e.g. 66
enhancers and promoters, which have been shown to be broadly compatible in most cases [6,7]. 67
Their activities are usually described using a number of epigenetic marks and can be combined 68
multiplicatively to shape RNA transcription [7–9]. The establishment of functional pairings 69
between these regulatory elements is then conditioned by their contact frequency, given by the 70
three-dimensional structure of chromatin. Models such as the Activity by Contact (ABC) [10] 71
and the more recent ENCODE-rE2G [11] exploit these features. 72
73
Regulatory elements and their associated genes are ultimately encoded in the DNA. Many 74
computational approaches have therefore put a special emphasis on the genomic sequence, 75
identifying patterns and motifs, integrating them and predicting a set of output tracks describing 76
various downstream genomic and epigenomic processes [1–5]. The advent of large language 77
models (LLMs) is also powering a new wave of sequence-based machine learning models such 78
as Nucleotide Transformer [12] , Borzoi [13] , DNABERT [14] , HyenaDNA [15] or Evo [16]. 79
These present improvements in several fronts, e.g., being more robust to genomic variability, 80
extending the studied genomic regions or achieving higher resolutions while bringing down the 81
computational costs. Finally, new advances in the 3D organization field and chromatin 82
conformation capture techniques have brought to light finer structural details (Micro-C) [17,18] 83
and also show promise to improve predictive modeling [19,20]. 84
85
However, this explosion in the genomics analysis toolbox is mostly inherited from the computer 86
vision and natural language processing fields, which pursue different objectives and describe 87
data modalities that attend to highly different sets of rules. This is best illustrated by the fact that 88
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
4
even some of the most advanced machine learning models rely primarily on the close 89
neighborhood of the predicted points, without paying attention to most of the sequence provided 90
[9]. This is of particular interest in the case of attention-based deep learning models, which 91
handle the complete sequences at once and therefore do not impose any spatial constraints. This 92
raises the question: are these machine learning models predominantly local because of current 93
technical limitations and training schemes, or is this a property of the underlying data 94
distribution, i.e., a signature of genomic organization induced by elements such as evolutionary 95
pressure? 96
97
Here we present Chromatin Landscape and Structure to Expression Regressor (CLASTER), a 98
deep convolutional neural network aimed to learn the mapping from a given chromatin landscape 99
and its three-dimensional structure to the corresponding nascent RNA transcription profiles. 100
CLASTER constitutes a cell-type agnostic model that integrates properties of both chromatin 101
state and structure without relying on the genomic sequence. The model can identify patterns 102
between chromatin marks and across the genome at several levels of abstraction and combine 103
them in a non-linear fashion. Of note, the multiplicative nature of enhancer and promoter 104
activities deeply motivates the use of non-linearities, which would not be captured by linear 105
models [1]. Similar principles have already been successfully applied in gene expression 106
classification [21,22] and regression [19], and contact map prediction [23,24]. Motivated by 107
recent findings of broad compatibility between regulatory elements, we identified the core set of 108
elements required to predict the nascent transcription landscape from an epigenetic perspective. 109
The trained models were then used to explore the power of new perturbational approaches on 110
such data to simulate the transcriptional impact of epigenetic remodeling, which has already 111
shown great therapeutic potential [25]. 112
113
Results
114
CLASTER integrated multi-modal chromatin data to predict nascent RNA levels in a 115
sequence-agnostic manner 116
We developed CLASTER, a deep neural network designed using the EIR framework [26] and 117
trained it to translate a given chromatin landscape and its corresponding high-resolution 3D 118
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
5
contact frequency map to the matching target nascent RNA profiles (Figure 1a, Supplementary 119
Figure 1, Supplementary Table 1 ). Nascent transcription levels were obtained for mouse 120
Embryonic Stem Cells (mESCs) from Narita et al. 2021 [27], where they had been measured by 121
RNA labeling with 5-ethynyl uridine and sequencing (EU-seq). CLASTER was trained to predict 122
kilobasepair resolution EU-seq profiles in a wide region, spanning from -200 kbp to 200 kbp 123
from the TSS of a gene of interest that was located at the center of the predicted window (Figure 124
1b, Supplementary Figure 2, Methods ). C hromosomes 17 and 4 were used to obtain our 125
validation and test set regions, respectively (Supplementary Table 2). 126
127
To perform the predictions, the network used two matrices representing the different data 128
modalities, which were processed in separate branches of the model. The first branch processed 129
four complementary genomic tracks that characterized the epigenetic landscape in a chromatin 130
region of 1Mbp at a 100bp resolution. These four core tracks describe chromatin accessibility 131
(ATAC-seq), promoter activity (H3K4me3), promoter and enhancer activity (H3K27ac) and 132
heterochromatin (H3K27me3), respectively. An optional second branch of the model was aimed 133
at processing high resolution structural data. This was given in the form of Micro-C contact 134
frequency maps exploring also 1Mbp of genomic context at a 1.6 kbp resolution. These maps 135
described contact frequencies between different parts of the genomic region explored in each 136
sample and were obtained for the same cell line from [28]. The model then extracted features in 137
both data modalities separately using a set of dilated convolutions and residual connections 138
(Figure 1a, Supplementary Figure 1 ). Optional multi-headed attention layers were then added 139
to the landscape branch to ease the processing of the sequential appearance of such features and 140
long-range interactions between them. The resulting high-level features were then combined and 141
integrated using a set of fully connected layers, and finally a dense layer connected the last 142
hidden layer to each output separately. 143
144
CLASTER displayed a high test prediction accuracy, with Spearman and Pearson correlations 145
between log transformed predicted values and ground truth signal intensities of r s=0.7717 and 146
rp=0.8690 across all predicted bins ( Figure 1c). Test performances increased to r s=0.9275 and 147
rp=0.8859 when comparing the expected nascent transcription levels inside target gene 148
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
6
boundaries (Figure 1d, Methods), where a lower number of near zero values would be expected.149
In summary, CLASTER predicted nascent RNA levels given the corresponding chromatin150
landscape and its matching high- resolution contact frequency map. This was achieved by151
extracting features at different levels of abstraction from both data modalities without relying on152
the DNA sequence. Of note, these predictions comprised most protein coding genes in153
chromosome 4, with no filters on specific ranges of expression levels nor their enhancer -154
dependence status. 155
156
157
6
d.
tin
by
on
in
-
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
7
Fig 1. CLASTER can translate the chromatin landscape and structure to the 158
corresponding nascent RNA profiles in mESCs. a) Model schema. CLASTER maps the 159
epigenetic landscape of a chromatin region of 1Mbp and its matching contact frequency map to 160
the corresponding nascent expression levels in a window of 401 kbp. b) Targets and predictions 161
around the gene Tmem222 (chr4) for the different models. Prediction ranges were adapted to the 162
context length of each model. c) Density plot showing log-transformed, binwise predictions vs. 163
target values in all test samples, i.e. across protein coding genes in chromosome 4. d) Length 164
normalized nascent transcription levels for the central gene in each predicted sample. 165
Normalized transcription levels were obtained by integrating the area under the EU-seq curve 166
inside the gene boundaries and normalizing by gene length. e) Pearson and Spearman 167
correlations between test predictions and targets at binwise and gene levels for the different 168
models. 169
170
Nascent transcription can be predicted from DNA sequence 171
The enrichment levels of different chromatin marks in a given locus, which define the chromatin 172
states [29], may be interpreted as epigenetic embeddings containing relevant functional 173
information of that genomic region. This makes them somewhat similar in role to the inner 174
sequence representations of DNA-based models like the Enformer [2] or HyenaDNA [15], and 175
therefore we wanted to compare their respective capabilities to predict the nascent RNA 176
landscape described by EU-seq. 177
178
Sequence embeddings of the central 160kbp regions matching our samples were obtained for the 179
Enformer model with pretrained, frozen weights (Supplementary Figure 3, Methods). We then 180
trained a simple head mapping the embeddings to the target EU-seq outputs ( Supplementary 181
Figures 4-5, Methods ). The performance of the model head on Enformer’s pretrained 182
embeddings, i.e., with no fine-tuning nor further training required, achieved Pearson and 183
Spearman correlations of r s=0.8067 and r p=0.8739 for the test predictions ( Figure 1e, 184
Supplementary Figures 6-8 ). However, the target window had to be shortened given 185
Enformer’s narrower context length and cropping of the latent sequence representations (Figure 186
1b, Supplementary Figure 8 ). Sequence embeddings obtained by pretraining the DNA 187
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
8
language models HyenaDNA-160-kbp and 32-kbp in a next nucleotide prediction task could not 188
be used in the same manner to directly predict nascent expression levels in a regression fashion 189
(Methods). The embeddings could however be tailored to our task by fine-tuning the pretrained 190
backbone and the added head together. Predictions using a fine-tuned HyenaDNA-32k model 191
displayed Pearson and Spearman correlation coefficients of rs=0.7514 and rp=0.7403 (Figure 1e, 192
Supplementary Figure 6). 193
In conclusion, DNA-sequence based models trained in a supervised manner like the Enformer 194
provide embeddings encoding valuable functional information, which can be directly probed to 195
predict nascent transcription levels. These results were to be expected since the Enformer was 196
originally trained to predict 1,643 mouse genomic tracks, including RNAPII binding and a 197
number of CAGE experiments that were highly related to our outputs. HyenaDNA, a genomic 198
sequence foundation model, could also be used to predict the nascent transcription landscape. 199
However, sequence embeddings obtained from pretraining HyenaDNA in a next nucleotide 200
prediction task did not encode the information necessary to predict our targets, requiring further 201
fine-tuning of both HyenaDNA backbone and the added model head. 202
203
CLASTER learned the functional roles of chromatin states rather than those of specific 204
marks 205
In silico perturbations of the inputs can allow us to retrieve some of the information learned by 206
the models, while enabling the simulation of arbitrary epigenetic variability. They might be used, 207
for instance, to identify potential transcriptional changes induced by an epigenetic silencing of 208
chromatin, a valuable alternative to genetic knock-out [25]. To this end, we performed virtual 209
perturbational experiments aimed to simulate a loss of the activity of single active enhancers 210
using only the chromatin landscape branch of CLASTER ( Methods). Proximal and distal 211
Enhancer Like Signatures (pELS, dELS), were obtained from [30] (Supplementary Table 3). 212
213
We found that the enrichment levels of different chromatin marks in native chromatin were 214
entangled, i.e., they did not explore all possible values but were only found in a set of allowed 215
combinations encoding chromatin states with diverse functional roles [29] ( Figure 2a ). 216
Perturbing the enhancer mark H3K27ac alone (P1) had a modest impact on the predicted 217
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
9
profiles. An isolated in silico ablation of H3K27ac levels in active enhancer loci yielded still218
active chromatin states with unaltered accessibility and background H3K4me3 levels, unlike219
what would be expected in vivo [27] (Figure 2b, top). The integrative nature of the convolutions220
across tracks in the inputs led the model to map chromatin mark combinations to the221
corresponding RNA landscape instead of assigning independent roles to different input tracks.222
This limited the ability of single mark perturbations to infer biologically meaningful phenot ypes,223
but illustrated that the ratios between marks and their combined contributions rather than isolated224
mark enrichments define the regulatory role of a genomic region [31,32]. 225
226
227
Figure 2. Local silencing of enhancer loci mainly affects the neighboring genes. a)228
Combinations of chromatin mark enrichment values at a given bin sampled from chr 4. Example229
chromatin states and the resulting changes induced by perturbations P1 and P2 are illustrated. b)230
9
till
ke
ns
he
ks.
es,
ed
a)
le
b)
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
10
Predicted output EU-seq profiles from inputs where only the enhancer mark H3K27ac has been 231
ablated at a given enhancer locus (P1) and inputs with silenced chromatin state substitution at the 232
enhancer locus (P2). The ablated enhancer loci, shown as small rectangles, and their 233
corresponding output profiles are depicted using the same color. Shown predictions are centered 234
at the protein coding gene Tmeff1 (ENSMUSG00000028347.14, chr 4). c) Relative change in 235
length normalized area under the EU-seq curve for each gene when silencing an enhancer (P2) at 236
a given distance of the gene’s TSS. Shown are all enhancer-promoter pairs left after 237
preprocessing of test samples. d) Number of cases in which the enhancer ablation affected the 238
most a gene located on either side (adjacent) vs the enhancer skipped a number of genes. Results 239
are shown for P2. 240
241
In silico silencing of enhancer loci induces mainly the downregulation of close genes 242
To better simulate the effects of an epigenetically silenced enhancer, the whole chromatin state at 243
each target enhancer locus was substituted for a heterochromatic-like state (P2), enhancer by 244
enhancer. An arbitrary silenced state was obtained by setting all marks except the silencing 245
H3K27me3 to zero reads ( Methods). The introduced chromatin state substitutions mainly 246
modified the predicted expression profiles around the perturbed point in a distance-dependent 247
fashion, having an influence on a varying number of genes ( Figure 2b, bottom ). Interestingly, 248
the perturbations led to a sustained decrease in EU-seq across the body of the affected domains 249
while keeping the overall structure characterizing the profiles. Clusters of neighboring ELS 250
affected the same region, pointing towards a coordinated action of several enhancers [11] . 251
Furthermore, the degree of downregulation that could be exerted by in silico silencing an active 252
enhancer on any given gene presented a clear inverse relation with the distance separating them, 253
decaying rapidly after a few kilobasepairs ( Figure 2c, Supplementary Table 4 ). From a 254
complementary point of view, the enhancers that had the highest impact on a gene’s nascent 255
transcription levels were in a high fraction adjacent to the gene (ca. 40%), with less and less 256
significant cases skipping increasing numbers of genes (Figure 2d, Supplementary Figure 9). 257
258
To summarize, predicting the EU-seq levels in a genomic region of 401 kbp per sample allowed 259
us to assess the range and degree of influence of a set of tailored in silico perturbations in 260
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
11
enhancer loci. The effects of such perturbations were not propagated towards regions far from 261
the perturbed locus, indicating a predominantly local behavior of the network. CLASTER linked 262
the perturbations in those loci to the adjacent domains with significant EU-seq enrichment, 263
modifying in a distance-dependent manner the transcription profiles and therefore affecting a 264
varying number of genes found around each perturbed locus. 265
266
Most of the provided context was not used for the predictions 267
To disentangle the contributions of all marks, we used the integrated gradients attribution method 268
[33]. This axiomatic attribution mechanism enabled us to quantify the relative importance of the 269
different chromatin marks at all measured loci towards the predicted EU-seq levels at each target 270
locus (Figure 3a). Positive and negative attribution scores indicate direct or inverse associations 271
between input and output enrichment levels, respectively. The resulting maps portrayed the 272
expected relative contributions of all marks at a given distance to an arbitrary predicted point 273
(Methods). 274
275
The promoter mark H3K4me3 was found to be the most important predictor of nascent 276
expression levels when using only the chromatin landscape branch of the model, as it was 277
assigned the highest attribution score (i.e. feature importance) around the predicted point (Figure 278
3b). The averaged attribution scores of the silencing mark H3K27me3 were comparable to those 279
of H3K4me3, while accessibility and the enhancer and promoter mark H3K27ac were assigned 280
lower scores, peaking at around 60% and 40% of that of H3K4me3, respectively. Strikingly, the 281
integrated gradients analysis presented a sharp, power-law decay of the attributions as a function 282
of the distance to the predicted locus. Attribution scores dropped to less than 20% of their peak 283
value after 5 to 10 kbp from the predicted locus, depending on the mark ( Figure 3b). Here, the 284
curves presented an elbow, after which the decay followed at a slower rate. Further than 50kbp 285
away, the attribution scores were almost negligible. Thus, CLASTER relied mainly on a window 286
of 20 kbp around the target locus to perform the predictions, while leaving most of the sequence 287
unattended. This rather local behavior, observed also in the perturbational scenario, contrasts 288
with the fact that 1Mbp of context was available to the network and provided a complementary 289
validation to the results presented in [9]. Of note, a significantly wider field of view than 20kbp 290
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
12
was covered by the dilated convolutions rather early in the network ( Supplementary Table 1),291
and therefore it is improbable that this locality appeared as a limitation of the receptive field of292
the convolutions. 293
294
295
Figure 3. Averaged attributions provide information about chromatin mark significance296
and scale of influence in the prediction task. a) Attribution method schema. Micro- C arrays297
are rotated 45 degrees and corners are cropped to yield rectangular maps (Supplementary298
Figure 8) . Each of the positions in both chromatin arrays and structural arrays are given an299
12
,
of
ce
ys
ry
an
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
13
attribution score towards the prediction of each single output node. Attribution arrays are 300
centered, stacked and averaged to obtain an attribution map as in d). b) Averaged attributions of 301
each genomic track in the chromatin landscape branch. Attributions are centered at the prediction 302
site and rescaled between zero and the maximum attribution score. c) Chromatin landscape 303
attributions when the model is also provided contact maps as input. d) Absolute attribution map 304
for the chromatin structure branch. When overlaid to a contact map region like the blue or black 305
rectangles in a), it conveys the importance that the model attributes on average to each part of the 306
map when predicting nascent transcription levels at the genomic locus corresponding to the top 307
center of the map (the pointer). Color scale ranges from no attribution, i.e. Score=0, to maximum 308
attribution, with score 1. Map smoothed with gaussian kernel of σ =1. e) Average of signed 309
attribution score maps for the structure branch. Scale ranges from -1 to 1, where 0 corresponds to 310
no contribution. Positive values (red) are assigned to regions that contribute to raise output EU-311
seq levels. Negative values (blue) indicate an inverse relationship between input contact signals 312
and output EU-seq levels. Map smoothed with a gaussian kernel of σ =2 to ease the identification 313
of regions. f) Model performance comparison. The addition of contact information or attention to 314
handle long range interactions did not improve the accuracy of the predictions. 315
316
Micro-C maps provided an added layer of information and interpretability to CLASTER 317
The fact that the model was not attending to potential distal interactions motivated us to add 318
high-resolution chromatin contact maps explicitly as inputs to the model. Convolutional neural 319
networks could in principle be sensitive to various structural patterns in such contact matrices, 320
e.g. point-like interactions between separated elements , chromatin compartments and 321
topologically associated domains (TADs), which would not be explicitly accounted for in more 322
classical approaches [10]. 323
324
The use of contact maps in their usual, symmetrical format, induced in fact the creation of 325
several convolutional filters capturing the aforementioned features ( Supplementary Figure 10, 326
left). We found that the symmetry of the original matrices limited the power of image 327
processing-based approaches when studying Micro-C data, as the nuanced features extracted in 328
early stages of the network were lost in deeper layers, where primarily a broad sense of regional 329
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
14
contact was propagated forward (Supplementary Figure 10, right). In addition, complementing 330
the available information with contact maps induced a decrease in the attributions of the model 331
to the promoter mark ( Figure 3c ), suggesting a certain degree of overlap in the information 332
coming from the two data modalities. To introduce contact information in a more appropriate 333
manner, Micro-C matrices were rotated and cropped, and explored by scrolling the convolutional 334
filters on the sequence axis only ( Figure 3a, Supplementary Figure 11, Methods ). This 335
procedure reduced the redundancy of the information and skipped the contact-poor corner areas 336
of the contact maps. 337
338
Following the same steps as before, we proceeded to evaluate the contributions of every bin in a 339
Micro-C matrix towards the prediction of EU-seq levels at the central position ( Figure 3a). The 340
attribution map obtained from absolute scores brought to light two different behaviors of the 341
network depending on the distance in 1D sequence between points in physical contact ( Figure 342
3d). Attributions of bins connecting loci closer than ca. 30 kbp showed a clear peak around the 343
predicted point, i.e., the main contribution to the nascent transcription levels in the predicted 344
locus came from the inputs describing the local environment and interactions of the predicted 345
locus with its close neighborhood at a TAD scale. On the contrary, the network did not use the 346
local neighborhood of loci located more than 30 kbp upstream or downstream of the prediction 347
point. These results were in line with our previous findings based on chromatin state attributions. 348
Interestingly, the network did now attend to regions of the map linking more distal regions. The 349
network’s attributions for all bins connecting loci found at the same distance in 1D sequence 350
were quite similar from 30 kbp on, yielding a certain positional invariance ( Figure 3d ). The 351
main variation in attributions therefore occurred across different distances and was characterized 352
by a sharp increase at around 30kbp. 353
354
To assess whether a high level of contact in a certain region of the map contributed to raising or 355
lowering the expression levels at the predicted locus, we then computed averages of the maps 356
with signed attributions ( Figure 3e, Supplementary Figures 12-13 ). The neighborhood of the 357
predicted point had once more the highest impact on the output, with a higher number of counts 358
or contact in that region contributing the most to raising expression levels. Of note, positive 359
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
15
attribution scores were assigned to loci in physical contact with the predicted locus, indicating 360
also a positive association between input and output signals. This feature was quite conserved 361
across contact distances from 30kbp onwards and suggested a potential boost of expression in the 362
target profiles when a contact in such a region was to be found ( Figure 3e, blue lines) . These 363
Results
indicated that CLASTER could integrate contact information to that already provided in 364
the chromatin landscape, understanding the impact of features in different regions of the contact 365
maps towards the corresponding expression levels. The network now attended to distal 366
interacting elements, but the main contributor to the predictions was still the local environment 367
around the target locus, which displayed a clearly distinct behavior. 368
369
Explicit modeling of distal interactions did not significantly impact the performance 370
The introduction of multi-headed attention has been shown to improve the prediction 371
performance of DNA-based models [2,13,16]. Unlike convolutions, which process information at 372
different scales hierarchically, the attention mechanism allows the models to process all parts of 373
the sequence simultaneously. Chromatin-based models could also benefit from such a 374
mechanism if the sequential appearance of patterns in the input or the interactions between distal 375
regions were key factors to understand the data. We therefore introduced a Transformer Encoder 376
block with two multiheaded attention layers, aiming to process the sequential appearance of the 377
most abstract features in the chromatin landscape only setting ( Figure 1a, Supplementary 378
Figure 1, Supplementary Table 1) . However, a baseline model employing only convolutions 379
on the chromatin landscape branch alone reached rs=0.7814 and rp=0.8852, performing as well as 380
the one incorporating attention layers, which achieved r s=0.7815 and r p=0.8808. ( Figure 3f ). 381
This was still the case when coupling high resolution structural information from Micro-C maps 382
to CLASTER, with r s=0.7717 and r p=0.8690. Despite the model capturing rich features from 383
structural data and understanding their relation to the output as shown previously, the added 384
contact information did not boost the overall ability of CLASTER to fit the output EU-seq 385
profiles. 386
387
Overall, our findings show that predictions could mostly be performed without relying on 388
interactions between highly distal regulatory elements. The fact that dilated convolutions on 389
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
16
chromatin states provided most of the required information to predict EU-seq profiles suggests 390
that the chromatin landscape to RNA mapping problem might be better framed as a pattern 391
recognition and matching task than a sequence processing one when predicting expression at a 392
coarse, kilobasepair resolution. Finally, the final dense layers of CLASTER could already handle 393
long-range interactions of abstract features, and hence the attention mechanism might need to be 394
introduced earlier in the architecture to process lower-level features. 395
396
Discussion
397
Our results support a rather confined gene regulation scenario. A predominantly local regulatory 398
paradigm would seem plausible from the premises that most enhancers can activate most 399
promoters [6,7] and that both frequency and intensity of the interactions between different 400
genomic elements rapidly decay as a function of their separation in 1D sequence [17] and are 401
further constrained by the 3D compartmentalization of chromatin. An increased body of work 402
from different labs would in fact point in this direction from complementary perspectives. First, 403
recent CRISPR-based studies have found variants in noncoding regions to alter mostly their 404
nearmost genes [11,34–36] . Although predictive models have achieved high precisions in their 405
predictions, the naïve baseline model associating a given genomic variant to the closest gene still 406
provides a high sensitivity, i.e. a remarkably high number of true cases are captured under this 407
assumption [34,37]. Global enhancer ablation assays in a native chromatin environment 408
illustrated biases in the distribution of enhancers in the genome, showing a preference for cell 409
type specific genes to be frequently located close to strong enhancers [6]. Finally, recent high 410
resolution chromatin conformation capture assays proposed microcompartmentalization rather 411
than loop extrusion mechanisms to be driving enhancer-promoter interactions [18], linking 412
diseased phenotypes to the disruption of compartment boundaries [38]. 413
414
Despite these findings, the difficulty of state-of-the-art machine learning approaches to capture 415
the impact of distal enhancer variants in gene expression is still the focus of improvement for the 416
next generation of models, which are rapidly benefiting from the latest advances in NLP and 417
have a strong focus on the exploration of larger architectures [12] and the processing of longer 418
context lengths [13,15,19,20]. These efforts contrast with the fact that early transformer-based 419
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
17
models like the Enformer [2] already achieved remarkable results while relying on reduced 420
genomic windows for the predictions [9]. Such behavior, which both the Enformer and 421
CLASTER display, emerges as a consequence of the overall data distribution characterizing the 422
studied system [6]. In other words, the spatial arrangement and organization of different 423
regulatory elements in the genome does not reward the models enough to capture the expression 424
changes induced by highly distal functional interactions, which become weaker [39] and 425
increasingly rare with distance [9,40]. Despite long range interactions being reported to play 426
crucial roles in specific regulatory processes [41–43], the results presented here support the idea 427
that the information required to perform gene expression prediction is in most cases to be found 428
in the close neighborhood of the predicted locus. Neither the addition of contact information 429
matrices nor the introduction of the multi headed attention mechanism, both aimed to ease the 430
processing of long-range interactions, yielded significant changes in overall performance. These 431
findings, which build on top of previous work [6], have wide implications in the way we 432
understand genomic organization and the extent to which it influences the transcriptional 433
landscape, and invite us to rethink how new computational models can be designed to more 434
suitably exploit such intrinsic properties of the genome. 435
436
We show how in silico perturbations of a given chromatin landscape can provide an added layer 437
of explainability to the model’s predictions. These perturbative approaches exploit the rules in 438
the input that the models have learned to explore the effects of unseen epigenetic variability but 439
come with a number of challenges. On the one hand, perturbations of isolated input tracks can 440
fail to mimic the expected biological behavior given the correlations established between the 441
different marks, which define the chromatin states [29]. Chromatin-mark based models aimed to 442
perform new synthetic/in-silico perturbations should either aim to replicate a complete desired 443
chromatin state in a given locus of interest or successfully disentangle the roles of different 444
marks. Sequence-based models might be less prone to display such correlational effects given 445
that the one-hot encoding of nucleotides makes in-silico SNPs, insertions and deletions in-446
distribution alternative inputs for a given locus. However, the same type of correlational 447
dynamics and co-appearance patterns might be learned between different parts of the sequence. 448
In fact, Enformer has been shown to incorrectly predict the direction of the effects of certain 449
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
18
genomic variants, in particular for non-monogenic traits that would imply the combined action of 450
several loci [44,45]. These dynamics could constitute a hint towards the observation of a similar 451
correlational learning framework as the one we reported in our first perturbational simulations, in 452
this case across the sequence. 453
454
Left to be explored are better ways to couple high resolution contact matrices with chromatin 455
mark profiles, with graph attention networks already showing great promise [19]. Nevertheless, 456
the high sequencing depth requirement on higher resolution chromatin conformation maps and 457
the related costs compromise scalability for machine-learning tasks. Our findings would motivate 458
the exploration of region capture techniques to map shorter regions at a more fine-grained level 459
[18]. Further work should aim to move from a mostly correlational learning framework to a more 460
causal one, which is of utmost importance to determine the directionality of the association 461
between transcription and the presence of certain chromatin marks. Training chromatin-based 462
models in a GPT [46] or BERT [47] inspired masking fashion would perhaps provide the 463
opportunity to do so, disentangling the roles of different epigenetic marks while studying 464
sequences from varied cell type backgrounds. Finally, transcription factors and their binding 465
profiles are a key missing piece of our models, with their addition as extra input tracks being the 466
next logical step. 467
468
Conclusions
469
We present CLASTER, a deep convolutional neural network that can be used to build an implicit 470
gene regulation model, agnostic to DNA sequence and without relying on the manual extraction 471
of a number of curated features. CLASTER is capable of understanding the relations established 472
between different epigenetic marks describing a given chromatin state, how these states are 473
combined to create a chromatin landscape, and how a given landscape and its matching three-474
dimensional arrangement can be translated into nascent transcription levels. Given its 475
implementation using the EIR framework, CLASTER can easily be extended to new datasets, 476
data modalities and tasks. Therefore, we expect our work to serve as an inspiration and an easy 477
entry-point to epigenomics modeling for a broad range of researchers with diverse computational 478
resources and scientific backgrounds. 479
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
19
480
Methods
481
Processing of chromatin mark array data 482
Normalized enrichment levels for ATAC-seq, H3K4me3, H3K27ac and H3K27me3 in mESCs 483
were obtained at a 100bp resolution from publicly available data in the NCBI GEO under 484
accession numbers GSE146328 and GSE186349 . Each sample consisted of an array of shape 485
(Nchrom marks , Nbins) = (4, 10001) where each row was a different mark and each column a 100bp 486
bin with the mark’s averaged enrichment level. Samples were centered at the TSS of a gene of 487
interest. Note that a given gene could appear in different samples but at different locations, if it 488
was close enough to the central predicted gene. A detailed description of the genomic regions 489
constituting our samples and matching targets can be found in Supplementary Table 2, and a 490
guide through data obtention and preprocessing steps can be found at the Github repository of 491
the project https://github.com/RasmussenLab/CLASTER. 492
493
Processing of nascent RNA array data 494
Publicly available nascent transcription profiles matching the input samples were obtained from 495
GSE146328 [27]. Since our aim was to estimate integrated nascent gene expression levels and to 496
study the propagation and impact of in silico perturbations in as many genes as possible per 497
sample, we prioritized the prediction of a wide genomic region over output resolution. 1Mbp 498
regions were originally obtained at a resolution of 20 bp. A gaussian kernel of sigma = 50 was 499
then applied to smooth the signals before averaging them in bins of 1 kbp ( Supplementary 500
Figure 2). This procedure reduced drastically the number of targets while preserving the main 501
source of variation at the kbp scale, speeding up the computations of input-output attribution 502
scores. The resulting profiles were symmetrically cropped to allow all predicted genes to have 503
access to a wide chromatin range. Each target sample was finally chosen to be a region of 401 504
kbp centered at the TSS of the gene of interest. Target profiles were further cropped to fit the 505
target windows of the Enformer, Hyenadna-160k and Hyenadna-32k, detailed below. 506
507
Processing of micro-C contact maps 508
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
20
Micro-C matrices describing chromatin contact frequencies in mESCs were obtained at a 509
resolution of 1600bp per bin from NCBI GEO under accession number GSE130275 [28]. Lower 510
resolution mcool representations started to lose the nuances in structural detail provided by 511
MicroC, but higher resolution matrices became too sparse. Bin values had a range of 7 orders of 512
magnitude after balancing, with a minimum of 6,807 · 10-7 and a maximum of 2,001 in our data. 513
NaN values due to balancing and 0 count bins were then set to 1 ·10-14 → log 10=-14. This 514
arbitrarily small value was chosen to be the same distance apart from the minimum actual 515
contact value as the minimum to maximum contact range. Imputing by the minimum 10 -7 could 516
be understood as an averaged homogeneous background contact, which we tried to avoid by 517
giving those bins a clearly distinct behavior. Micro-C matrices were originally treated as 2D 518
arrays of shape (625, 625) containing log 10 transformed bin counts. A list of regions for which 519
Micro-C matching maps could not be obtained is found in Supplementary Table 5. 520
521
Rotations and cropping of Micro-C maps 522
Contact matrices displayed two properties: 1) they were symmetrical and 2) the number of reads 523
decreased when exploring regions far from the diagonal. Treating them as images, i.e. allowing 524
for the convolutions to scroll in both x and y directions, led to an under usage of the model’s 525
capabilities given the redundancy found in both triangles and the lack of information provided by 526
the regions close to the corners far from the diagonal. In order to provide the contact information 527
in a more appropriate manner for a model such as CLASTER, Micro-C matrices were rotated 45 528
degrees and cropped. This procedure allowed us to 1) reduce the number of inputs, accelerating 529
the computations and 2) define deeper 1D kernels exploring all distances at once. These kernels 530
were then only allowed to scroll sideways, in the original diagonal direction. Given the 531
symmetry of the matrices, we kept only the upper triangle. The rotated triangle was cropped to 532
ca. 0.207 times the original length to create a rectangular matrix of shape (129, 626). Distances 533
and scales in the new matrix are illustrated in Supplementary Figure 11 and can be expressed 534
as follows: 535
Expressing a and a/gFαFv in terms of L , we have: 536
2a /gααB4 L /gαv∂v √ L /g2870 /gααB4 L /g2870 /gαv∂v √ 2 L 537
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
21
We get a /gαv∂v /gvu4´ √ /g2870 /g2879/g2869
/g2870 /gvu4α L /gFlv∂ 0.2071067 812 L and a/gFαFv /gαv∂v a/ √ 2 . After rotating and cropping the 538
matrices, we have 626 bins for a smaller window of observation. The width of the new window 539
was then L/ √ 2 /gFlv∂ 0.70710678118 L where L /gαv∂v 1 Mb p . Linear bin densities in the original and 540
transformed Micro-C matrices were: 541
542
λ /g2925/g2928/g2919/g2917/g2919/g2924/g2911/g2922 /gαv∂v
/g2874/g2870/g2873 /g2912/g2919/g2924/g2929
/g2869/g2868/g2868/g2868 /g2921/g2912/g2926 /gαv∂v 0.625 bi n/k bp , λ /g2930/g2928/g2911/g2924/g2929/g2916/g2925/g2928/g2923/g2915/g2914 /gαv∂v
/g2874/g2870/g2874 /g2912/g2919/g2924/g2929
/g2875/g2868/g2875./g2869/g2868/g2874 /g2921/g2912/g2926 /gFlv∂ 0.8853 b in/k b p 543
544
The interaction distance d /g2869/g2888 , i.e. the real separation in 1D sequence between two interacting 545
elements, was implicitly encoded in the new y coordinate y /g2924/g2915/g2933 , which was the distance in the 546
perpendicular direction to the diagonal. A point at a distance y /g2924/g2915/g2933 from the diagonal, found at a 547
column bin y /g2912/g2919/g2924/g2929 in the rotated and cropped matrices, connected the points in the diagonal at 548
/gααB8x /g2912/g2919/g2924/g2929 /gαv∂v /gααB8|y /g2912/g2919/g2924/g2929 | and x /g2912/g2919/g2924/g2929 /gαv∂v| y /g2912/g2919/g2924/g2929 | given that the scale was the same in both axes. This 549
distance can be converted from new bins to kbp in 1D sequence using the diagonal bin density 550
found before as d /g2869/g2888 /gαv∂v
/g2870/g1668/g2935 /g3160/g3167/g3172/g3177
/g2971 /g3178/g3176/g3159/g3172/g3177/g3164/g3173/g3176/g3171/g3163/g3162
. 551
552
Reference
genome, gene annotations and enhancer annotations 553
The mouse reference genome and protein coding gene annotations were downloaded from the 554
GENCODE (mouse; GRCm38 release 25) [48]. Proximal and distal Enhancer-Like Signatures 555
(pELS, dELS) in mice were obtained from Supplementary Table 11 in [30]. 556
557
In silico perturbations 558
In the first perturbational scenario (P1), the loss of enhancer activity was simulated by setting to 559
zero reads the levels of the enhancer mark H3K27ac, for all enhancers i /gαv∂v 1 , . . . , N /g2915/g2924/g2918 inside the 560
target window, enhancer by enhancer. In the second perturbational scenario (P2), a 561
heterochromatin state substitution was performed for all bins inside the enhancer boundaries by 562
setting the levels of all marks to 0 reads except from H3K27me3. This arbitrary silenced-like 563
state was aimed to remove the contributions of accessibility, background promoter and enhancer 564
activities while keeping a plausible level of H3K27me3 in the corresponding chromatin 565
environment. We then measured the relative change in averaged nascent transcription that the 566
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
22
Reference
gene in each sample experienced when ablating every single active enhancer falling 567
inside the predicted region of the genome. The relative change between baseline and perturbed 568
input predictions was computed as: 569
570
Δ /g2928/g2915/g2922 /g2919/g2920 /gαv∂v
| A /g2920 /g3160/g3159/g3177/g3163/g3170/g3167/g3172/g3163 /gααB8 A /g2920 /g3167 /g3174/g3163/g3176/g3178/g3179/g3176/g3160/g3163/g3162 |
A /g2920 /g3160/g3159/g3177/g3163/g3170/g3167/g3172/g3163
571
Here, A /g2920 /g3160/g3159/g3177/g3163/g3170/g3167/g3172/g3163 and A /g2920 /g3167 /g3174/g3163/g3176/g3178/g3179/g3176/g3160/g3163/g3162 described the area under the EU-seq curves inside the boundaries 572
of reference gene j normalized by that gene’s length, when predicting the output with the control 573
chromatin landscape and the output from the same chromatin landscape with the i /g2930/g2918 enhancer 574
ablated. To integrate the area under reference genes shorter than 1 kbp or spanning more than the 575
prediction window’s boundaries we used widths of 1 bin and half of the prediction window’s 576
length, since the TSS was located in the center. Only active enhancers were kept for the analyses, 577
where an active enhancer was defined as a pELS or dELS region with an average enrichment of 578
H3K27ac above a predefined threshold, set to be 10 reads. Enhancer gene pairs for reference 579
genes with zero reads were discarded from the analyses. Since the background distribution of 580
number of genes between enhancers and target genes was not uniform, histograms of the most 581
significant enhancer-gene pairs per sample were shown unnormalized (in the main manuscript) 582
and normalized by the background distribution of pairs present in the test dataset (in the 583
supplementary, Supplementary Figure 9). 584
585
Scoring the contributions of each relative input location to the output 586
To score the contributions of a given input position towards the expression levels in a target 587
locus, we used the widespread integrated gradients axiomatic attribution method [33] as 588
implemented in the Captum python library. Integrated gradients were computed as the path 589
integral of the gradients along the straight line path connecting a predefined baseline signal x ’ 590
and a given input x . The baseline background signal was obtained by averaging 256 samples in 591
each modality. The authors define the integrated gradient along the i /g2930/g2918 dimension as: 592
593
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
23
Int egr a t edGr a ds /g2919 /gvuuux/gvuu4 /gFl´4 : /gαv∂v /gvuuux /g2919 /gααB8x /g2919 /gFαFv /gvuu4 /gαv∂∂ /gαl∂l ∂F/gvuuux´ /gααB4 α /gαv∂∂ /gvuuux /gααB8 x/gFαFv/gvuu4/gvuu4
∂x /g2919
dα
/g2869
/g2961/g2880/g2868
594
As stated before, the chromatin landscape inputs were described by a matrix of shape (N chrom 595
marks, Ninput bins) = (4, 10001) exploring 1Mbp in bins of 100bp. Chromatin structure was given by 596
a matrix of shape (129, 626) describing the same region at a resolution of 1.6 kbp. EU-seq targets 597
were described by a vector of length N output bins = 401 exploring from -200kbp to 200kbp relative 598
to the TSS of a gene of interest in bins of 1 kbp. Signed and absolute attribution scores were then 599
obtained for every bin position in the input arrays towards the prediction of every target position 600
(Supplementary Figure 12). To obtain averages over predicted points, attribution maps for each 601
target node were then rolled in steps of 10 bins = 1kbp to locate the predicted target to the center 602
of the map. High frequency noise arising from the translations of the maps was removed by 603
adding a small uncertainty in the number of bins to roll for the chromatin landscape, adding eps 604
number of bins with -10 bins < eps < +10 bins. A similar smoothing effect was achieved for 605
structural attributions by applying a gaussian filter ( Supplementary Figure 13). The maps were 606
then symmetrically cropped to keep only the relative positions to the predicted point that had 607
been measured in all predictions, with measurements centered at -200 kbp and 200 kbp setting 608
the limits (Supplementary Figure 9). 609
610
CLASTER 611
The architecture of CLASTER is illustrated in Supplementary Figure 1 and further detailed in 612
Supplementary Table 1 . In order to translate the input to the corresponding output, features 613
were extracted separately from both branches of the model, namely the chromatin landscape 614
branch and the chromatin structure branch but using the same principles. These include a set of 615
layers implementing dilated convolutions, residual connections and optional attention heads. 616
Input dimensionality reduction was achieved by scrolling the convolutional kernels in strides of 617
2. The extracted features at different scales were integrated at a later stage in the network, the 618
fusion module. Further dimensionality reduction and mapping to the output nascent RNA levels 619
was performed with a final set of densely connected layers (MLP). The MLP ended in a hidden 620
vector representation, and finally each node was linearly connected to every single prediction 621
point, i.e 401 bins at 1kbp resolution. A final non-trainable ReLU was applied to cut negative 622
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
24
values. Configuration files to build and run the models using the EIR framework [26] in both 623
training and test settings are available at https://github.com/RasmussenLab/CLASTER. 624
625
Training 626
The network was trained in batches of 64 samples and a learning rate of 1e-4. We used the 627
AdamW optimizer with default parameters [49]. Regularization measures included the use of 628
high dropout values in the MLP layers, setting a given stochastic depth (probability to remove 629
residual block) and the use of SmoothL1 as the loss function. Flipped samples were also added 630
to the training set for data augmentation purposes but were not used for the test set analyses. 631
Chr17 samples were used as a hold out validation, and Chr4 was used as a hold out test. A 632
detailed description of the genes of interest that were used to create our samples can be found in 633
Supplementary Table 2. Samples for which we could not obtain Micro-C matching maps were 634
removed from the training, validation or test sets in all settings, and can be found in 635
Supplementary Table 4. 636
637
Benchmarks 638
Sequence embeddings corresponding to the central 160 kbp of the genomic regions of study were 639
obtained from the backbone of HyenaDNA-160k [15] with pretrained, frozen weights. Code to 640
build HyenaDNA-160k was obtained from the publicly accessible google colab [50] and 641
pretrained weights were obtained from [51]. The input sequences when using HyenaDNA-642
160kbp ranged in this case from -80kbp to 80kbp after the TSS of the central gene in the 643
observation window. We then trained a simple model or head on top of the frozen embeddings to 644
predict our EU-seq signals at the same resolution as we did with CLASTER ( Supplementary 645
Figures 3-4). The head performed an average pooling over the sequence axis to reduce sequence 646
length at a 128 bp resolution. To ease the predictions, a first set of convolutions of kernel 9 and 647
dilation 2 were applied to the resulting embeddings. This enabled the head to identify patterns at 648
a scale of ca. 2 kbp. These patterns were then mapped to the targets using a dense layer. 649
650
We tried to perform the regression of the target EU-Seq profiles using the pretrained embeddings 651
from the HyenaDNA-160k, but we encountered a number of challenges. First, in the original 652
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
25
manuscript the authors performed mainly classification tasks at a sequence level, but 653
simultaneous regressions of multiple outputs were not explored. Second, HyenaDNA was 654
pretrained on human and not mouse genomic sequences, although we would not expect this to be 655
a core issue. We believe that the pretrained embeddings obtained by next token prediction did 656
not encode enough functional information to perform the EU-Seq predictions, since further fine-657
tunning of the model was required for our downstream task. This was importantly not the case 658
with Enformer-based embeddings. This fine-tuning requirement might not be unique to 659
HyenaDNA. Finally, Hyena’s design was aimed to keep nucleotide resolution, but this would not 660
be necessary to predict averages of read counts at a coarse, kbp resolution. Despite the low 661
parameter count and high efficiency of the Hyena operator, which reduced drastically the number 662
of trainable parameters, the network’s design yielded a high number of inner sequence 663
representations, the activations at each filter in each layer, that kept sequence length (160,000 664
columns) while adding an embedding dimension (256 rows, e.g.). Each single sequence 665
representation as a matrix of float32 values required: 666
667
160 .000 /gαv∂∂ 256 val u e s /gαv∂∂ 32 bits
1 val u e /gαv∂∂ 1 b yt e
8 bi ts /gαv∂∂ 1 Gb
1024 /g2871 byte s /gFlαα0 .1526 G B
668
with many of them being stored already in the forward pass. This imposed a significant 669
challenge, rapidly increasing memory and computing time requirements, which constrained us to 670
fine-tune Hyena-160k in batches of only 2 sequences in an NVIDIA A100 GPU with 80GB of 671
RAM. This motivated us to use the smaller Hyena-32k model, for which we could load the 672
pretrained weights and fine-tune the pretrained backbone together with the added head in batches 673
of 16 samples in a reasonable period of time. Targets were cropped to match the new context 674
length to the range -15kbp to +15kbp. 675
676
The same procedure was repeated for the Enformer architecture [2]. Sequence embeddings 677
corresponding to the central 160kbp of the genomic regions of study were obtained from the 678
backbone of Enformer [2] with pretrained, frozen weights. Code to build the Enformer in 679
Pytorch and matching pretrained weights were obtained from [52]. To match the previous 680
scenario when using the Enformer, we used the same 160kbp sequences as in Hyena and 681
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
26
symmetrically prepended and appended N nucleotides as proposed in [9], which acted as masks, 682
to provide the same available information in both settings. The embeddings of the Enformer, of 683
shape (1, 896, 3072) corresponded to 896 /gαv∂∂ 128 bp /gαv∂v 114688 bp /gFlαα 115 k bp . The matching 684
targets were therefore cropped to keep the range between -57kbp to +57kbp from the central 685
gene’s TSS to match the Enformer’s prediction window. 686
687
Availability of data and materials 688
The raw data used in this project was obtained from publicly available sources [27,28,32]. The 689
original bigwig files for chromatin mark enrichment, ATAC-seq and EU-seq can be found at 690
NCBI’s GEO under accession numbers GSE146328 and GSE186349 . Mcool files storing high 691
resolution chromatin contact frequency maps can be found under the accession number 692
GSE130275. Processed files, input and target arrays and configuration files to build CLASTER 693
and reproduce our analyses can be obtained following the python notebooks found at the 694
project’s Github repository https://github.com/RasmussenLab/CLASTER. 695
All code is available at the github repository of the project. EIR can be found at 696
https://github.com/arnor-sigurdsson/EIR. The code of the project was organized in a number of 697
python notebooks covering 0) a tutorial, 1) how to download the raw data from NCBI and 698
preprocess it to create inputs and targets for EIR-based models, 2) how to create, train and test 699
CLASTER using EIR, 2b) how to add custom heads on top of Enformer or HyenaDNA 700
embeddings and fine-tune HyenaDNA-32k with an added custom regression head and finally 3) 701
how to perform the data analysis and create the figures presented in this manuscript. 702
703
Acknowledgments 704
We would like to thank Nils Krietenstein for his valuable input and feedback on the project. We 705
would also like to acknowledge the members of the Rasmussen lab, in particular Justus Gräf and 706
Henry Webel, for our fruitful discussions. We would finally like to thank the Novo Nordisk 707
Foundation Center for Protein Research, the N ovo Nordisk Foundation Center for Basic 708
Metabolic Research and the Novo Nordisk Foundation Center for Genomic Mechanisms of 709
Disease for their support. 710
711
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
27
Competing interests 712
S.R. is the founder and owner of BioAI. The remaining authors declare no conflicts of interest. 713
714
Funding 715
M.P.A was funded by the Novo Nordisk Foundation Copenhagen Bioscience Ph.D. Program 716
grant No. NNF22SA0078231. M.P.A, A.S., T.N., N.K, C.C, and S.R. were supported by the 717
Novo Nordisk Foundation (NNF14CC0001). M.P.A., A.S. and S.R. were supported by the Novo 718
Nordisk Foundation (NNF23SA0084103 and NNF21SA0072102). C.C. was supported by the 719
Novo Nordisk Fo undation (nos. NNF20OC0065482 and NNF22OC0074677). The Novo 720
Nordisk Foundation Center for Protein Research is financially supported by the N ovo Nordisk 721
Foundation (no. NNF14CC0001). 722
723
Author Contributions 724
S.R designed and supervised the project and wrote the manuscript. C.C designed and supervised 725
the project. M.P.A wrote the code, carried out the analyses, interpreted the results, and wrote the 726
manuscript. A.S wrote the code and assisted in the analyses. T.N acquired and preprocessed the 727
data and assisted in the analyses. All authors reviewed and edited the final manuscript. 728
729
Supplementary information 730
Additional file 1. Supplementary Figures 1 to 13. 731
Additional file 2. Supplementary Tables 1 to 5. 732
733
References
734
1. Zhou J, Theesfeld CL, Yao K, Chen KM, Wong AK, Troyanskaya OG. Deep learning 735
sequence-based ab initio prediction of variant effects on expression and disease risk. Nat Genet. 736
2018;50:1171–9. 737
2. Avsec Ž, Agarwal V, Visentin D, Ledsam JR, Grabska-Barwinska A, Taylor KR, et al. 738
Effective gene expression prediction from sequence by integrating long-range interactions. Nat 739
Methods. 2021;18:1196–203. 740
3. Avsec Ž, Weilert M, Shrikumar A, Krueger S, Alexandari A, Dalal K, et al. Base-resolution 741
models of transcription-factor binding reveal soft motif syntax. Nat Genet. 2021;53:354–66. 742
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
28
4. Kelley DR, Reshef YA, Bileschi M, Belanger D, McLean CY, Snoek J. Sequential regulatory 743
activity prediction across chromosomes with convolutional neural networks. Genome Res. 744
2018;28:739–50. 745
5. Chen KM, Wong AK, Troyanskaya OG, Zhou J. A sequence-based global map of regulatory 746
activity for deciphering human genetics. Nat Genet. 2022;54:940–9. 747
6. Narita T, Higashijima Y, Kilic S, Maskey E, Neumann K, Choudhary C. The logic of native 748
enhancer-promoter compatibility and cell-type-specific gene expression variation [Internet]. 749
bioRxiv. 2022 [cited 2023 Sep 1]. p. 2022.07.18.500456. Available from: 750
https://www.biorxiv.org/content/biorxiv/early/2022/07/19/2022.07.18.500456 751
7. Bergman DT, Jones TR, Liu V, Ray J, Jagoda E, Siraj L, et al. Compatibility rules of human 752
enhancer and promoter sequences. Nature. 2022;607:176–84. 753
8. Hong CKY, Cohen BA. Genomic environments scale the activities of diverse core promoters. 754
Genome Res. 2022;32:85–96. 755
9. Karollus A, Mauermeier T, Gagneur J. Current sequence-based models capture gene 756
expression determinants in promoters but mostly ignore distal enhancers. Genome Biol. 757
2023;24:56. 758
10. Fulco CP, Nasser J, Jones TR, Munson G, Bergman DT, Subramanian V, et al. Activity-by-759
contact model of enhancer–promoter regulation from thousands of CRISPR perturbations. Nat 760
Genet. 2019;51:1664–9. 761
11. Gschwind AR, Mualim KS, Karbalayghareh A, Sheth MU, Dey KK, Jagoda E, et al. An 762
encyclopedia of enhancer-gene regulatory interactions in the human genome [Internet]. bioRxiv. 763
2023 [cited 2023 Nov 21]. p. 2023.11.09.563812. Available from: 764
https://www.biorxiv.org/content/biorxiv/early/2023/11/13/2023.11.09.563812 765
12. Dalla-Torre H, Gonzalez L, Mendoza-Revilla J, Carranza NL, Grzywaczewski AH, Oteri F, 766
et al. The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for 767
Human Genomics [Internet]. bioRxiv. 2023 [cited 2023 Nov 2]. p. 2023.01.11.523679. Available 768
from: https://www.biorxiv.org/content/biorxiv/early/2023/09/19/2023.01.11.523679 769
13. Linder J, Srivastava D, Yuan H, Agarwal V, Kelley DR. Predicting RNA-seq coverage from 770
DNA sequence as a unifying model of gene regulation [Internet]. bioRxiv. 2023 [cited 2023 Nov 771
2]. p. 2023.08.30.555582. Available from: 772
https://www.biorxiv.org/content/biorxiv/early/2023/09/01/2023.08.30.555582 773
14. Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder 774
Representations from Transformers model for DNA-language in genome. Bioinformatics. 775
2021;37:2112–20. 776
15. Nguyen E, Poli M, Faizi M, Thomas A, Birch-Sykes C, Wornow M, et al. HyenaDNA: 777
Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution [Internet]. arXiv 778
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
29
[cs.LG]. 2023. Available from: http://arxiv.org/abs/2306.15794 779
16. Nguyen E, Poli M, Durrant MG, Thomas AW, Kang B, Sullivan J, et al. Sequence modeling 780
and design from molecular to genome scale with Evo [Internet]. bioRxiv. 2024 [cited 2024 Mar 781
11]. p. 2024.02.27.582234. Available from: 782
https://www.biorxiv.org/content/10.1101/2024.02.27.582234v2 783
17. Krietenstein N, Abraham S, Venev SV, Abdennur N, Gibcus J, Hsieh T-HS, et al. 784
Ultrastructural details of mammalian chromosome architecture. Mol Cell. 2020;78:554–65.e7. 785
18. Goel VY, Huseyin MK, Hansen AS. Region Capture Micro-C reveals coalescence of 786
enhancers and promoters into nested microcompartments. Nat Genet. 2023;55:1048–56. 787
19. Karbalayghareh A, Sahin M, Leslie CS. Chromatin interaction-aware gene regulatory 788
modeling with graph attention networks. Genome Res. 2022;32:930–44. 789
20. Bigness J, Loinaz X, Patel S, Larschan E, Singh R. Integrating Long-Range Regulatory 790
Interactions to Predict Gene Expression Using Graph Convolutional Networks. J Comput Biol. 791
2022;29:409–24. 792
21. Singh R, Lanchantin J, Robins G, Qi Y. DeepChrome: deep-learning for predicting gene 793
expression from histone modifications. Bioinformatics. 2016;32:i639–48. 794
22. Chen Y, Xie M, Wen J. Predicting gene expression from histone modifications with self-795
attention based neural networks and transfer learning. Front Genet. 2022;13:1081842. 796
23. Yang R, Das A, Gao VR, Karbalayghareh A, Noble WS, Bilmes JA, et al. Epiphany: 797
predicting Hi-C contact maps from 1D epigenomic signals. Genome Biol. 2023;24:134. 798
24. Feng F, Yao Y, Wang XQD, Zhang X, Liu J. Connecting high-resolution 3D chromatin 799
organization with epigenomics. Nat Commun. 2022;13:2054. 800
25. Cappelluti MA, Mollica Poeta V, Valsoni S, Quarato P, Merlin S, Merelli I, et al. Durable 801
and efficient gene silencing in vivo by hit-and-run epigenome editing. Nature [Internet]. 2024; 802
Available from: http://dx.doi.org/10.1038/s41586-024-07087-8 803
26. Sigurdsson AI, Louloudis I, Banasik K, Westergaard D, Winther O, Lund O, et al. Deep 804
integrative models for large-scale human genomics. Nucleic Acids Res. 2023;51:e67. 805
27. Narita T, Ito S, Higashijima Y, Chu WK, Neumann K, Walter J, et al. Enhancers are 806
activated by p300/CBP activity-dependent PIC assembly, RNAPII recruitment, and pause 807
release. Mol Cell. 2021;81:2166–82.e6. 808
28. Hsieh T-HS, Cattoglio C, Slobodyanyuk E, Hansen AS, Rando OJ, Tjian R, et al. Resolving 809
the 3D Landscape of Transcription-Linked Mammalian Chromatin Folding. Mol Cell. 810
2020;78:539–53.e8. 811
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
30
29. Ernst J, Kellis M. Chromatin-state discovery and genome annotation with ChromHMM. Nat 812
Protoc. 2017;12:2478–92. 813
30. ENCODE Project Consortium, Moore JE, Purcaro MJ, Pratt HE, Epstein CB, Shoresh N, et 814
al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature. 815
2020;583:699–710. 816
31. Huang C, Helin K. Catching active enhancers via H2B N-terminal acetylation. Nat. Genet. 817
2023. p. 525–6. 818
32. Narita T, Higashijima Y, Kilic S, Liebner T, Walter J, Choudhary C. Acetylation of histone 819
H2B marks active enhancers and predicts CBP/p300 target genes. Nat Genet. 2023;55:679–92. 820
33. Sundararajan M, Taly A, Yan Q. Axiomatic Attribution for Deep Networks [Internet]. arXiv 821
[cs.LG]. 2017. Available from: http://arxiv.org/abs/1703.01365 822
34. Nasser J, Bergman DT, Fulco CP, Guckelberger P, Doughty BR, Patwardhan TA, et al. 823
Genome-wide enhancer maps link risk variants to disease genes. Nature. 2021;593:238–43. 824
35. Morris JA, Caragine C, Daniloski Z, Domingo J, Barry T, Lu L, et al. Discovery of target 825
genes and pathways at GWAS loci by pooled single-cell CRISPR screens. Science. 826
2023;380:eadh7699. 827
36. Gasperini M, Hill AJ, McFaline-Figueroa JL, Martin B, Kim S, Zhang MD, et al. A Genome-828
wide Framework for Mapping Gene Regulation via Cellular Genetic Screens. Cell. 829
2019;176:1516. 830
37. Mountjoy E, Schmidt EM, Carmona M, Schwartzentruber J, Peat G, Miranda A, et al. An 831
open approach to systematically prioritize causal variants and genes at all published human 832
GWAS trait-associated loci. Nat Genet. 2021;53:1527–33. 833
38. Kim KL, Rahme GJ, Goel VY, El Farran CA, Hansen AS, Bernstein BE. Dissection of a 834
CTCF topological boundary uncovers principles of enhancer-oncogene regulation. Mol Cell 835
[Internet]. 2024; Available from: http://dx.doi.org/10.1016/j.molcel.2024.02.007 836
39. Zuin J, Roth G, Zhan Y, Cramard J, Redolfi J, Piskadlo E, et al. Nonlinear control of 837
transcription through enhancer-promoter interactions. Nature. 2022;604:571–7. 838
40. Michaud EJ, Liu Z, Girit U, Tegmark M. The Quantization Model of Neural Scaling. Thirty-839
seventh Conference on Neural Information Processing Systems [Internet]. 2023. Available from: 840
https://openreview.net/forum?id=3tbTw2ga8K 841
41. Lettice LA, Heaney SJH, Purdie LA, Li L, de Beer P, Oostra BA, et al. A long-range Shh 842
enhancer regulates expression in the developing limb and fin and is associated with preaxial 843
polydactyly. Hum Mol Genet. 2003;12:1725–35. 844
42. Uslu VV, Petretich M, Ruf S, Langenfeld K, Fonseca NA, Marioni JC, et al. Long-range 845
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
31
enhancers regulating Myc expression are required for normal facial morphogenesis. Nat Genet. 846
2014;46:753–8. 847
43. Claussnitzer M, Dankel SN, Kim K-H, Quon G, Meuleman W, Haugen C, et al. FTO Obesity 848
Variant Circuitry and Adipocyte Browning in Humans. N Engl J Med. 2015;373:895–907. 849
44. Sasse A, Ng B, Spiro AE, Tasaki S, Bennett DA, Gaiteri C, et al. Benchmarking of deep 850
neural networks for predicting personal gene expression from DNA sequence highlights 851
shortcomings. bioRxiv [Internet]. 2023; Available from: 852
http://dx.doi.org/10.1101/2023.03.16.532969 853
45. Huang C, Shuai RW, Baokar P, Chung R, Rastogi R, Kathail P, et al. Personal transcriptome 854
variation is poorly explained by current genomic deep learning models. Nat Genet. 855
2023;55:2056–9. 856
46. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language Models are 857
Unsupervised Multitask Learners. OpenAI [Internet]. 2019; Available from: 858
https://dcmpx.remotevs.com/net/cloudfront/d4mucfpksywv/SL/better-language-859
models/language_models_are_unsupervised_multitask_learners.pdf 860
47. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional 861
Transformers for Language Understanding [Internet]. arXiv [cs.CL]. 2018. Available from: 862
http://arxiv.org/abs/1810.04805 863
48. Frankish A, Diekhans M, Jungreis I, Lagarde J, Loveland JE, Mudge JM, et al. GENCODE 864
2021. Nucleic Acids Res. 2021;49:D916–23. 865
49. Loshchilov I, Hutter F. Decoupled Weight Decay Regularization. International Conference 866
on Learning Representations [Internet]. 2019. Available from: 867
https://openreview.net/forum?id=Bkg6RiCqY7 868
50. Google colab [Internet]. [cited 2024 May 15]. Available from: 869
https://colab.research.google.com/drive/1wyVEQd4R3HYLTUOXEEQmp_I8aNC_aLhL 870
51. LongSafari [Internet]. [cited 2024 May 15]. Available from: 871
https://huggingface.co/LongSafari 872
52. Wang P. enformer-pytorch: Implementation of Enformer, Deepmind’s attention network for 873
predicting gene expression, in Pytorch [Internet]. Github; [cited 2024 May 15]. Available from: 874
https://github.com/lucidrains/enformer-pytorch 875
876
.CC-BY-NC 4.0 International licenseavailable under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (whichthis version posted June 6, 2024. ; https://doi.org/10.1101/2024.06.04.597340doi: bioRxiv preprint
Text is read by the "Ask this paper" AI Q&A widget below.
Extraction quality varies by source — PMC NXML preserves structure
cleanly, OA-HTML may include some navigation residue, and OA-PDF can
have broken hyphenation. The publisher copy
(via DOI)
is the canonical version.