Abstract
11
12
Understanding how the brain represents sensory information and triggers behavioural responses is a 13
fundamental goal in neuroscience. Recent advances in neuronal recording techniques aim to progress towards 14
this milestone, yet the resulting high dimensional responses are challenging to interpret and link to relevant 15
variables. In this work, we introduce SPARKS, a model capable of generating low dimensional latent 16
representations of high dimensional neural recordings. SPARKS adapts the self-attention mechanism of large 17
language models to extract information from the timing of single spikes and the sequence in which neurons 18
fire using Hebbian learning. Trained with a criterion inspired by predictive coding to enforce temporal 19
coherence, our model produces interpretable embeddings that are robust across animals and sessions. 20
Behavioural recordings can be used to inform the latent representations of the neural data, and we 21
demonstrate state-of-the-art predictive capabilities across diverse electrophysiology and calcium imaging 22
datasets from the motor, visual and entorhinal cortices. We also show how SPARKS can be applied to large 23
neuronal networks by revealing the temporal evolution of visual information encoding across the hierarchy of 24
the visual cortex. Overall, by integrating biological mechanisms into a machine learning model, we provide a 25
powerful tool to study large-scale network dynamics. SPARKS’ capacity to generalize across animals and 26
behavioural states suggests it is capable of estimating the internal latent generative model of the world in 27
animals, paving the way towards a foundation model for neuroscience. 28
29
Introduction
30
31
Recent advancements in both imaging techniques and electrophysiology have enabled neuroscientists to 32
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
capture neuronal activity at an unprecedented scale [1]. Experimental scientists now benefit from the 33
capacity to simultaneously record hundreds to thousands, and even a million neurons over extended periods 34
[2], [3], [4], [5]. Interpreting this neural activity and relating it to sensory, motor, and cognitive processes is a 35
key goal of neuroscience [6], yet this task remains inherently challenging for a number of reasons. 36
First of all, neural recordings are high dimensional by nature, both spatially – because of the number of 37
neurons recorded, and temporally – because of high temporal resolution or long recording times, making 38
them intractable to us. To circumvent this issue, previous studies often relied on single-neuron or average 39
population analysis, and averaged spike trains to consider firing rates [7], [8]. Whilst convenient for analysis, 40
doing so masks the information found at the level of single neurons and single spikes, and relies on arbitrary 41
choices of bin width for averaging. This results in loss of information which can be crucial for analysing 42
neuronal recordings [9], for instance to understand physiological processes with multiple relevant timescales. 43
44
Unfortunately, conventional signal processing techniques are ill-suited to operate on sparse, discrete, and 45
non-uniformly sampled spike trains, making them inappropriate for the analysis of neural data. The 46
development of effective mathematical frameworks to analyse such data is a nascent area of research [10]. 47
Additionally, a substantial portion of the variance in neural recordings can typically be attributed to 48
behavioural parameters like uninstructed movements, making it harder to find relevant information pertaining 49
to e.g. sensory processing [11]. Lastly, there is substantial variability across trials, sessions and animals 50
reported by numerous studies when trying to link neural activity to relevant metrics [12], [13]. Altogether, 51
these obstacles render the task of making sense of the high dimensional neuronal responses very challenging. 52
53
Focusing on low-dimensional representations of neural data has shown promise, as previous research 54
indicates that information encoded by neural activity in several brain regions and across behavioural tasks 55
can often be represented by a few dimensions [8], [14], [15]. Several machine learning (ML) techniques have 56
been employed to derive meaningful and exploitable low-dimensional representations of high-dimensional 57
neural data [16]. However, these meth58
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
59
ods also often ignore precise spike timing and the intrinsic sequential structure of neural data. This leads to 60
loss of critical temporal information and requires extensive pre-processing and manual curation, limiting 61
standardization and reproducibility [7]. 62
63
To address these challenges, we propose to leverage recent developments in the processing of sequential data 64
via machine learning models, notably with the Transformer model powered by self-attention mechanisms 65
[17]. Self-attention consists in computing relationships between parts of an input signal in order for the model 66
to determine which are important for the task. Following their success in natural language processing (NLP) 67
[3] and multimodal data processing [18], Transformers have been recently applied to neural recordings for 68
behaviour prediction, incorporating precise spike times and unit identities [19]. 69
70
In this work, we propose to use the capabilities of Transformers by designing a self-attention mechanism that 71
leverages single spikes as tokens, that is, individual units in the signal. The Hebbian attention layer 72
introduced in this work relies on computing relationships between spike trains using the biologically inspired 73
mechanism of Hebbian learning. Our main focus is to obtain expressive and robust latent representations to 74
analyse neural recordings across sessions and animals, with the added benefit of state-of-the-art predictive 75
power. Consequently, we rely on a variational autoencoder (VAE) model [20] to produce the desired 76
Figure 1 Overview of SPARKS. (A) Architecture of the proposed variational autoencoder. Blocks 𝝁(⋅) and 𝝈(⋅)
represent the mean and standard deviation of the latents distribution. (B) Illustration of the predictive conditional
distribution: at every time-step, a window 𝜏! of past latent embeddings is used to predict the future 𝜏" samples from the
Reference
signal (C) Detailed view of the Hebbian attention layer. (D) Schematic of the attention blocks, adapted from
[17]. (E) Encoder architecture.
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
embeddings, rather than a Transformer architecture (Figure 1A). Note that conventional, masked VAEs were 77
recently proposed to model neural distributions [21]. Our approach differs in two ways: firstly, instead of 78
fully connected and conventional layers, the encoder consists of attention blocks, the first of which is the 79
proposed Hebbian attention block. Secondly, we introduce a novel criterion to train the system: taking 80
inspiration from predictive coding, the main objective of the approach is to maximize the probability for the 81
system to predict the correct output for a number of time-steps in the future given the current latent 82
embeddings (Figure 1B, Methods). Importantly, a single model can be used to directly compare the latent 83
embeddings obtained from recordings across sessions or animals within the same latent space. 84
85
In this work, we have developed a Sequential Predictive Autoencoder for the Representation of spiKing 86
Signals (SPARKS), and demonstrate that the approach offers several key advantages: training on raw spiking 87
signals or calcium imaging data with minimal preprocessing; directly interpretable low-dimensional latent 88
embeddings; state-of-the-art predictive power from few neurons; unsupervised exploration of the data; and 89
robust encodings across repetitions, sessions, animals, and brain areas. We showcase SPARKS’ capabilities 90
using diverse datasets, highlighting its potential to uncover the underlying topology of neuronal 91
representations and its generalisation capabilities across different sessions and animals. We attribute the wide 92
abilities of SPARKS to the use of a Transformer encoder, equipped with our biologically inspired Hebbian 93
attention layer, and training via the proposed objective, which in combination allows us to sequentially 94
extract information at the single spike level with unparalleled precision. 95
96
Obtaining Latent Embeddings using a Variational Autoencoder 97
with Hebbian Attention 98
To obtain consistent latent embeddings that capture most of the variance from high dimensional neuronal 99
responses, SPARKS combines a variational autoencoder with the Hebbian attention layer and predictive 100
learning rule introduced in this work ([10], Figure 1A-B). Following the seminal work [17], the encoder 101
comprises several attention blocks, each composed of an attention layer (Figure 1C), followed by a fully 102
connected feedforward network with residual connections and batch normalisation (Figure 1D-E). The first 103
attention block implements the Hebbian self-attention mechanism that we detail in the next section, with the 104
following blocks implementing conventional dot-product attention ([17], Figure S 1B). To obtain the latent 105
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
representation of the input data, the output of the last attention block is flattened and projected into the 106
desired latent dimension 𝑑. Throughout, the decoder is a fully connected feedforward neural network, tasked 107
to predict a desired reference signal. The reference signal can be chosen as a general supervisory signal such 108
as behavioural data or sensory inputs to perform supervised learning, or as the input spike train itself to 109
perform unsupervised learning. As we will demonstrate, both approaches can be used in complement of each 110
other. When trained via unsupervised learning, our model already captures most of the variance encoded in 111
the neural data, which allows one to freely explore the data using the low-dimensional representations 112
produced by the encoder. This is particularly interesting to investigate how complex combinations of external 113
and internal parameters are encoded in the data. Such an analysis can be completed by training the model via 114
supervised learning to understand the encoding of external parameters that were recorded during the 115
experiments. 116
117
We train the proposed autoencoder through a novel objective which is obtained by extending causally 118
conditioned distributions [22], [23]. Causally conditioned distributions model the probability of a random 119
variable (RV) given its own past and that of other RVs of interest: in this context, we maximise the 120
probability to output the desired reference signal given the latent embedding produced by the encoder. More 121
specifically, we introduce a novel criterion based on the predictive causally conditioned (PCC) distribution, 122
which consists in maximising the likelihood of future samples from the reference signal given a window of 123
past samples from the latent space (Figure 1B). The reference signal can be chosen either as a relevant 124
timeseries such as behavioural data, or as the input neural signal (Methods). The use of the PCC is inspired 125
by the predictive coding theory, namely that the encodings produced by the generative model in the brain 126
constantly aim at predicting future sensory inputs. 127
128
Applying Self-Attention to Neural Data 129
Self-attention is a key component of the Transformer architecture [17] which consists in computing a 130
similarity measure between different parts of a given input signal to extract contextual information. For 131
instance, in NLP, self-attention measures, for each word in a sentence, the relative importance of the other 132
words to obtain semantic information. How to apply this technique to neural data remains an open question. 133
A growing corpus of works have directly applied traditional attention mechanisms to spiking signals [24], 134
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
[25], [26], but they typically relied on the implicit assumption that spike trains were obtained by the encoding 135
of natural, real-valued signals into spikes. Such signals possess specific spatial and/or temporal structures, but 136
it is unclear if such assumptions stand for neural recordings. 137
138
To overcome this challenge, we propose a self-attention mechanism for neural data based on a biologically 139
inspired principle: we make use of Hebbian mechanisms, which have long been suggested as one of the key 140
mechanisms underlying learning in biological neural networks [27], [28]. This allows us to apply attention 141
spatially across neurons at every time-instant in the recording, with temporality maintained in the model via 142
the computation of eligibility traces for every pair of neurons, that act as built-in temporal encodings (Figure 143
1C, Figure S 1A). This is unlike conventional dot-product attention, which relies on various temporal 144
embedding techniques to introduce temporality across tokens presented simultaneously to the model ([17], 145
Figure S 1B). 146
147
More specifically, we compute spatial attention between neural sequences as the synaptic coefficients 148
obtained at every time instant via spike-time dependent plasticity (STDP). These coefficients are computed 149
between each pair of neurons from the recording of interest, regardless of prior knowledge of connectivity 150
between neurons. The computation of attention coefficients is carried out by maintaining eligibility traces for 151
each pair of neurons, which are calculated using weighted averages of the neurons’ spiking history: they are 152
potentiated when either neuron spikes, and decay exponentially otherwise (Figure 1B). The time constant of 153
the exponential decay and the values by which the pre- and post-synaptic traces are updated at spike times 154
could be measured [29] but are a priori unknown, and we propose to learn these values via backpropagation 155
individually for each pair of neurons. Our method can also be readily adapted to operate using calcium traces 156
obtained through imaging (Methods). The attention coefficients obtained via STDP are stored in a matrix 157
𝑨! ∈ ℝ"×", which can be likened to the dot-product between keys and queries in the conventional dot-158
product attention [17] (Figure 1C, Figure S 1B). These values are then projected into a matrix 𝑾$ of 159
dimension 𝑁 × 𝑑% where 𝑑% is chosen such that 𝑑% ≪ 𝑁 to reduce computational complexity when the 160
number of neurons is high. 161
162
As we will demonstrate throughout, the implementation of an attention mechanism via causal STDP enables 163
the model to capture sequential information. This is key to analysing neural data from brain regions where 164
information is encoded into the precise order of spikes, such as in the entorhinal-hippocampal system during 165
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
spatial navigation [30], as highlighted in the following section. 166
STDP as an Attention Mechanism Captures Sequential Information 167
168
Figure 2 SPARKS reveals oscillatory sequences from MEC data. (A) Illustration of calcium recording in the MEC of 169
head-fixed mice freely running on a wheel, reproduced from [31]. (B) Raster plot of thresholded calcium signals of all 170
cells in an example session (bin size = 129 ms, 𝑛 = 484 cells). Cells were sorted according to their correlation values 171
with one arbitrary cell, in a descending manner. Minutes-long oscillatory sequences appear after sorting. (C) Snippet of 172
pre- and post- synaptic firing patterns (top) and corresponding coefficients obtained after application of STDP (middle), 173
and an alternative symmetric rule (bottom). (D) Raster plot for left-out data (left) and corresponding predicted spike 174
probabilities after training with SPARKS (centre), and the benchmark symmetric rule (right). (E) Two-dimensional 175
embeddings obtained from SPARKS for the whole session after unsupervised training. (F) Oscillatory sequences after 176
ordering of the neuron obtained via SPARKS (top), correlation (middle), and PCA sorting (bottom), and corresponding 177
power spectral densities (PSDs, right). 178
We begin by motivating the use of STDP over alternative Hebbian rules to capture sequential information in 179
neural data. To do so, we analysed recently published data from the medial entorhinal cortex (MEC), where 180
the authors uncovered oscillatory sequences that are propagated through the MEC in the scale of tens of 181
seconds to minutes (Figure 2A-B, [31]). In such sequences, the information is encoded both spatially and 182
temporally, and analysing them requires keeping track of which neurons fired and in what order. We 183
evaluated the relevance of the proposed attention mechanism by comparing it to an alternative symmetric 184
rule, whereby attention coefficients are potentiated regardless of the order in which pre- and post-synaptic 185
neurons fired (Figure 2C). 186
187
We considered a single session recorded with calcium imaging, that lasted roughly 3,600 seconds, and 188
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
trained both SPARKS and the alternative rule via unsupervised learning using the first 3,100 seconds of the 189
recording, to test them on the remaining 500 seconds. We found that both rules obtained a similar test binary 190
cross-entropy (BCE) score, but the symmetric rule was unable to recover the oscillatory sequences from the 191
spiking sequence, as measured by the oscillatory score proposed in [31] (Figure 2D, Figure S 2A, Methods). 192
Furthermore, we found that the 2-dimensional embedding of this data produced by SPARKS follow circular 193
trajectories, hence recovering the ring-shaped topology uncovered in [31] (Figure 2E). This was obtained 194
regardless of whether we ordered the neurons prior to training the model, which is necessary to highlight the 195
oscillatory sequences from this data. We were not able to recover such structure from the embeddings 196
produced by the model with the symmetric rule (Figure S 2B). Interestingly, we also found that if the time 197
constant of the exponential decay 𝜏& and the values by which the pre- and post-synaptic traces are updated at 198
spike times were not learnable, the embeddings produced did not allow us to recover oscillations (Figure S 199
2C). 200
201
From the latent embeddings produced by SPARKS, we were able to retrieve the phase of the oscillations, 202
which allowed us to order neurons according to their correlation to the phase signal (Figure S 2D, Methods), 203
and to obtain clear oscillatory sequences (Figure 2F). To assess the quality of the resulting sorting [31], we 204
calculated the height of the largest peak in the power spectral density (PSD) of the 1-dimensional signal 205
obtained by population decoding of the spike train (Figure 2F, Figure S 2F-G, Methods). SPARKS’ sorting 206
had a peak power of 0.05 mV'. Hz( ) , outperforming the sorting obtained through correlation (0.035 207
mV'. Hz( ) ) and principal component analysis (PCA, 0.007 mV'. Hz( ) ). Overall, by incorporating STDP to 208
capture temporal information from neuronal recordings, SPARKS can uncover the underlying topology of 209
neuronal representations. In the following sections, we further illustrate the numerous capabilities of the 210
model by using datasets acquired from various brain regions through different recording techniques. 211
212
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
Expressive, Low-Dimensional and Robust Encodings 213
214
Figure 3 Expressive latent embeddings and state-of-the-art prediction from somatosensory cortex data. (A) Illustration 215
of a single trial in the reaching task. (B) Average hand positions and corresponding latent embeddings obtained through 216
SPARKS supervised, CEBRA supervised and SPARKS unsupervised learning. (C) Comparison of 𝑅# scores for prediction 217
of hand position. Orange lines represent the median over 5 runs, and each dot an individual run. (D) Evolution of 218
supervised prediction score with latent space dimensionality. (E) Unsupervised vs. supervised prediction scores for the 219
same target variables. Each value represents the mean over 5 individual runs. (F) Consistency of latent embeddings for 220
supervised learning across hyperparameter values: past window (𝜏!), future window (𝜏") and STDP temporal decay (𝜏$). 221
(G) Example unsupervised latent embeddings across different model initialisations. (H) Consistency of latent embeddings 222
for unsupervised learning across prediction window length 𝜏". Shaded areas represent standard deviation over 5 223
individual runs. ***: p < 0.001, Mann-Whitney U-test. r: Pearson correlation coefficient. p: p-value, t-test. 224
225
Next, we consider data obtained from electrophysiology recordings in Brodmann’s area 2 of the 226
somatosensory cortex in monkeys performing a centre-out task (Figure 3A). We use a single recording from 227
the Area2 Bump data set [32], comprising 65 neurons, which was proposed as part of the Neural Latents 228
Benchmark competition [33]. We trained SPARKS via supervised learning to decode the continuous position 229
of the animal’s hand, unless mentioned otherwise. 230
231
One of our main goals is to produce latent representations of the data that can be directly related to relevant 232
parameters. In this case, the latent embeddings obtained via supervised learning are seen to closely match the 233
trajectories of the animal’s hand (Figure 3B). Interestingly, even when training the model via unsupervised 234
learning, we found that the latent trajectories produced by SPARKS are clearly separated across different trial 235
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
conditions. The state-of-the-art model [16] was unable to produce embeddings that can be visually matched 236
to the desired reference trajectories, despite providing representations that are clearly separated across trial 237
directions (Figure 3B). To evaluate the role of the proposed attention mechanism, we trained alternative 238
models using the proposed objective. We found that traditional autoencoders did not produce latent 239
embeddings that allowed us to distinguish between different target directions, regardless of the encoder being 240
recurrent or not (Figure S 3A), which we assumed to be necessary to process time-series data. We also 241
trained a model for which we replaced our Hebbian attention layer with conventional dot-product attention. 242
This model produced less visually informative embeddings, notably at the movement onset (Figure S 3A). 243
Alongside the findings from the previous section, these results motivate the use of the Hebbian attention layer 244
introduced in this work to produce low-dimensional representations of neural recordings. 245
246
247
SPARKS also offers very high predictive power, allowing us to obtain highly accurate reconstruction of tasks 248
variables at a single trial level, such as the trajectory of the hand position in this reaching task (Figure 3C). 249
When training the model via supervised learning, predictive power was quantified as the 𝑅' score obtained 250
using 20% of examples held out for testing (Figure 3C, Methods). This metric is directly obtained from the 251
decoder’s outputs, and we obtain 𝑅' = 91.2% for the decoding of hand position with 𝑑 = 3 latent 252
dimensions. SPARKS outperformed the state-of-the-art benchmark [16], for 3, and even 16 latent dimensions 253
(Figure 3D, R' = 65%, 𝑝 < 0.001; and R' = 88.6%, 𝑝 < 0.001, respectively; one-sided Mann-Whitney U-254
test), as well as all the other models that we implemented (Figure S 3B). For the model proposed in reference 255
[16], predictions were obtained using an auxiliary prediction model, namely a k-nearest neighbour (KNN) 256
regression model. When training SPARKS via unsupervised learning, we can similarly decode hand positions 257
from the latent embeddings via a KNN model. Remarkably, we found that our unsupervised model still 258
achieves higher accuracy than the supervised version of the benchmark, achieving R' = 68.2% with 3 latent 259
dimensions (Figure 3D, 𝑝 < 0.001, one-sided Mann-Whitney U-test). We hence investigated how model 260
parameters affected the prediction capabilities of SPARKS. We found that increasing the length of the past 261
samples window 𝜏* from 1 to 10 ms improved the performance of the model, but performance plateaued 262
beyond this value (Figure S 3C), hinting at relevant timescales of population code and synaptic transmission. 263
For supervised learning, increasing the value of the prediction window 𝜏+ decreases the prediction 264
performance of the model, although it strongly improved model performance for unsupervised learning 265
(Figure S 3D). We also found that selecting a value of the temporal decay 𝜏& that is relevant to the timescale 266
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
of the behaviour was necessary for the model to perform well, with values in the range of 100 to 500 ms 267
performing best. Finally, increasing the latent dimension slightly increased the accuracy of the model, 268
reaching 𝑅' = 93% with a 64-dimensional embedding (Figure 3D), which is relevant to model higher-269
dimensional behaviours [34]. However, note that the high performance SPARKS consistently achieves with 270
only three latent dimensions is key for obtaining highly informative visualizations of neuronal 271
representations. 272
273
In the absence of relevant parameters to perform supervised learning, our model still offers the capacity to 274
uncover the information encoded in the neuronal responses in an unsupervised fashion. To illustrate this, we 275
trained KNN regression models to predict various behavioural variables from the embeddings produced via 276
unsupervised learning. Specifically, we selected the five variables—hand position and velocity, target 277
direction, moment of force applied to the controller, and muscle length—that exhibited the highest 278
unsupervised accuracy. Subsequently, we independently trained our model via supervised learning using each 279
variable as reference signal. We found that the scores obtained from supervised learning exhibit a remarkably 280
high positive correlation with those obtained from unsupervised learning (Pearson correlation coefficient 𝑟 =281
0.91, 𝑝 < 0.01, Figure 3E). The unsupervised model hence encodes into the latent space information about 282
the behavioural parameters found in the neural signal, weighted in accordance with their importance in the 283
spike trains. The resulting latent trajectories can be used to investigate biological questions, for instance by 284
exploring which dimensions encode different trial types, outcomes, or sensory information. 285
286
Finally, the encodings obtained with SPARKS are robust, both when obtained through supervised and 287
unsupervised learning. To quantitatively evaluate consistency across latent spaces, we trained linear 288
regression models between the embeddings obtained across different repetitions of the training procedure 289
with random initialisations, and evaluated the resulting predictive 𝑅' score. For supervised learning, the 290
consistency of the embeddings also remains high for all reference signals (Figure S 3E), with a positive 291
correlation between supervised reconstruction accuracy and consistency (Figure S 3F). This suggests that the 292
more informative the neural recording is about the reference signal, the more the latent representation is 293
constrained by the data. We also found very high levels of consistency for a large range of past samples 294
window length 𝜏*, predictive window 𝜏+ and STDP temporal decay 𝜏& (Figure 3F), but also other model 295
parameters (Figure S 3G). However, the latent representations derived from distinct behavioural variables 296
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
cannot consistently be related to one another via linear regression (Figure S 3E), hinting at the fact that the 297
information encoded for each variable resides within separate lower-dimensional manifolds. When training 298
via unsupervised learning with different random weight initialisations, SPARKS also produces embeddings 299
that are visually similar across repetitions (Figure 3G). Increasing the number of future samples predicted up 300
to 𝜏+ = 10 ms improves the consistency of embeddings across random initialisations (Figure 3H), which can 301
be linked to the optimal value for the parameter 𝜏* for supervised learning. Taken together, these results 302
demonstrate how SPARKS produces meaningful and robust low-dimensional representations of neural data, 303
providing state-of-the-art performance with only tens of neurons. 304
305
Reconstructing Visual Stimuli in Passive Mice 306
307
308
Figure 4 Frame reconstruction and prediction from primary visual cortex recordings. (A) Schematic of Neuropixels 309
recordings in passive mice viewing a naturalistic film. (B) Illustration of the frame reconstruction task: SPARKS was 310
trained to reconstruct actual frames from neural recordings. (C) Example frames from the naturalistic movie and 311
corresponding reconstructions using 100 neurons recorded from mouse primary visual cortex. (D) 1s decoding accuracy 312
as a function of number of neurons for prediction within block (yellow), across two recording blocks separated by 90 313
minutes (brown), and when training on data from both blocks (black). Within block decoding accuracy is shown for [16] 314
(pink). Error bars represent standard deviations over five repetitions with random sampling of individual units from all 315
recordings. (E) Representative examples of two-dimensional embeddings obtained from the neuronal responses to 316
natural movies. From left to right: SPARKS trained for frame reconstruction, SPARKS trained for frame prediction, 317
CEBRA trained for frame prediction, SPARKS trained via unsupervised learning. (F) Normalised pairwise distances 318
between frames, computed as inverse, normalised 2-dimensional correlation coefficients; and normalised Euclidian 319
distances between 2-dimensional frame embeddings produced by each model, obtained on a single test example. Same 320
order as in (D). 321
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
322
Next, we asked how our model can be used to decode higher-dimensional stimuli such as natural movie 323
responses, as opposed to low-dimensional parameters such as the direction of the target in the previous task. 324
To this end, we analysed Neuropixels recordings in the primary visual cortex obtained from the Allen visual 325
coding dataset [35]. We focused on passive recordings in mice viewing a 30 second excerpt from the opening 326
of the movie Touch of Evil (Figure 4A). In each session, the movie was presented 20 times in two blocks of 327
10 repetitions, separated by approximately 90 minutes. As illustrated in Figure 4B, we tackled the task of 328
reconstructing individual movie frames from neural activity. To do so, we constructed a pseudo-mouse by 329
stacking the recordings from individual units, chosen at random across all 58 sessions. We started by training 330
SPARKS on the first 9 presentations of the movie from the first block, leaving the last one out for testing, a 331
procedure that we refer to as testing within block. We show examples of reconstructed frames in Figure 4C, 332
demonstrating that the model captures high-level scene context, as well as small, fast-moving details (Suppl. 333
Movie 1). 334
335
For numerical evaluation, we tested our model on the easier task of predicting frame IDs (i.e., indices ranging 336
from 0-900, 30s movie, 30 fps, Figure 4D). Note that the state-of-the-art reference [16] used as benchmark 337
did not operate directly on the recordings, but first extracted features from the neural data using a 338
Transformer network [16]. Yet, for prediction within block, our model is seen to outperform the best results 339
obtained with the benchmark ([16], 87.4% accuracy, 𝑁 = 1,000) using 5 times fewer neurons (SPARKS, 340
92.3% accuracy, 𝑁 = 200, 𝑝 < 0.01, one-sided Mann-Whitney U-test), suggesting that SPARKS is able to 341
capture key information from the neural data within a small neuronal population. 342
343
We also tackled the more challenging task of prediction across blocks. To this end, we tested our model on 344
the second set of 10 presentations after training on all examples from the first block, with the two blocks 345
separated by 90 minutes. Analysing neural activity recorded over periods ranging from tens of minutes to 346
hours or more is known to be a difficult task due to a phenomenon known as neural drift [36], [37], [38], 347
[39]. Despite these challenges, we found that our model maintains remarkably high accuracy, and even 348
performs as well for predictions across blocks as it does within the same block when 𝑁 = 1,000 neurons are 349
used. 350
351
In the latent space, SPARKS assigns data points from test examples in the second block close to where 352
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
corresponding points from the training set lie in the embedding space (Figure S 4B). The model is hence 353
capable of ignoring the noise caused by neural drift to only extract the meaningful information from the 354
neural data. However, we obtained worse performance and higher variance across repetitions for prediction 355
across blocks when using fewer than 1,000 neurons, indicating that neurons are unevenly impacted by the 356
effects of neural drift across sessions and animals. We also found that training using the first 9 repetitions 357
from both blocks maintains the performance of the model (Figure 4D) compared to training using a single 358
block. This highlights the capacity of SPARKS to find robust latent representations across neural recordings 359
captured over long periods. Additionally, we performed frame prediction using spiking data obtained from 360
pre-processed calcium recordings sessions from the same dataset [40]. We found that SPARKS still performs 361
similarly when decoding information from calcium dynamics (Figure S 4C), establishing our algorithm as the 362
current state-of-the-art for prediction from both types of neural recordings. 363
364
To evaluate how the latent representations can give insights about the structure of the input information, in 365
this case the natural scenes, we compared the embeddings obtained through supervised and unsupervised 366
learning after training on the first block using 200 randomly chosen units. Training was carried out via 367
unsupervised learning, or via supervised learning by performing either frame reconstruction or classification. 368
After training via supervised learning, we found that the embeddings produced with both targets form neural 369
trajectories that are closely matched (Figure 4E). Although the neural trajectory obtained via unsupervised 370
learning differs from the other two, we still find that the embeddings of frames from different sections of the 371
movie are clearly distinct in the latent space. To quantitatively evaluate these embeddings, we measured 372
normalised pairwise distances between frame embeddings produced by each model. As a reference, we also 373
computed normalised distances between movie frames (Figure 4E, Methods). We find that in each case, the 374
frame embeddings produced by SPARKS are close in space for frames that are close in time. Remarkably, a 375
structure that can be likened to that of the movie correlations emerges not only for the supervised model, but 376
also for the one trained via unsupervised learning (Figure 4F). Conversely, despite enforcing very strong 377
structure, the embeddings produced by the benchmark model closely position in the latent space frames that 378
are distant in time or correspond to different movie scenes. We were not able to recover such structure from 379
either movie or neural correlations (Figure 4F, Figure S 4D), which implies that previous methods that ignore 380
the sequential nature of neural signals do not allow for their faithful representation. 381
382
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
Learning Across Sessions and Animals 383
384
385
Figure 5 SPARKS provides stable embeddings across sessions and animals. (A) 1s frame decoding accuracy for 386
prediction of frame IDs with SPARKS for 5 example sessions. Orange lines represent the median over 5 runs, and each 387
dot an individual run. (B) Schematic illustrating procedure for training over different recordings with SPARKS. The 388
multisession model is obtained by learning a different Hebbian attention layer for each session while keeping the rest of 389
the network shared across sessions. (C) Decoding accuracy of models trained on all sessions and tested on individual 390
sessions versus models trained and tested on single sessions. Each value represents the mean over 5 individual runs. (D) 391
Two dimensional embeddings corresponding to different sessions obtained from the multi-session model. (E) Frame 392
decoding accuracy as a function of model size. Orange lines represent the median over 5 runs, and each dot an individual 393
run. (F) Frame decoding accuracy as a function of finetuning epochs. Green trace corresponds to the pre-trained 394
dataset. Blue trace shows accuracy for a newly introduced dataset to an existing model. Shaded areas represent standard 395
deviation over 5 pre-trained models. 396
397
Despite recent developments [5], [41], obtaining stable chronic electrophysiological recordings in mice 398
remains very challenging due to technical limitations such as tissue inflammation and positional changes of 399
the electrodes over time. As a result, researchers often rely on acute recordings. Combining such recordings 400
is common practice and particularly relevant when the brain region of interest allows for the recording of 401
only a few neurons at a time [35], [42]. This approach, however, has the additional confounding effect of 402
masking differences in animal behaviour and state across trials and sessions, even when experimental 403
conditions and stimulation protocols are carefully controlled [43], [44]. Here, we illustrate how SPARKS can 404
tackle this challenge in the task of predicting frames IDs from the Allen Visual dataset. We first trained 405
models using all the neurons from 5 randomly selected sessions, each recorded within a single block. We 406
found a high variance in the predictive accuracy across sessions (Figure 5A). This effect is not explained by 407
the number of neurons in each recording (Figure S 4A, small negative correlation 𝑟 = −0.64, 𝑝 > 0.1), 408
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
reinforcing the idea that not all recordings from the same brain region are equally informative. Interestingly, 409
despite the high variance in the predictive power across sessions, the embeddings produced by the model 410
remain highly consistent, as measured by the predictive 𝑅' obtained after fitting a linear prediction model 411
between the embeddings produced from different recordings (𝑅' > 99% across all pairs, Figure S 4E). To 412
demonstrate robust biological findings, it is necessary to compare embeddings across sessions within the 413
same latent space, which may require extensive work when training the model on single sessions. As 414
discussed, stacking recordings to create pseudo-recordings with hundreds or thousands of units can mitigate 415
this issue. However, it is not always practical to do so, especially for tasks where trials vary in length due to 416
the varying response times of the animals. Combining recordings also ignores varying behavioural and 417
cognitive animal states, such as level of arousal. 418
419
To overcome this issue, SPARKS can be employed to encode data from several sessions within the same 420
latent space. This is achieved by learning a different Hebbian attention layer for each session while keeping 421
the rest of the network shared across sessions (Figure 5B). When applying this procedure to the same 5 422
sessions from the Allen visual dataset, we find that the resulting model maintains predictive accuracy similar 423
to that obtained when training on single sessions, with a very high positive correlation between the two 424
(Figure 5C, 𝑟 = 0.94, 𝑝 < 0.01). Furthermore, we find that the resulting embeddings from each session are 425
closely related (Figure 5D). This demonstrates the capability of our method to find a common latent space 426
which reliably represents the encoding of this natural stimulus within the primary visual cortex. Our results 427
are scalable across model dimensions, which remain small compared to modern LMM architectures (Figure 428
5E, [45], [46], [47]), hence offering the possibility to train our model without access to the latest GPU 429
architectures. 430
431
Finally, analysing new sessions from a previously trained model is straightforward via addition of a new 432
Hebbian attention layer and finetuning (Figure 5F). We find that the model reaches its final accuracy on the 433
new session within ~10 finetuning epochs, with no loss of accuracy when tested on the sessions it was pre-434
trained on, and reaches similar accuracy to that of a model trained on the single additional session (Figure S 435
4G). Overall, SPARKS is seen to offer an efficient, powerful and scalable solution to the analysis of neural 436
data across sessions and animals by bringing within the same latent space representations from different 437
datasets. This can then be used to explore how, for example, behavioural variables change the embeddings 438
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
across mice. 439
440
Uncovering the Functional Hierarchy of Mouse Visual Cortical 441
Areas 442
443
Figure 6 SPARKS recovers the hierarchy of the visual cortex. (A) Neuropixels recordings in passive mice viewing static 444
gratings. (B) Organisation of mice visual cortical areas. (C) Evolution of the average spatial frequency prediction error 445
over time for each visual cortical area. Lines represent average over 5 individual runs. (D) Average prediction error for 446
each visual cortical area and spatial frequency. (E) Evolution of the prediction error over time for each visual cortical 447
area and spatial frequency. Lines represent average over 5 individual runs, shaded areas represent standard deviations. 448
(F) Illustration of the computation of projection scores between visual cortical areas. (G) Projection scores between all 449
visual cortical areas. (H) Functional hierarchy obtained with SPARKS, calculated by computing the difference between 450
their incoming and outgoing projections. 451
452
An important capability of our model is its ability to leverage the temporal dynamics of neuronal data [48] to 453
reveal biological insights that can extend to large neuronal networks. To illustrate this, we studied multi-454
region interactions from Neuropixels recordings obtained from the mouse primary visual cortex (VISp) and 455
five higher-order visual cortical areas (HVAs), latero-medial area (LM), antero-lateral area (AL), rostro-456
lateral area (RL), postero-medial area (PM) and antero-medial (AM) from the Allen visual coding dataset 457
(Figure 6A-B) [49]. We trained separate models using 700 neurons recorded in each HVA from awake, head-458
fixed mice passively presented with static gratings with various spatial frequencies. We trained SPARKS to 459
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
predict the spatial frequency of the grating presented to the animals at each time-step with a temporal 460
resolution of 1 millisecond. We focused on spatial frequency because the time course of responses in the 461
primary visual cortex of rodents and primates has been linked to spatial frequency preference, consistent with 462
the coarse-to-fine theory of visual processing [50], [51], [52]. We reasoned that the temporal encoding of 463
spatial frequency across HVAs would be informative of the hierarchical organization of the visual system. As 464
seen in Figure 6C, the models trained using data from all HVAs attain high accuracy across spatial 465
frequencies, with a final average error lower than 0.05 cycles/degree. All of the models start their prediction 466
approximately 50ms from stimulus presentation, but show different dynamics, with VISp reaching its highest 467
accuracy first. We found that models trained on data from visual areas that are higher in the functional 468
hierarchy [49], [53] demonstrate slower dynamics but attain higher accuracies. This general picture hides 469
differences in the areal response to various spatial frequencies: for instance, VISp achieves the lowest error of 470
all areas only for the lowest spatial frequency (Figure 6D). There are also large differences in how quickly 471
gratings are encoded across areas and spatial frequencies; for example, area VISrl provides a very fast but 472
poor estimation of high spatial frequencies, while being more accurate but slower for medium spatial 473
frequencies (Figure 6E). 474
475
To uncover dynamic interactions between the different HVAs during bottom-up visual stimulation, we 476
formulated the following hypothesis: an area that generates an accurate representation of a given spatial 477
frequency will selectively transmit this information to the other areas it projects to, provided that these areas 478
have a higher prediction error for the same spatial frequency. In order to remain agnostic and avoid 479
assumptions about connectivity between these areas, we assume that all HVAs are densely connected to each 480
other, and for each pair of areas (𝑖, 𝑗), we computed a projection score from area 𝑖 to area 𝑗 as the differential 481
integral between the temporal prediction between HVAs 𝑖 and 𝑗 (Figure 6E). Considering only positive 482
interactions, we averaged across spatial frequencies and obtained a matrix of projection scores (Figure 6F, 483
Methods). Similarly to previous works [49], [54], we found that VISp is the area with most outgoing 484
projections, while area VISam receives the most inputs from other regions. We then computed hierarchy 485
scores for each area, defined as the difference between incoming projections to the area and outgoing 486
projections from that area. The functional hierarchy we obtained from the temporal encoding of spatial 487
frequencies (Figure 6H) matches the anatomical and functional organization of mouse cortical visual areas 488
previously described [49], [53], demonstrating SPARKS’ capability to account for the temporal dynamics of 489
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
neuronal signals to reveal large-scale network interactions. 490
491
Discussion
492
Motivated by the advancement of large-scale neuronal recordings and generative machine learning 493
techniques, we introduced SPARKS, a biologically inspired attention model for the analysis of neuronal 494
signals. This new technique can be used to obtain low dimensional embeddings that meaningfully represent 495
the information content from high dimensional data with minimum pre-processing. SPARKS addresses 496
previous limitations by catering to the sequential nature of neuronal signals, and processing information at the 497
single spike level with no need for data processing beyond spike sorting. The high predictive power and 498
latent embeddings obtained with our model allowed us to extract meaningful biological insights from existing 499
data, as validated by revealing temporal dynamics across brain regions, such as the MEC, somatosensory and 500
visual cortex during passive and active behaviours and across different recording techniques. 501
502
The core innovation in SPARKS’ architecture lies within the Hebbian attention layer, which adapts the highly 503
effective attention mechanism to process spiking signals. Implemented within Transformer architectures, 504
attention is at the centre of recent AI successes, with the development of highly performing large language 505
models (LLMs) and large multimodal models (LMMs) [45], [46], [47], [18]. Attention mechanisms 506
selectively focus on relevant input data, and by assigning weights to relevant inputs, emphasise important 507
features while suppressing irrelevant ones. This enables models to process information more efficiently and 508
effectively, improving performance in various tasks. Conventional attention has been proposed as a model of 509
sparse distributed memory, highlighting the biological relevance of the approach [55]. By taking the inverse 510
approach to derive an attention mechanism from a biological principle, we achieved state-of-the-art 511
performance on various benchmarks, demonstrating the relevance of using the synergies between AI systems 512
development and our biological understanding of intelligence. This is reinforced by the proposed learning 513
Objective
using predictive causally conditioned distributions, inspired by predictive coding in the brain, 514
which is key to obtaining highly reliable latent embeddings. This is in line with recent work highlighting the 515
benefits for LLMs to simultaneously predict several tokens [56]. 516
517
Our approach is further justified from the perspective of the Bayesian brain hypothesis, modelling the 518
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
autoencoder architecture as a Bayesian encoding-decoding machine [57]. The neural signal measured and 519
forming the inputs 𝒙 ∼ 𝑝(𝒙|𝒛, 𝒓) to our model depends on both the animal’s internal generative model of the 520
world 𝑝(𝒛|𝒓), and its sensory inputs 𝒓. In our model, the encoder performs the approximate posterior 521
inference 𝑝, (𝒛|𝒙, 𝒓) given the neural response to the current sensory inputs and parametric weights 𝒘. Due 522
to the noise inherent to neural coding and the probabilistic nature of the encoding 𝒛, the decoding network is 523
a Bayesian decoder tasked to estimate 𝑞, (𝒓|𝒛, 𝒙). By training encoding and decoding networks together, we 524
obtain latent embeddings that are highly consistent across animals and sessions. This aligns with the Bayesian 525
brain hypothesis that the brain’s internal model is calibrated through evolution and exposure to natural 526
stimuli, and evolves slowly during adulthood. We suggest that our model provides an estimate of the animal’s 527
latent generative model of the world, providing a new milestone towards a foundation model for neuroscience 528
[58]. 529
530
Although the memory requirements of the Hebbian attention layer scale quadratically with the number of 531
neurons in the recording, we demonstrated results considering up to 1,000 neurons trained on a single mid-532
range GPU. By approximating the costly backpropagation-through-time algorithm (Methods), our method is 533
agnostic to the recording duration in terms of memory requirements. As a result, SPARKS will be suitable for 534
analysing electrophysiological data as the scale of these recordings increases with technological 535
advancements. However, the technique cannot yet scale to the latest imaging datasets, comprising up to 536
1,000,000 neurons recorded simultaneously. A limitation of our approach is the computation of attention 537
coefficients between each neuron in the recording, which is not in line with the sparse connectivity measured 538
in the brain. To tackle the challenge of analysing such data, future research directions notably aim to 539
incorporate connectivity data into the model, and to use recent sparse approximations of traditional attention 540
mechanisms [59]. 541
542
References
543
544
[1] N. A. Steinmetz et al., ‘Neuropixels 2.0: A miniaturized high-density probe for stable, long-term brain 545
recordings’, Science, vol. 372, no. 6539, p. eabf4588, Apr. 2021, doi: 10.1126/science.abf4588. 546
[2] C.-H. Yu, J. N. Stirman, Y. Yu, R. Hira, and S. L. Smith, ‘Diesel2p mesoscope with dual independent 547
scan engines for flexible capture of dynamics in distributed neural circuitry’, Nat. Commun., vol. 12, no. 548
1, p. 6639, Nov. 2021, doi: 10.1038/s41467-021-26736-4. 549
[3] J. Demas et al., ‘High-speed, cortex-wide volumetric recording of neuroactivity at cellular resolution 550
using light beads microscopy’, Nat. Methods, vol. 18, no. 9, pp. 1103–1111, Sep. 2021, doi: 551
10.1038/s41592-021-01239-8. 552
[4] T. Nöbauer, Y. Zhang, H. Kim, and A. Vaziri, ‘Mesoscale volumetric light-field (MesoLF) imaging of 553
neuroactivity across cortical areas at 18 Hz’, Nat. Methods, vol. 20, no. 4, pp. 600–609, Apr. 2023, doi: 554
10.1038/s41592-023-01789-z. 555
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
[5] A. Yuan, J. Colonell, A. Lebedeva, M. Okun, A. Charles, and T. Harris, ‘Multi-day Neuron Tracking in 556
High Density Electrophysiology Recordings using EMD’, eLife, vol. 12, Dec. 2023, doi: 557
10.7554/eLife.92495.1. 558
[6] A. E. Urai, B. Doiron, A. M. Leifer, and A. K. Churchland, ‘Large-scale neural recordings call for new 559
insights to link brain and behavior’, Nat. Neurosci., vol. 25, no. 1, pp. 11–19, Jan. 2022, doi: 560
10.1038/s41593-021-00980-9. 561
[7] M. Safaie et al., ‘Preserved neural dynamics across animals performing similar behaviour’, Nature, vol. 562
623, no. 7988, pp. 765–771, Nov. 2023, doi: 10.1038/s41586-023-06714-0. 563
[8] J. A. Gallego, M. G. Perich, S. N. Naufel, C. Ethier, S. A. Solla, and L. E. Miller, ‘Cortical population 564
activity within a preserved neural manifold underlies multiple motor behaviors’, Nat. Commun., vol. 9, 565
no. 1, p. 4233, Oct. 2018, doi: 10.1038/s41467-018-06560-z. 566
[9] B. Agüera y Arcas, A. L. Fairhall, and W. Bialek, ‘Computation in a single neuron: Hodgkin and Huxley 567
revisited’, Neural Comput., vol. 15, no. 8, pp. 1715–1749, Aug. 2003, doi: 568
10.1162/08997660360675017. 569
[10] B. A. Moser and M. Lunglmayr, ‘Spiking Neural Networks in the Alexiewicz Topology: A New 570
Perspective on Analysis and Error Bounds’, May 09, 2023, arXiv: arXiv:2305.05772. doi: 571
10.48550/arXiv.2305.05772. 572
[11] C. Stringer, M. Pachitariu, N. Steinmetz, C. B. Reddy, M. Carandini, and K. D. Harris, ‘Spontaneous 573
behaviors drive multidimensional, brainwide activity’, Science, vol. 364, no. 6437, p. eaav7893, Apr. 574
2019, doi: 10.1126/science.aav7893. 575
[12] J. C. Marques, M. Li, D. Schaak, D. N. Robson, and J. M. Li, ‘Internal state dynamics shape brainwide 576
activity and foraging behaviour’, Nature, vol. 577, no. 7789, pp. 239–243, Jan. 2020, doi: 577
10.1038/s41586-019-1858-z. 578
[13] M. R. Cohen and J. H. R. Maunsell, ‘A Neuronal Population Measure of Attention Predicts Behavioral 579
Performance on Individual Trials’, J. Neurosci., vol. 30, no. 45, pp. 15241–15253, Nov. 2010, doi: 580
10.1523/JNEUROSCI.2171-10.2010. 581
[14] M. M. Churchland et al., ‘Neural population dynamics during reaching’, Nature, vol. 487, no. 7405, pp. 582
51–56, Jul. 2012, doi: 10.1038/nature11129. 583
[15] C. K. Machens, R. Romo, and C. D. Brody, ‘Functional, But Not Anatomical, Separation of “What” and 584
“When” in Prefrontal Cortex’, J. Neurosci., vol. 30, no. 1, pp. 350–360, Jan. 2010, doi: 585
10.1523/JNEUROSCI.3276-09.2010. 586
[16] S. Schneider, J. H. Lee, and M. W. Mathis, ‘Learnable latent embeddings for joint behavioral and neural 587
analysis’, Oct. 05, 2022, arXiv: arXiv:2204.00673. doi: 10.48550/arXiv.2204.00673. 588
[17] A. Vaswani et al., ‘Attention is All you Need’. 589
[18] S. Yin et al., ‘A Survey on Multimodal Large Language Models’, Apr. 01, 2024, arXiv: 590
arXiv:2306.13549. Accessed: Jul. 09, 2024. [Online]. Available: http://arxiv.org/abs/2306.13549 591
[19] M. Azabou et al., ‘A Unified, Scalable Framework for Neural Population Decoding’, Oct. 24, 2023, 592
arXiv: arXiv:2310.16046. doi: 10.48550/arXiv.2310.16046. 593
[20] D. P. Kingma and M. Welling, ‘Auto-Encoding Variational Bayes’, Dec. 10, 2022, arXiv: 594
arXiv:1312.6114. doi: 10.48550/arXiv.1312.6114. 595
[21] A. Schulz et al., ‘Modeling conditional distributions of neural and behavioral data with masked 596
variational autoencoders’, Apr. 25, 2024, bioRxiv. doi: 10.1101/2024.04.19.590082. 597
[22] G. Kramer, ‘Directed information for channels with feedback’, Doctoral Thesis, ETH Zurich, 1998. doi: 598
10.3929/ethz-a-001988524. 599
[23] N. Skatchkovsky, O. Simeone, and H. Jang, ‘Learning to Time-Decode in Spiking Neural Networks 600
Through the Information Bottleneck’, in Advances in Neural Information Processing Systems, Curran 601
Associates, Inc., 2021, pp. 17049–17059. Accessed: Dec. 04, 2023. [Online]. Available: 602
https://proceedings.neurips.cc/paper/2021/hash/8da57fac3313174128cc5f13328d4573-Abstract.html 603
[24] M. Yao et al., ‘Attention Spiking Neural Networks’, Sep. 28, 2022, arXiv: arXiv:2209.13929. Accessed: 604
Jul. 05, 2023. [Online]. Available: http://arxiv.org/abs/2209.13929 605
[25] Z. Zhou et al., ‘SPIKFORMER: WHEN SPIKING NEURAL NETWORK MEETS TRANSFORMER’, 606
2023. 607
[26] R.-J. Zhu, Q. Zhao, G. Li, and J. K. Eshraghian, ‘SpikeGPT: Generative Pre-trained Language Model 608
with Spiking Neural Networks’, Jun. 26, 2023, arXiv: arXiv:2302.13939. Accessed: Jul. 05, 2023. 609
[Online]. Available: http://arxiv.org/abs/2302.13939 610
[27] D. O. Hebb, The organization of behavior; a neuropsychological theory. in The organization of 611
behavior; a neuropsychological theory. Oxford, England: Wiley, 1949, pp. xix, 335. 612
[28] R. C. Malenka and and R. A. Nicoll, ‘Long-Term Potentiation--A Decade of Progress?’, Science, vol. 613
285, no. 5435, pp. 1870–1874, Sep. 1999, doi: 10.1126/science.285.5435.1870. 614
[29] G. Q. Bi and M. M. Poo, ‘Synaptic modifications in cultured hippocampal neurons: dependence on spike 615
timing, synaptic strength, and postsynaptic cell type’, J. Neurosci. Off. J. Soc. Neurosci., vol. 18, no. 24, 616
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
pp. 10464–10472, Dec. 1998, doi: 10.1523/JNEUROSCI.18-24-10464.1998. 617
[30] G. Buzsáki and E. I. Moser, ‘Memory, navigation and theta rhythm in the hippocampal-entorhinal 618
system’, Nat. Neurosci., vol. 16, no. 2, pp. 130–138, Feb. 2013, doi: 10.1038/nn.3304. 619
[31] S. Gonzalo Cogno et al., ‘Minute-scale oscillatory sequences in medial entorhinal cortex’, Nature, vol. 620
625, no. 7994, pp. 338–344, Jan. 2024, doi: 10.1038/s41586-023-06864-1. 621
[32] R. H. Chowdhury, J. I. Glaser, and L. E. Miller, ‘Area 2 of primary somatosensory cortex encodes 622
kinematics of the whole arm’, eLife, vol. 9, p. e48198, Jan. 2020, doi: 10.7554/eLife.48198. 623
[33] F. Pei et al., ‘Neural Latents Benchmark ’21: Evaluating latent variable models of neural population 624
activity’, Jan. 17, 2022, arXiv: arXiv:2109.04463. doi: 10.48550/arXiv.2109.04463. 625
[34] J. Manley et al., ‘Simultaneous, cortex-wide dynamics of up to 1 million neurons reveal unbounded 626
scaling of dimensionality with neuron number’, Neuron, vol. 112, no. 10, pp. 1694-1709.e5, May 2024, 627
doi: 10.1016/j.neuron.2024.02.011. 628
[35] S. E. J. de Vries et al., ‘A large-scale standardized physiological survey reveals functional organization 629
of the mouse visual cortex’, Nat. Neurosci., vol. 23, no. 1, pp. 138–151, Jan. 2020, doi: 10.1038/s41593-630
019-0550-9. 631
[36] H. Lütcke, D. J. Margolis, and F. Helmchen, ‘Steady or changing? Long-term monitoring of neuronal 632
population activity’, Trends Neurosci., vol. 36, no. 7, pp. 375–384, Jul. 2013, doi: 633
10.1016/j.tins.2013.03.008. 634
[37] T. D. Marks and M. J. Goard, ‘Stimulus-dependent representational drift in primary visual cortex’, Nat. 635
Commun., vol. 12, no. 1, p. 5169, Aug. 2021, doi: 10.1038/s41467-021-25436-3. 636
[38] L. N. Driscoll, N. L. Pettit, M. Minderer, S. N. Chettih, and C. D. Harvey, ‘Dynamic Reorganization of 637
Neuronal Activity Patterns in Parietal Cortex’, Cell, vol. 170, no. 5, pp. 986-999.e16, Aug. 2017, doi: 638
10.1016/j.cell.2017.07.021. 639
[39] S. Sadeh and C. Clopath, ‘Contribution of behavioural variability to representational drift’, eLife. 640
Accessed: Jul. 08, 2024. [Online]. Available: https://elifesciences.org/articles/77907 641
[40] D. Deitch, A. Rubin, and Y. Ziv, ‘Representational drift in the mouse visual cortex’, Curr. Biol., vol. 31, 642
no. 19, pp. 4327-4339.e6, Oct. 2021, doi: 10.1016/j.cub.2021.07.062. 643
[41] E. H. van Beest et al., ‘Tracking neurons across days with high-density probes’, Oct. 16, 2023, bioRxiv. 644
doi: 10.1101/2023.10.12.562040. 645
[42] N. A. Steinmetz, P. Zatka-Haas, M. Carandini, and K. D. Harris, ‘Distributed coding of choice, action 646
and engagement across the mouse brain’, Nature, vol. 576, no. 7786, pp. 266–273, Dec. 2019, doi: 647
10.1038/s41586-019-1787-x. 648
[43] A. Lak et al., ‘Reinforcement biases subsequent perceptual decisions when confidence is low, a 649
widespread behavioral phenomenon’, eLife, vol. 9, p. e49834, Apr. 2020, doi: 10.7554/eLife.49834. 650
[44] S. Musall, M. T. Kaufman, A. L. Juavinett, S. Gluf, and A. K. Churchland, ‘Single-trial neural dynamics 651
are dominated by richly varied movements’, Nat. Neurosci., vol. 22, no. 10, pp. 1677–1686, Oct. 2019, 652
doi: 10.1038/s41593-019-0502-4. 653
[45] H. Touvron et al., ‘LLaMA: Open and Efficient Foundation Language Models’, Feb. 27, 2023, arXiv: 654
arXiv:2302.13971. doi: 10.48550/arXiv.2302.13971. 655
[46] M. Enis and M. Hopkins, ‘From LLM to NMT: Advancing Low-Resource Machine Translation with 656
Claude’, Apr. 21, 2024, arXiv: arXiv:2404.13813. doi: 10.48550/arXiv.2404.13813. 657
[47] OpenAI et al., ‘GPT-4 Technical Report’, Mar. 04, 2024, arXiv: arXiv:2303.08774. doi: 658
10.48550/arXiv.2303.08774. 659
[48] W. Truccolo, L. R. Hochberg, and J. P. Donoghue, ‘Collective dynamics in human and monkey 660
sensorimotor cortex: predicting single neuron spikes’, Nat. Neurosci., vol. 13, no. 1, pp. 105–111, Jan. 661
2010, doi: 10.1038/nn.2455. 662
[49] J. H. Siegle et al., ‘Survey of spiking in the mouse visual system reveals functional hierarchy’, Nature, 663
vol. 592, no. 7852, pp. 86–92, Apr. 2021, doi: 10.1038/s41586-020-03171-x. 664
[50] J. Hegdé, ‘Time course of visual perception: Coarse-to-fine processing and beyond’, Prog. Neurobiol., 665
vol. 84, no. 4, pp. 405–439, Apr. 2008, doi: 10.1016/j.pneurobio.2007.09.001. 666
[51] P. R. L. Parker et al., ‘A dynamic sequence of visual processing initiated by gaze shifts’, Nat. Neurosci., 667
vol. 26, no. 12, pp. 2192–2202, Dec. 2023, doi: 10.1038/s41593-023-01481-7. 668
[52] S. Vreysen, B. Zhang, Y. M. Chino, L. Arckens, and G. Van den Bergh, ‘Dynamics of spatial frequency 669
tuning in mouse visual cortex’, J. Neurophysiol., vol. 107, no. 11, pp. 2937–2949, Jun. 2012, doi: 670
10.1152/jn.00022.2012. 671
[53] J. A. Harris et al., ‘Hierarchical organization of cortical and thalamic connectivity’, Nature, vol. 575, no. 672
7781, pp. 195–202, Nov. 2019, doi: 10.1038/s41586-019-1716-z. 673
[54] Q. Wang, O. Sporns, and A. Burkhalter, ‘Network analysis of corticocortical connections reveals ventral 674
and dorsal processing streams in mouse visual cortex’, J. Neurosci. Off. J. Soc. Neurosci., vol. 32, no. 675
13, pp. 4386–4399, Mar. 2012, doi: 10.1523/JNEUROSCI.6063-11.2012. 676
[55] T. Bricken and C. Pehlevan, ‘Attention Approximates Sparse Distributed Memory’, Jan. 17, 2022, 677
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
arXiv: arXiv:2111.05498. doi: 10.48550/arXiv.2111.05498. 678
[56] F. Gloeckle, B. Y. Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve, ‘Better & Faster Large Language 679
Models via Multi-token Prediction’, Apr. 30, 2024, arXiv: arXiv:2404.19737. doi: 680
10.48550/arXiv.2404.19737. 681
[57] R. D. Lange, S. Shivkumar, A. Chattoraj, and R. M. Haefner, ‘Bayesian encoding and decoding as 682
distinct perspectives on neural coding’, Nat. Neurosci., vol. 26, no. 12, Art. no. 12, Dec. 2023, doi: 683
10.1038/s41593-023-01458-6. 684
[58] Y. Zhang et al., ‘Towards a “universal translator” for neural dynamics at single-cell, single-spike 685
resolution’, Jul. 23, 2024, arXiv: arXiv:2407.14668. doi: 10.48550/arXiv.2407.14668. 686
[59] M. Zaheer et al., ‘Big Bird: Transformers for Longer Sequences’, Jan. 08, 2021, arXiv: 687
arXiv:2007.14062. doi: 10.48550/arXiv.2007.14062. 688
[60] O. Simeone, ‘Machine Learning for Engineers’, Higher Education from Cambridge University Press. 689
Accessed: Jul. 11, 2024. [Online]. Available: 690
https://www.cambridge.org/highereducation/books/machine-learning-for-691
engineers/7FD8622836CAFCF5EDB169E7DC8A1ED4 692
[61] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, ‘Deep Variational Information Bottleneck’, Oct. 693
23, 2019, arXiv: arXiv:1612.00410. doi: 10.48550/arXiv.1612.00410. 694
[62] H. Jang, O. Simeone, B. Gardner, and A. Gruning, ‘An Introduction to Probabilistic Spiking Neural 695
Networks: Probabilistic Models, Learning Rules, and Applications’, IEEE Signal Process. Mag., vol. 36, 696
no. 6, pp. 64–77, Nov. 2019, doi: 10.1109/MSP.2019.2935234. 697
[63] D. P. Kingma and J. Ba, ‘Adam: A Method for Stochastic Optimization’, Jan. 29, 2017, arXiv: 698
arXiv:1412.6980. doi: 10.48550/arXiv.1412.6980. 699
700
701
Methods
702
Notations 703
For any two jointly distributed random vectors 𝒂 = (𝒂) , … , 𝒂-) and 𝑏 = (𝒃) , … , 𝒃-) from time 1 to 𝑇, at 704
every time-instant 𝑡, we introduce the predictive causally conditioned (PCC) distribution of 𝒂 given 𝒃 as 705
𝑝(/!,/")(𝒂! || 𝒃!) = 𝑝W𝒂!
!2/" X𝒂!( ) , 𝒃!( /!
! Y. 706
(1) 707
This distribution captures the causal dependence of the 𝜏' future samples of RV 𝒂!
!2/" = (𝒂!, … , 𝒂!2/" ) on 708
the past 𝜏) samples of RV 𝒃!( /"
! = (𝒃!( /" , … , 𝒃!) . If 𝜏) = 𝑇 and 𝜏' = 1, this recovers the causally 709
conditioned distribution [22]. 710
711
Problem definition 712
We study the autoencoder architecture in Figure 1A with the aim of training an encoding network to produce 713
a representation 𝒛 of the neural input 𝒙 that is informative about a reference signal 𝒓. The input 𝒙 is a neural 714
signal obtained using e.g. electrophysiology or imaging techniques. The reference signal 𝒓 represents the 715
corresponding target, such as the label corresponding to the behaviour of interest; a set of behavioural 716
variables; or the input itself for unsupervised learning. 717
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
718
To simultaneously optimise encoder and decoder, we consider a data set comprising pairs (𝒙, 𝒓) of neural 719
inputs 𝒙 and reference signals 𝒓, jointly generated from an unknown population distribution 𝑝(𝒙, 𝒓). The 720
inputs are in the form of a discrete collection of vectors over time 𝒙 = (𝒙) , … , 𝒙-) with 𝒙! ∈ ℝ", and the 721
Reference
signals 𝒓 = (𝒓) , … , 𝒓-), with 𝒓! ∈ ℝ"#, define general target signals. Without loss of generality, 722
we assume that all signals are uniformly sampled with a fixed sampling rate 𝑑𝑡. 723
724
As seen in Figure 1A, the latent representation 𝒛 is obtained by processing the input signal 𝒙 through the 725
Hebbian encoder introduced in the main text and for which we provide mathematical details in the following. 726
Given 𝒛, the system is trained to predict signal 𝒓 using a decoding network, implemented as an ANN. 727
Following ideas from predictive coding, we maximise the causally conditioned distribution of 𝜏+ future 728
samples from the reference signal 𝒓!
!2/$
= (𝒓!, … , 𝒓!2/$ ) given past input spikes 𝒙! = (𝒙) , … , 𝒙!), averaged 729
over time-steps 𝑡 = 1, … , 𝑇 as 730
1
𝑇 3 log 𝑝(𝒓𝑡
𝑡+𝜏𝑓 ||𝒙𝑡)
%
&'(
= 1
𝑇 3 log 𝛦𝒛! 𝑝(𝒓&
&*+"
|𝒛&, 𝒙&, 𝒓&, ( ) .
%
&'(
731
(2) 732
Note that the evaluation of objective (2) is intractable for problems of practical size (see, e.g., [60]) and, as 733
we detail in the following sections, we optimise over the model weights by considering a variational 734
approximation. 735
736
Encoding Network 737
Given the architecture proposed in Figure 1A, the encoder implements a causal stochastic mapping between 738
input 𝒙 and latent signal 𝒛 that depends on a vector of parameters 𝒘7 as 739
𝑝𝒘& (𝒛||𝒙) = Z 𝑝𝒘& (𝒛! ||𝒙!)
-
!9)
, 740
where the distribution 𝑝𝒘𝒆 (𝒛𝒕||𝒙𝒕) is obtained through the reparametrisation trick as 741
𝑝𝒘𝒆 (𝒛𝒕||𝒙) ∼ 𝒩 \𝝁W𝒇𝒘& (𝒙!)Y, 𝝈'W𝒇𝒘& (𝒙!)Y`, 742
with 𝒇𝒘𝒆 (⋅) being the output of the encoding network. Samples are obtained as 𝒛 = 𝝁W𝒇𝒘& (𝒙)Y +743
𝝈'W𝒇𝒘& (𝒙)Y ⊙ 𝛜 with 𝛜 ∼ 𝒩(0, 1). The latent signal 𝒛 is fed to a decoder, whose role is to predict the 744
Reference
signal 𝒓. Both blocks 𝜇(⋅) and 𝜎(⋅) are implemented as a single linear layer. 745
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
746
Maximum likelihood optimisation 747
In order to address the maximisation of likelihood (2), we adopt a variational formulation that relies on the 748
Introduction
of a decoding network that implements at every time-step 𝑡 = 1, … , 𝑇 the PCC distribution 749
𝑞-#
(+$,+")
(𝒓& || 𝒛&) = 𝑞𝒘# @𝒓&
& *+"
|𝒓& , ( , 𝒛&, +$
& A 750
(3) 751
between latent representation 𝒛 and reference signal 𝒓 [23], [61]. The decoder (3) is causal, with memory 752
given by integer 𝜏* > 1 and is parametrised by a vector 𝒘;. 753
In practice, we implement the decoder as a model 𝑞𝒘( \𝒓!
! 2/$
g𝒓! – ) , 𝒛!( /)
! ` : = 𝑞𝒘( (𝒓!
!2/$
| 𝒛!( /)
! ), where 754
𝑞𝒘( (𝒓!
!2/$
| 𝒛!( /)
! ) is implemented by the output layer of an ANN with inputs given by a window 𝒛!( /)
! of 755
samples from the latent representation 𝒛. Alternatively, one can use an RNN or a probabilistic SNN [62] to 756
directly model the kernel 𝑞𝒘( \𝒓!
! 2/$
g𝒓! – ) , 𝒛!( /)
! `. 757
With such a network, we use a standard variational inequality [60] to lower bound the decoder log-likelihood 758
at each step 𝑡 = 1, … , 𝑇 759
log 𝑝W𝒓=
/$
X|𝒙=Y ≥ 𝔼*𝒘& >𝒛+,-)
+ | 𝒙+B[log 𝑞𝒘(
(/),/$)
(𝒓! ||𝒛!)] + KL(𝑝𝒘& (𝒛! | 𝒙!) || 𝑝(𝒛)). 760
Averaging over time-steps, we obtain the variational predictive likelihood 761
ℒ𝒘(𝒙, 𝒓) = 𝔼*𝒘&(𝒛||𝒙) s1
𝑇 t log 𝑞𝒘(
(/),/$)
(𝒓!||𝒛! )
-
!9)
u− 1
𝑇 KL(𝑝𝒘& (𝒛||𝒓) || 𝑝(𝒛)) 762
(4) 763
where 𝒘 = (𝒘7, 𝒘𝒅), that we maximise via stochastic gradient descent (SGD) and the Adam optimiser [63] 764
using Monte Carlo gradients. 765
766
The gradient with respect to the encoder and decoder weights can be obtained via the stochastic gradient 767
variational Bayes (SGVB) technique [20] as 768
∇𝒘ℒ𝒘(𝒙, 𝒓) = 1
𝑇 t 𝔼*(𝒙+,𝒓𝒕/𝝉𝒇)𝔼*(𝝐) w∇𝒘log 𝑞𝒘( \𝒓𝒕
𝒕2𝝉𝒇
g𝝁W𝒇𝒘𝒆 (𝒙𝒕)Y + 𝝈'W𝒇𝒘𝒆 (𝒙𝒕)Y ⊙ 𝛜 `x
-
!9)
. 769
(5) 770
Unbiased estimates of the gradients can be evaluated by using mini-batches ℬ ⊆ 𝒟 from the data set 𝒟, 771
along with random samples 𝛜(G) ∼ 𝒩(0, 𝐼"2 ) for 1 ≤ 𝑙 ≤ 𝐿 with 𝐿 ≥ 1. The gradients ∇𝒘 w.r.t. the 772
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
encoder and decoder weights are computed using backpropagation and automatic differentiation tools. 773
774
Online optimisation 775
The computation of gradients (5) is expensive as it requires to apply the backpropagation-through-time 776
algorithm, which scales quadratically in terms of memory consumption with the number of time-steps. To 777
alleviate memory issues during optimisation when the recording duration is large, we propose to use the 778
following approximation to perform SGVB updates at every time-step 779
∇𝒘ℒ𝒘 \𝒙!, 𝒓𝒕
𝒕2𝝉𝒇
` ≈ 𝔼
*H𝒙+, 𝒓𝒕
𝒕/𝝉𝒇 I
𝔼*(𝝐) w∇𝒘log 𝑞𝒘( \𝒓𝒕
𝒕2𝝉𝒇
g𝝁W𝒇𝒘𝒆 (𝒙!)Y + 𝝈'W𝒇𝒘𝒆(𝒙!)Y ⊙ 𝛜 `x, 780
which ignores the contribution of derivatives from previous time-steps 𝑡J < 𝑡. 781
782
Hebbian Self-Attention 783
We propose to apply online STDP at every time-step 𝑡 to compute a matrix 𝑨! ∈ ℝ"3× "3 in which each 784
element 𝑎KL,! defines an attention coefficient between sequences 𝑖 and 𝑗. We consider an online STDP kernel, 785
whereby coefficients for each synapse (𝑖 → 𝑗)) MK,LM" are obtained by maintaining over time a pair of pre- and 786
post-synaptic eligibility traces 𝑎KL,!
NOP and 𝑎KL,!
NQR= where 𝑡 ∈ {1, … , 𝑇} and N is the number of recorded neurons, 787
Without loss of generality, we assume 𝑇 to be the temporal horizon of the task. When neurons on either side 788
of the synapse fire, the corresponding traces are updated by an amount determined by learnable weights 𝒘KL,!
NOP 789
and 𝒘KL,!
NQR=, and exponentially decay in the absence of a spike by a value determined by scalars 𝜏KL
NOP and 𝜏KL
NQR=. 790
The STDP coefficient 𝑎KL,! is potentiated (resp. decreased) by the value of the pre- (resp. post-) synaptic trace 791
when the post- (resp. pre-) synaptic neuron fires. Mathematically, we obtain the coefficient 𝑎KL,! as 792
𝑎KL,! ∶= 𝑎KL,!( ) + 𝑠K,! ⋅ 𝑎KL,!
NQR= − 𝑠L,! ⋅ 𝑎KL,!
NOP, 793
where 𝑠K,! ∈ {0, 1} denotes the state of neuron i at time t, and 𝑎KL,!
NOP and 𝑎KL,!
NQR= are pre- and post-synaptic traces 794
defined as 795
𝑎KL,!
NOP ∶= 𝑎KL,!( )
NOP ⋅ 1 − 𝑑𝑡
𝜏KL
NOP + 𝑤KL
NOP ⋅ 𝑑𝑡 ⋅ 𝑠K,! 796
and 797
798
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
𝑎KL,!
NQR= ∶= 𝑎KL,!( )
NQR= ⋅ 1 − 𝑑𝑡
𝜏KL
NQR= + 𝑤KL
NQR= ⋅ 𝑑𝑡 ⋅ 𝑠L,!. 799
800
All values 𝜏KL
NOP, 𝜏KL
NQR=, 𝑤KL
NOP and 𝑤KL
NQR= are learnable parameters for each synapse 𝑖 → 𝑗. 801
802
Finally, the output of the attention layer is obtained as 803
HebbianAttention(𝑨!, 𝑽) = 𝑨! . 𝑽, 804
where 𝑽 ∈ ℝ"3 × ;4 is a learnable weight matrix defined in analogy to values in conventional dot-product 805
attention [17]. 806
807
The attention mechanism can further be expanded to account for several versions of the parameter vectors 808
𝚻NOP,(S) ≔ \𝜏KL
NOP,(T)
`
KL
, 𝚻NQR=,(S) , 𝑾NOP,(S), 𝑾NQR=,(S), and 𝑽(S) for 𝑘 = 1, … , 𝐾, with 𝐾 being the number of 809
attention heads. The resulting multi-headed attention module allows the model to jointly attend to 810
information from different representation subspaces at different positions [17], and can be obtained as 811
MultiHeadedHebbianAttention! 812
= 𝑾𝒐
- Concat(HebbianAttention(𝑨!
() ), 𝑽() )), … , HebbianAttention(𝑨!
(V), 𝑽(V)), 813
where 𝑾𝒐 ∈ ℝV;4× V;4 is a learnable parameter and Concat(⋅) denotes the vertical concatenation of 814
matrices. 815
816
Calcium recordings 817
When considering calcium recording data, we propose to directly use the calcium traces of each neuron as 818
eligibility traces, which removes the heavy processing that comes with spike deconvolution. Mathematically, 819
given inputs 𝒙, we have 820
𝑎KL,!
NOP ∶= 𝒙K,!, 821
𝑎KL,!
NQR= ∶= 𝒙L ,! 822
and 823
𝑎KL,! ∶= 𝑎KL,!( ) + 𝑎KL,!
NQR= − 𝑎KL,!
NOP. 824
825
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
Mouse MEC dataset 826
Briefly, activity from MEC cells of head-fixed mice allowed to run on a rotating wheel in darkness was 827
recorded using either two-photon calcium imaging or Neuropixels probes [1]. We consider a single example 828
recording provided by the authors at https://github.com/soledadgcogno/Ultraslow-oscillatory-829
sequences/tree/main, which included 464 neurons recorded over 3600.51 seconds at a rate of 7.73 Hz. 830
Following the authors’ methodology [31], we downsampled the recording by a factor of 4 down to 1.93 Hz. 831
For the prediction task shown in Figure 2, we trained both SPARKS and another autoencoder with the same 832
architecture but implementing the alternative symmetric Hebbian rule on four segments lasting 500 seconds 833
each, the first starting at 600 seconds in the recording, and leaving one out for testing. Results are evaluated 834
numerically through the binary cross-entropy 835
BCE = t 𝑟W,! ⋅ log 𝑞𝒘( (𝑟! | 𝒛!( /)
! ) + (1 − 𝑟W,!) ⋅ log(1 − 𝑞𝒘( (𝑟! | 𝒛!( /)
! ))
W,!
. 836
We further evaluated the quality of the reconstructions as per the oscillatory score proposed in [31]. We show 837
in Figure S 2 the PSD of reconstructions, evaluated using Welch’s method with Hamming windows of 66 s 838
(128 bins of 516 ms in each window) and 50% of overlap between consecutive windows. The horizontal lines 839
correspond to the threshold used to evaluate the presence of oscillations [31], which amounts to nine times 840
the average value of the tail of the PSD after the first peak. 841
Next, we trained SPARKS on the whole recording, starting from 600 seconds in the session as we could not 842
find oscillatory sequences prior to that. From the 2-dimensional latent embeddings hence obtained, we 843
recovered a phase signal by computing at every time-step 𝑡 the angle between a segment joining the current 844
embedding 𝒛𝒕 to the centre of the ring topology, and another joining the embedding 𝒛! to a fixed direction 845
(here, the horizontal, Figure S 2). The phase signal we hence obtain is shown in Figure S 2. 846
We sorted neurons by ordering them according to their correlation to the phase signal. 847
Finally, we evaluate the quality of each sorting by population decoding of the sorted signals (Figure S 2) and 848
plotting their PSDs (Figure S 2). We obtain the population decoding 𝒙 of spiking signal 𝒙 as 849
𝑥! = 1
𝑁 t 𝒙K,!.
K
850
Similarly to the oscillatory score proposed in [31], we evaluate the quality of each sorting by measuring the 851
height of the peak above 0 in the PSD. 852
853
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
Macaque reaching task dataset 854
The dataset corresponds to a single session extracted from [32]. Briefly, electrophysiological recordings were 855
taken from Area 2 of the somatosensory cortex (S1) of a rhesus macaque during an eight-direction center-out 856
reaching task. We consider a session obtained from the Neural Latents Benchmark [33] to train our model. 857
We use 1 ms bin times, and consider recordings from −100 ms to 500 ms after movement onset. Similarly to 858
[33], we consider a 40 ms lag between neural data and movement to align trials. We train SPARKS by 859
considering an encoder with no conventional attention layers, and a 64-dimensional embedding. The decoder 860
is chosen as an MLP with layer dimensions 𝑑 × 𝜏* −
; × /)2 ' × /$
' − 𝑁X × 𝜏+, where 𝑁X corresponds to the 861
dimension of the reference signal 𝒓. Models were trained for 500 epochs. Results presented for unsupervised 862
learning were obtained by choosing 𝜏+ = 10. 863
The alternative model using conventional scaled dot-product attention operates using firing rates smoothed 864
over a 200 ms window, which was chosen after hyperparameter search. 865
Results
from [16] were shared by the authors at https://github.com/AdaptiveMotorControlLab/cebra-figures. 866
867
Allen visual coding dataset 868
The Allen visual coding dataset comprises two-photon calcium imaging and Neuropixels data recorded from 869
the primary visual cortex of mice during presentation of video scenes, and is available to download from the 870
Allen Institute [35]. We consider units sampled at random from all sessions in the Allen visual coding 871
dataset, and stack the raw spike trains to perform movie reconstruction, frame prediction or unsupervised 872
learning. 873
Each presentation of the excerpt from the movie Touch of Evil, performed at a rate of 30 kHz and lasting 874
30 s, were downsampled with a bin size of 6 ms to speed up training. Calcium recordings were kept to their 875
initial rate of 30 Hz, matching the film frame rate. For frame reconstruction, movie frames 𝐹 ∈ ℝYZZ × [ × , 876
were spatially downsampled to obtain film tensor 𝐹 ∈ ℝYZZ × [5× , 5
to reduce the computational load of the 877
technique. 878
The 1-second frame prediction accuracy was computed as 879
1
𝑇 t 1(argmaxW𝑞(𝑟! |𝑧!, 𝑥!)Y ∈ [𝑡 − 6, 𝑡 + 6]),
-
!9)
880
Where 1(⋅) is the indicator function that outputs 1 if its argument is true, and 0 otherwise; and the argmax 881
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
function outputs the largest index of its vector argument. 882
Distances between movie frames were computed as 𝑑𝑖𝑠𝑡W𝑓K, 𝑓LY = 1 − 𝑐𝑜𝑟𝑟(𝑓K, 𝑓L) for every pair of frames 883
W𝑓K, 𝑓L Y) MK,LMYZZ where 𝑐𝑜𝑟𝑟(⋅,⋅) is the 2-dimensional correlation normalised to [0, 1]. 884
Distances between the neural activity was computed from the tensor 𝑆 ∈ ℝ)Z × YZZ × )ZZZ × \\ of spikes over 885
the 10 movie repetitions, for the 900 frames and subsampled at 1kHz, corresponding to 33 time-steps per 886
frame. 887
The average activity per frame is computed as 𝑆+ = ∑ 𝑆W+S!W,! . Instantaneous distances between frames are 888
then computed as the 𝐿2 vector distance 𝑑𝑖𝑠𝑡 \𝑆+6
, 𝑆+]
` = ©∑ \𝑆+6S − 𝑆+7S `
'
S and normalised to [0, 1]. 889
890
Functional Hierarchy of Mouse Visual Cortical Areas 891
We consider static grating presentations from the Allen visual coding dataset. Gratings were presented for 892
250 ms in a random order, with spatial frequencies equal to 0.02, 0.04, 0.08, 0.16 and 0.32 cycles per 893
degree. We consider all presentations with a fixed orientation of 150 degrees, and all spatial phases, resulting 894
in 30 training examples and 10 test examples per spatial frequency. Examples were downsampled with a bin 895
size of 1 ms. For each repetition, we used 700 single unit recordings from each visual cortical area (VISp, 896
VISpm, VISam, VISl, VISal and VISrl), which corresponds to the lowest number of neurons collected in a 897
single area across recordings. 898
We train SPARKS via supervised learning to predict the spatial frequency of the gratings at every time-step, 899
i.e., for all 𝑡 = 1, … , 250, we set 𝑟! ∈ {0.02, 0.04, 0.08, 0.16, 0.32}. Error at time 𝑡 for spatial frequency 𝑓 900
and cortical area 𝑖 was computed as the L1 distance 901
𝑒K,!
+ = |𝑓 − 𝑞𝒘( (𝑟! | 𝒛!( /)
! )|. 902
Projections from cortical area 𝑖 to cortical area 𝑗 were computed as 903
projK→L = t 1W𝑒K,!
+ < 𝑒L,!
+ Y ⋅ |𝑒K,!
+ − 𝑒L,!
+ |
+,!
. 904
Code availability 905
Code: https://www.github.com/FrancisCrickInstitute/sparks (coming soon). 906
907
908
909
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
Acknowledgements
910
Primary funding for this project was provided by the Biotechnology and Biological Sciences Research 911
Council (BBSRC, award ref. BB/X013472/1) and the Francis Crick Institute (award ref. CC2118). The 912
authors wish to thank O. Simeone, A. Schaefer, S. Tootoonian and M. Kollo for their discussions and 913
comments on the manuscript. 914
915
Author contributions 916
F.I. and S.S. co-supervised the work. N.S. developed the algorithm, wrote the code, performed the 917
simulations and wrote the manuscript. N.G. refined the algorithm for calcium data. F.I. and S.S. revised the 918
manuscript. 919
920
Materials
& Correspondence 921
Requests should be addressed to N. Skatchkovsky. 922
.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint
923 Figure S 1(A) Detailed illustration of the computation of the Hebbian attention coefficients via STDP. (B) Schematic of 924 the conventional dot-product attention layer, adapted from [17]. 925 926
!!",%&'(=0!!",%&)*+=0!!",%=0!!",,&'(=!!",%&'(⋅1−,-! +0⋅)!"&'( !!",,&)*+=!!",%&)*+⋅1−,-! +*⋅)!"&)*+ !!",,=!!",%−*⋅!!",,&'(+0⋅!!",,&)*+ !!",.&'(=0+*⋅)!"&'( !!",.&)*+=)!"&)*+⋅1−,-! +0 !!",.=0−0⋅!!",.&'(+*⋅!!",.&)*+ !!",/&'(=)!"&'(⋅1−,-! !!",/&)*+=)!"&)*+⋅1−,-! .+*⋅)!"&)*+ !!",/=!!",.−*⋅!!",/&'(+0⋅!!",/&)*+ ‘pre-syn.’ neuron +,-=1)!"&'(&&',4*+,&&',4*-./‘post-syn.’ neuron j--&&',4 -)!"&)*+AInputsInput embedding+%+%3%4MatMulMatrix multiplicationScale &SoftMaxPositional encodingBSTDP coefficient
927 Figure S 2 (A) Power spectral density (PSD) of the phase of the oscillation of the signal, as defined in [31]. (B) Latent 928 embeddings obtained with the alternative symmetric rule. (C) Latent embeddings obtained with non-learnable parameters 929 𝜏34!56,𝜏34!7$&,𝑤34!56 and 𝑤34!7$& for all synapses 𝑖→𝑗. (D) Illustration of the angle used to recover the phase of the 930 oscillations from the latent embeddings. (E) Phase signal recovered from the latent embeddings. (F) Oscillatory signals 931 recovered by population decoding of the spike trains with different sortings of the neurons. (G) Corresponding PSDs. 932 933 934 935 936
937 Figure S 3 (A) Example latent embeddings obtained from training models to decode the position of the animal’s hand. 938 From left to right: Alternative model with conventional dot-product attention, VAE with a traditional ANN encoder, VAE 939 with a recurrent neural network (RNN) encoder. (B) Comparison of 𝑅# scores for prediction of hand position. Orange 940 lines represent the median over 5 runs, and each dot an individual run. From left to right: SPARKS with a 3-dimensional 941 latent space, alternative model with conventional dot-product attention for a 3-dimensional latent space, CEBRA with a 942 16-dimensional latent space, VAE with a traditional ANN encoder for a 3-dimensional latent space, VAE with a recurrent 943 neural network (RNN) encoder for a 3-dimensional latent space. (D) Prediction 𝑅# scores for unsupervised learning 944 across prediction window length 𝜏". Shaded areas represent standard deviation over 5 individual runs. (E) Similarity 𝑅# 945 scores for supervised learning across different reference signals. (F) Consistency vs. prediction 𝑅# scores for the same 946 target variables. Each value represents the mean over 5 individual runs. (G) Consistency of latent embeddings for 947 supervised learning across hyperparameter values. r: Pearson correlation coefficient. p: p-value, t-test. 948 949
950 Figure S 4 (A) 1s decoding accuracy as a function of number of neurons in a session. Error bars represent standard 951 deviation across 5 repetitions. r: Pearson correlation coefficient. (B) Latent embeddings averaged across examples from 952 blocks 1 and 2. (C) 1s decoding accuracy as a function of number of randomly sampled neurons. Error bars represent 953 standard deviation across 5 repetitions. (D) Normalised Euclidian distance between the average firing rate per frame in 954 block 1 across 1,000 neurons. (E) Similarity 𝑅# scores between the latent embeddings obtained for different example 955 sessions, averaged over 5 repetitions and 10 movie presentations from block 1. (F) Evolution of the 1s decoding accuracy 956 during fine-tuning on a single session, normalised to the decoding accuracy of a model trained on this session only. (G) 957 Latent embeddings averaged across examples from blocks 1 for 5 example sessions. 958
AB
C
Block 1 framesBlock 2D
Session 0Session 1Session 2Session 3Session 4
E
GF
Text is read by the "Ask this paper" AI Q&A widget below.
Extraction quality varies by source — PMC NXML preserves structure
cleanly, OA-HTML may include some navigation residue, and OA-PDF can
have broken hyphenation. The publisher copy
(via DOI)
is the canonical version.