A Biologically Inspired Attention Model for Neural Signal Analysis

doi:10.1101/2024.08.13.607787

A Biologically Inspired Attention Model for Neural Signal Analysis

2024 · doi:10.1101/2024.08.13.607787

preprint OA: gold CC-BY-NC-ND-4.0

📄 Open PDF Full text JSON View at publisher

Full text 98,238 characters · extracted from oa-pdf · 13 sections · click to expand

Abstract

11 12 Understanding how the brain represents sensory information and triggers behavioural responses is a 13 fundamental goal in neuroscience. Recent advances in neuronal recording techniques aim to progress towards 14 this milestone, yet the resulting high dimensional responses are challenging to interpret and link to relevant 15 variables. In this work, we introduce SPARKS, a model capable of generating low dimensional latent 16 representations of high dimensional neural recordings. SPARKS adapts the self-attention mechanism of large 17 language models to extract information from the timing of single spikes and the sequence in which neurons 18 fire using Hebbian learning. Trained with a criterion inspired by predictive coding to enforce temporal 19 coherence, our model produces interpretable embeddings that are robust across animals and sessions. 20 Behavioural recordings can be used to inform the latent representations of the neural data, and we 21 demonstrate state-of-the-art predictive capabilities across diverse electrophysiology and calcium imaging 22 datasets from the motor, visual and entorhinal cortices. We also show how SPARKS can be applied to large 23 neuronal networks by revealing the temporal evolution of visual information encoding across the hierarchy of 24 the visual cortex. Overall, by integrating biological mechanisms into a machine learning model, we provide a 25 powerful tool to study large-scale network dynamics. SPARKS’ capacity to generalize across animals and 26 behavioural states suggests it is capable of estimating the internal latent generative model of the world in 27 animals, paving the way towards a foundation model for neuroscience. 28 29

Introduction

30 31 Recent advancements in both imaging techniques and electrophysiology have enabled neuroscientists to 32 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint capture neuronal activity at an unprecedented scale [1]. Experimental scientists now benefit from the 33 capacity to simultaneously record hundreds to thousands, and even a million neurons over extended periods 34 [2], [3], [4], [5]. Interpreting this neural activity and relating it to sensory, motor, and cognitive processes is a 35 key goal of neuroscience [6], yet this task remains inherently challenging for a number of reasons. 36 First of all, neural recordings are high dimensional by nature, both spatially – because of the number of 37 neurons recorded, and temporally – because of high temporal resolution or long recording times, making 38 them intractable to us. To circumvent this issue, previous studies often relied on single-neuron or average 39 population analysis, and averaged spike trains to consider firing rates [7], [8]. Whilst convenient for analysis, 40 doing so masks the information found at the level of single neurons and single spikes, and relies on arbitrary 41 choices of bin width for averaging. This results in loss of information which can be crucial for analysing 42 neuronal recordings [9], for instance to understand physiological processes with multiple relevant timescales. 43 44 Unfortunately, conventional signal processing techniques are ill-suited to operate on sparse, discrete, and 45 non-uniformly sampled spike trains, making them inappropriate for the analysis of neural data. The 46 development of effective mathematical frameworks to analyse such data is a nascent area of research [10]. 47 Additionally, a substantial portion of the variance in neural recordings can typically be attributed to 48 behavioural parameters like uninstructed movements, making it harder to find relevant information pertaining 49 to e.g. sensory processing [11]. Lastly, there is substantial variability across trials, sessions and animals 50 reported by numerous studies when trying to link neural activity to relevant metrics [12], [13]. Altogether, 51 these obstacles render the task of making sense of the high dimensional neuronal responses very challenging. 52 53 Focusing on low-dimensional representations of neural data has shown promise, as previous research 54 indicates that information encoded by neural activity in several brain regions and across behavioural tasks 55 can often be represented by a few dimensions [8], [14], [15]. Several machine learning (ML) techniques have 56 been employed to derive meaningful and exploitable low-dimensional representations of high-dimensional 57 neural data [16]. However, these meth58 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint 59 ods also often ignore precise spike timing and the intrinsic sequential structure of neural data. This leads to 60 loss of critical temporal information and requires extensive pre-processing and manual curation, limiting 61 standardization and reproducibility [7]. 62 63 To address these challenges, we propose to leverage recent developments in the processing of sequential data 64 via machine learning models, notably with the Transformer model powered by self-attention mechanisms 65 [17]. Self-attention consists in computing relationships between parts of an input signal in order for the model 66 to determine which are important for the task. Following their success in natural language processing (NLP) 67 [3] and multimodal data processing [18], Transformers have been recently applied to neural recordings for 68 behaviour prediction, incorporating precise spike times and unit identities [19]. 69 70 In this work, we propose to use the capabilities of Transformers by designing a self-attention mechanism that 71 leverages single spikes as tokens, that is, individual units in the signal. The Hebbian attention layer 72 introduced in this work relies on computing relationships between spike trains using the biologically inspired 73 mechanism of Hebbian learning. Our main focus is to obtain expressive and robust latent representations to 74 analyse neural recordings across sessions and animals, with the added benefit of state-of-the-art predictive 75 power. Consequently, we rely on a variational autoencoder (VAE) model [20] to produce the desired 76 Figure 1 Overview of SPARKS. (A) Architecture of the proposed variational autoencoder. Blocks 𝝁(⋅) and 𝝈(⋅) represent the mean and standard deviation of the latents distribution. (B) Illustration of the predictive conditional distribution: at every time-step, a window 𝜏! of past latent embeddings is used to predict the future 𝜏" samples from the

Reference

signal (C) Detailed view of the Hebbian attention layer. (D) Schematic of the attention blocks, adapted from [17]. (E) Encoder architecture. .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint embeddings, rather than a Transformer architecture (Figure 1A). Note that conventional, masked VAEs were 77 recently proposed to model neural distributions [21]. Our approach differs in two ways: firstly, instead of 78 fully connected and conventional layers, the encoder consists of attention blocks, the first of which is the 79 proposed Hebbian attention block. Secondly, we introduce a novel criterion to train the system: taking 80 inspiration from predictive coding, the main objective of the approach is to maximize the probability for the 81 system to predict the correct output for a number of time-steps in the future given the current latent 82 embeddings (Figure 1B, Methods). Importantly, a single model can be used to directly compare the latent 83 embeddings obtained from recordings across sessions or animals within the same latent space. 84 85 In this work, we have developed a Sequential Predictive Autoencoder for the Representation of spiKing 86 Signals (SPARKS), and demonstrate that the approach offers several key advantages: training on raw spiking 87 signals or calcium imaging data with minimal preprocessing; directly interpretable low-dimensional latent 88 embeddings; state-of-the-art predictive power from few neurons; unsupervised exploration of the data; and 89 robust encodings across repetitions, sessions, animals, and brain areas. We showcase SPARKS’ capabilities 90 using diverse datasets, highlighting its potential to uncover the underlying topology of neuronal 91 representations and its generalisation capabilities across different sessions and animals. We attribute the wide 92 abilities of SPARKS to the use of a Transformer encoder, equipped with our biologically inspired Hebbian 93 attention layer, and training via the proposed objective, which in combination allows us to sequentially 94 extract information at the single spike level with unparalleled precision. 95 96 Obtaining Latent Embeddings using a Variational Autoencoder 97 with Hebbian Attention 98 To obtain consistent latent embeddings that capture most of the variance from high dimensional neuronal 99 responses, SPARKS combines a variational autoencoder with the Hebbian attention layer and predictive 100 learning rule introduced in this work ([10], Figure 1A-B). Following the seminal work [17], the encoder 101 comprises several attention blocks, each composed of an attention layer (Figure 1C), followed by a fully 102 connected feedforward network with residual connections and batch normalisation (Figure 1D-E). The first 103 attention block implements the Hebbian self-attention mechanism that we detail in the next section, with the 104 following blocks implementing conventional dot-product attention ([17], Figure S 1B). To obtain the latent 105 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint representation of the input data, the output of the last attention block is flattened and projected into the 106 desired latent dimension 𝑑. Throughout, the decoder is a fully connected feedforward neural network, tasked 107 to predict a desired reference signal. The reference signal can be chosen as a general supervisory signal such 108 as behavioural data or sensory inputs to perform supervised learning, or as the input spike train itself to 109 perform unsupervised learning. As we will demonstrate, both approaches can be used in complement of each 110 other. When trained via unsupervised learning, our model already captures most of the variance encoded in 111 the neural data, which allows one to freely explore the data using the low-dimensional representations 112 produced by the encoder. This is particularly interesting to investigate how complex combinations of external 113 and internal parameters are encoded in the data. Such an analysis can be completed by training the model via 114 supervised learning to understand the encoding of external parameters that were recorded during the 115 experiments. 116 117 We train the proposed autoencoder through a novel objective which is obtained by extending causally 118 conditioned distributions [22], [23]. Causally conditioned distributions model the probability of a random 119 variable (RV) given its own past and that of other RVs of interest: in this context, we maximise the 120 probability to output the desired reference signal given the latent embedding produced by the encoder. More 121 specifically, we introduce a novel criterion based on the predictive causally conditioned (PCC) distribution, 122 which consists in maximising the likelihood of future samples from the reference signal given a window of 123 past samples from the latent space (Figure 1B). The reference signal can be chosen either as a relevant 124 timeseries such as behavioural data, or as the input neural signal (Methods). The use of the PCC is inspired 125 by the predictive coding theory, namely that the encodings produced by the generative model in the brain 126 constantly aim at predicting future sensory inputs. 127 128 Applying Self-Attention to Neural Data 129 Self-attention is a key component of the Transformer architecture [17] which consists in computing a 130 similarity measure between different parts of a given input signal to extract contextual information. For 131 instance, in NLP, self-attention measures, for each word in a sentence, the relative importance of the other 132 words to obtain semantic information. How to apply this technique to neural data remains an open question. 133 A growing corpus of works have directly applied traditional attention mechanisms to spiking signals [24], 134 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint [25], [26], but they typically relied on the implicit assumption that spike trains were obtained by the encoding 135 of natural, real-valued signals into spikes. Such signals possess specific spatial and/or temporal structures, but 136 it is unclear if such assumptions stand for neural recordings. 137 138 To overcome this challenge, we propose a self-attention mechanism for neural data based on a biologically 139 inspired principle: we make use of Hebbian mechanisms, which have long been suggested as one of the key 140 mechanisms underlying learning in biological neural networks [27], [28]. This allows us to apply attention 141 spatially across neurons at every time-instant in the recording, with temporality maintained in the model via 142 the computation of eligibility traces for every pair of neurons, that act as built-in temporal encodings (Figure 143 1C, Figure S 1A). This is unlike conventional dot-product attention, which relies on various temporal 144 embedding techniques to introduce temporality across tokens presented simultaneously to the model ([17], 145 Figure S 1B). 146 147 More specifically, we compute spatial attention between neural sequences as the synaptic coefficients 148 obtained at every time instant via spike-time dependent plasticity (STDP). These coefficients are computed 149 between each pair of neurons from the recording of interest, regardless of prior knowledge of connectivity 150 between neurons. The computation of attention coefficients is carried out by maintaining eligibility traces for 151 each pair of neurons, which are calculated using weighted averages of the neurons’ spiking history: they are 152 potentiated when either neuron spikes, and decay exponentially otherwise (Figure 1B). The time constant of 153 the exponential decay and the values by which the pre- and post-synaptic traces are updated at spike times 154 could be measured [29] but are a priori unknown, and we propose to learn these values via backpropagation 155 individually for each pair of neurons. Our method can also be readily adapted to operate using calcium traces 156 obtained through imaging (Methods). The attention coefficients obtained via STDP are stored in a matrix 157 𝑨! ∈ ℝ"×", which can be likened to the dot-product between keys and queries in the conventional dot-158 product attention [17] (Figure 1C, Figure S 1B). These values are then projected into a matrix 𝑾$ of 159 dimension 𝑁 × 𝑑% where 𝑑% is chosen such that 𝑑% ≪ 𝑁 to reduce computational complexity when the 160 number of neurons is high. 161 162 As we will demonstrate throughout, the implementation of an attention mechanism via causal STDP enables 163 the model to capture sequential information. This is key to analysing neural data from brain regions where 164 information is encoded into the precise order of spikes, such as in the entorhinal-hippocampal system during 165 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint spatial navigation [30], as highlighted in the following section. 166 STDP as an Attention Mechanism Captures Sequential Information 167 168 Figure 2 SPARKS reveals oscillatory sequences from MEC data. (A) Illustration of calcium recording in the MEC of 169 head-fixed mice freely running on a wheel, reproduced from [31]. (B) Raster plot of thresholded calcium signals of all 170 cells in an example session (bin size = 129 ms, 𝑛 = 484 cells). Cells were sorted according to their correlation values 171 with one arbitrary cell, in a descending manner. Minutes-long oscillatory sequences appear after sorting. (C) Snippet of 172 pre- and post- synaptic firing patterns (top) and corresponding coefficients obtained after application of STDP (middle), 173 and an alternative symmetric rule (bottom). (D) Raster plot for left-out data (left) and corresponding predicted spike 174 probabilities after training with SPARKS (centre), and the benchmark symmetric rule (right). (E) Two-dimensional 175 embeddings obtained from SPARKS for the whole session after unsupervised training. (F) Oscillatory sequences after 176 ordering of the neuron obtained via SPARKS (top), correlation (middle), and PCA sorting (bottom), and corresponding 177 power spectral densities (PSDs, right). 178 We begin by motivating the use of STDP over alternative Hebbian rules to capture sequential information in 179 neural data. To do so, we analysed recently published data from the medial entorhinal cortex (MEC), where 180 the authors uncovered oscillatory sequences that are propagated through the MEC in the scale of tens of 181 seconds to minutes (Figure 2A-B, [31]). In such sequences, the information is encoded both spatially and 182 temporally, and analysing them requires keeping track of which neurons fired and in what order. We 183 evaluated the relevance of the proposed attention mechanism by comparing it to an alternative symmetric 184 rule, whereby attention coefficients are potentiated regardless of the order in which pre- and post-synaptic 185 neurons fired (Figure 2C). 186 187 We considered a single session recorded with calcium imaging, that lasted roughly 3,600 seconds, and 188 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint trained both SPARKS and the alternative rule via unsupervised learning using the first 3,100 seconds of the 189 recording, to test them on the remaining 500 seconds. We found that both rules obtained a similar test binary 190 cross-entropy (BCE) score, but the symmetric rule was unable to recover the oscillatory sequences from the 191 spiking sequence, as measured by the oscillatory score proposed in [31] (Figure 2D, Figure S 2A, Methods). 192 Furthermore, we found that the 2-dimensional embedding of this data produced by SPARKS follow circular 193 trajectories, hence recovering the ring-shaped topology uncovered in [31] (Figure 2E). This was obtained 194 regardless of whether we ordered the neurons prior to training the model, which is necessary to highlight the 195 oscillatory sequences from this data. We were not able to recover such structure from the embeddings 196 produced by the model with the symmetric rule (Figure S 2B). Interestingly, we also found that if the time 197 constant of the exponential decay 𝜏& and the values by which the pre- and post-synaptic traces are updated at 198 spike times were not learnable, the embeddings produced did not allow us to recover oscillations (Figure S 199 2C). 200 201 From the latent embeddings produced by SPARKS, we were able to retrieve the phase of the oscillations, 202 which allowed us to order neurons according to their correlation to the phase signal (Figure S 2D, Methods), 203 and to obtain clear oscillatory sequences (Figure 2F). To assess the quality of the resulting sorting [31], we 204 calculated the height of the largest peak in the power spectral density (PSD) of the 1-dimensional signal 205 obtained by population decoding of the spike train (Figure 2F, Figure S 2F-G, Methods). SPARKS’ sorting 206 had a peak power of 0.05 mV'. Hz( ) , outperforming the sorting obtained through correlation (0.035 207 mV'. Hz( ) ) and principal component analysis (PCA, 0.007 mV'. Hz( ) ). Overall, by incorporating STDP to 208 capture temporal information from neuronal recordings, SPARKS can uncover the underlying topology of 209 neuronal representations. In the following sections, we further illustrate the numerous capabilities of the 210 model by using datasets acquired from various brain regions through different recording techniques. 211 212 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint Expressive, Low-Dimensional and Robust Encodings 213 214 Figure 3 Expressive latent embeddings and state-of-the-art prediction from somatosensory cortex data. (A) Illustration 215 of a single trial in the reaching task. (B) Average hand positions and corresponding latent embeddings obtained through 216 SPARKS supervised, CEBRA supervised and SPARKS unsupervised learning. (C) Comparison of 𝑅# scores for prediction 217 of hand position. Orange lines represent the median over 5 runs, and each dot an individual run. (D) Evolution of 218 supervised prediction score with latent space dimensionality. (E) Unsupervised vs. supervised prediction scores for the 219 same target variables. Each value represents the mean over 5 individual runs. (F) Consistency of latent embeddings for 220 supervised learning across hyperparameter values: past window (𝜏!), future window (𝜏") and STDP temporal decay (𝜏$). 221 (G) Example unsupervised latent embeddings across different model initialisations. (H) Consistency of latent embeddings 222 for unsupervised learning across prediction window length 𝜏". Shaded areas represent standard deviation over 5 223 individual runs. ***: p < 0.001, Mann-Whitney U-test. r: Pearson correlation coefficient. p: p-value, t-test. 224 225 Next, we consider data obtained from electrophysiology recordings in Brodmann’s area 2 of the 226 somatosensory cortex in monkeys performing a centre-out task (Figure 3A). We use a single recording from 227 the Area2 Bump data set [32], comprising 65 neurons, which was proposed as part of the Neural Latents 228 Benchmark competition [33]. We trained SPARKS via supervised learning to decode the continuous position 229 of the animal’s hand, unless mentioned otherwise. 230 231 One of our main goals is to produce latent representations of the data that can be directly related to relevant 232 parameters. In this case, the latent embeddings obtained via supervised learning are seen to closely match the 233 trajectories of the animal’s hand (Figure 3B). Interestingly, even when training the model via unsupervised 234 learning, we found that the latent trajectories produced by SPARKS are clearly separated across different trial 235 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint conditions. The state-of-the-art model [16] was unable to produce embeddings that can be visually matched 236 to the desired reference trajectories, despite providing representations that are clearly separated across trial 237 directions (Figure 3B). To evaluate the role of the proposed attention mechanism, we trained alternative 238 models using the proposed objective. We found that traditional autoencoders did not produce latent 239 embeddings that allowed us to distinguish between different target directions, regardless of the encoder being 240 recurrent or not (Figure S 3A), which we assumed to be necessary to process time-series data. We also 241 trained a model for which we replaced our Hebbian attention layer with conventional dot-product attention. 242 This model produced less visually informative embeddings, notably at the movement onset (Figure S 3A). 243 Alongside the findings from the previous section, these results motivate the use of the Hebbian attention layer 244 introduced in this work to produce low-dimensional representations of neural recordings. 245 246 247 SPARKS also offers very high predictive power, allowing us to obtain highly accurate reconstruction of tasks 248 variables at a single trial level, such as the trajectory of the hand position in this reaching task (Figure 3C). 249 When training the model via supervised learning, predictive power was quantified as the 𝑅' score obtained 250 using 20% of examples held out for testing (Figure 3C, Methods). This metric is directly obtained from the 251 decoder’s outputs, and we obtain 𝑅' = 91.2% for the decoding of hand position with 𝑑 = 3 latent 252 dimensions. SPARKS outperformed the state-of-the-art benchmark [16], for 3, and even 16 latent dimensions 253 (Figure 3D, R' = 65%, 𝑝 < 0.001; and R' = 88.6%, 𝑝 < 0.001, respectively; one-sided Mann-Whitney U-254 test), as well as all the other models that we implemented (Figure S 3B). For the model proposed in reference 255 [16], predictions were obtained using an auxiliary prediction model, namely a k-nearest neighbour (KNN) 256 regression model. When training SPARKS via unsupervised learning, we can similarly decode hand positions 257 from the latent embeddings via a KNN model. Remarkably, we found that our unsupervised model still 258 achieves higher accuracy than the supervised version of the benchmark, achieving R' = 68.2% with 3 latent 259 dimensions (Figure 3D, 𝑝 < 0.001, one-sided Mann-Whitney U-test). We hence investigated how model 260 parameters affected the prediction capabilities of SPARKS. We found that increasing the length of the past 261 samples window 𝜏* from 1 to 10 ms improved the performance of the model, but performance plateaued 262 beyond this value (Figure S 3C), hinting at relevant timescales of population code and synaptic transmission. 263 For supervised learning, increasing the value of the prediction window 𝜏+ decreases the prediction 264 performance of the model, although it strongly improved model performance for unsupervised learning 265 (Figure S 3D). We also found that selecting a value of the temporal decay 𝜏& that is relevant to the timescale 266 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint of the behaviour was necessary for the model to perform well, with values in the range of 100 to 500 ms 267 performing best. Finally, increasing the latent dimension slightly increased the accuracy of the model, 268 reaching 𝑅' = 93% with a 64-dimensional embedding (Figure 3D), which is relevant to model higher-269 dimensional behaviours [34]. However, note that the high performance SPARKS consistently achieves with 270 only three latent dimensions is key for obtaining highly informative visualizations of neuronal 271 representations. 272 273 In the absence of relevant parameters to perform supervised learning, our model still offers the capacity to 274 uncover the information encoded in the neuronal responses in an unsupervised fashion. To illustrate this, we 275 trained KNN regression models to predict various behavioural variables from the embeddings produced via 276 unsupervised learning. Specifically, we selected the five variables—hand position and velocity, target 277 direction, moment of force applied to the controller, and muscle length—that exhibited the highest 278 unsupervised accuracy. Subsequently, we independently trained our model via supervised learning using each 279 variable as reference signal. We found that the scores obtained from supervised learning exhibit a remarkably 280 high positive correlation with those obtained from unsupervised learning (Pearson correlation coefficient 𝑟 =281 0.91, 𝑝 < 0.01, Figure 3E). The unsupervised model hence encodes into the latent space information about 282 the behavioural parameters found in the neural signal, weighted in accordance with their importance in the 283 spike trains. The resulting latent trajectories can be used to investigate biological questions, for instance by 284 exploring which dimensions encode different trial types, outcomes, or sensory information. 285 286 Finally, the encodings obtained with SPARKS are robust, both when obtained through supervised and 287 unsupervised learning. To quantitatively evaluate consistency across latent spaces, we trained linear 288 regression models between the embeddings obtained across different repetitions of the training procedure 289 with random initialisations, and evaluated the resulting predictive 𝑅' score. For supervised learning, the 290 consistency of the embeddings also remains high for all reference signals (Figure S 3E), with a positive 291 correlation between supervised reconstruction accuracy and consistency (Figure S 3F). This suggests that the 292 more informative the neural recording is about the reference signal, the more the latent representation is 293 constrained by the data. We also found very high levels of consistency for a large range of past samples 294 window length 𝜏*, predictive window 𝜏+ and STDP temporal decay 𝜏& (Figure 3F), but also other model 295 parameters (Figure S 3G). However, the latent representations derived from distinct behavioural variables 296 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint cannot consistently be related to one another via linear regression (Figure S 3E), hinting at the fact that the 297 information encoded for each variable resides within separate lower-dimensional manifolds. When training 298 via unsupervised learning with different random weight initialisations, SPARKS also produces embeddings 299 that are visually similar across repetitions (Figure 3G). Increasing the number of future samples predicted up 300 to 𝜏+ = 10 ms improves the consistency of embeddings across random initialisations (Figure 3H), which can 301 be linked to the optimal value for the parameter 𝜏* for supervised learning. Taken together, these results 302 demonstrate how SPARKS produces meaningful and robust low-dimensional representations of neural data, 303 providing state-of-the-art performance with only tens of neurons. 304 305 Reconstructing Visual Stimuli in Passive Mice 306 307 308 Figure 4 Frame reconstruction and prediction from primary visual cortex recordings. (A) Schematic of Neuropixels 309 recordings in passive mice viewing a naturalistic film. (B) Illustration of the frame reconstruction task: SPARKS was 310 trained to reconstruct actual frames from neural recordings. (C) Example frames from the naturalistic movie and 311 corresponding reconstructions using 100 neurons recorded from mouse primary visual cortex. (D) 1s decoding accuracy 312 as a function of number of neurons for prediction within block (yellow), across two recording blocks separated by 90 313 minutes (brown), and when training on data from both blocks (black). Within block decoding accuracy is shown for [16] 314 (pink). Error bars represent standard deviations over five repetitions with random sampling of individual units from all 315 recordings. (E) Representative examples of two-dimensional embeddings obtained from the neuronal responses to 316 natural movies. From left to right: SPARKS trained for frame reconstruction, SPARKS trained for frame prediction, 317 CEBRA trained for frame prediction, SPARKS trained via unsupervised learning. (F) Normalised pairwise distances 318 between frames, computed as inverse, normalised 2-dimensional correlation coefficients; and normalised Euclidian 319 distances between 2-dimensional frame embeddings produced by each model, obtained on a single test example. Same 320 order as in (D). 321 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint 322 Next, we asked how our model can be used to decode higher-dimensional stimuli such as natural movie 323 responses, as opposed to low-dimensional parameters such as the direction of the target in the previous task. 324 To this end, we analysed Neuropixels recordings in the primary visual cortex obtained from the Allen visual 325 coding dataset [35]. We focused on passive recordings in mice viewing a 30 second excerpt from the opening 326 of the movie Touch of Evil (Figure 4A). In each session, the movie was presented 20 times in two blocks of 327 10 repetitions, separated by approximately 90 minutes. As illustrated in Figure 4B, we tackled the task of 328 reconstructing individual movie frames from neural activity. To do so, we constructed a pseudo-mouse by 329 stacking the recordings from individual units, chosen at random across all 58 sessions. We started by training 330 SPARKS on the first 9 presentations of the movie from the first block, leaving the last one out for testing, a 331 procedure that we refer to as testing within block. We show examples of reconstructed frames in Figure 4C, 332 demonstrating that the model captures high-level scene context, as well as small, fast-moving details (Suppl. 333 Movie 1). 334 335 For numerical evaluation, we tested our model on the easier task of predicting frame IDs (i.e., indices ranging 336 from 0-900, 30s movie, 30 fps, Figure 4D). Note that the state-of-the-art reference [16] used as benchmark 337 did not operate directly on the recordings, but first extracted features from the neural data using a 338 Transformer network [16]. Yet, for prediction within block, our model is seen to outperform the best results 339 obtained with the benchmark ([16], 87.4% accuracy, 𝑁 = 1,000) using 5 times fewer neurons (SPARKS, 340 92.3% accuracy, 𝑁 = 200, 𝑝 < 0.01, one-sided Mann-Whitney U-test), suggesting that SPARKS is able to 341 capture key information from the neural data within a small neuronal population. 342 343 We also tackled the more challenging task of prediction across blocks. To this end, we tested our model on 344 the second set of 10 presentations after training on all examples from the first block, with the two blocks 345 separated by 90 minutes. Analysing neural activity recorded over periods ranging from tens of minutes to 346 hours or more is known to be a difficult task due to a phenomenon known as neural drift [36], [37], [38], 347 [39]. Despite these challenges, we found that our model maintains remarkably high accuracy, and even 348 performs as well for predictions across blocks as it does within the same block when 𝑁 = 1,000 neurons are 349 used. 350 351 In the latent space, SPARKS assigns data points from test examples in the second block close to where 352 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint corresponding points from the training set lie in the embedding space (Figure S 4B). The model is hence 353 capable of ignoring the noise caused by neural drift to only extract the meaningful information from the 354 neural data. However, we obtained worse performance and higher variance across repetitions for prediction 355 across blocks when using fewer than 1,000 neurons, indicating that neurons are unevenly impacted by the 356 effects of neural drift across sessions and animals. We also found that training using the first 9 repetitions 357 from both blocks maintains the performance of the model (Figure 4D) compared to training using a single 358 block. This highlights the capacity of SPARKS to find robust latent representations across neural recordings 359 captured over long periods. Additionally, we performed frame prediction using spiking data obtained from 360 pre-processed calcium recordings sessions from the same dataset [40]. We found that SPARKS still performs 361 similarly when decoding information from calcium dynamics (Figure S 4C), establishing our algorithm as the 362 current state-of-the-art for prediction from both types of neural recordings. 363 364 To evaluate how the latent representations can give insights about the structure of the input information, in 365 this case the natural scenes, we compared the embeddings obtained through supervised and unsupervised 366 learning after training on the first block using 200 randomly chosen units. Training was carried out via 367 unsupervised learning, or via supervised learning by performing either frame reconstruction or classification. 368 After training via supervised learning, we found that the embeddings produced with both targets form neural 369 trajectories that are closely matched (Figure 4E). Although the neural trajectory obtained via unsupervised 370 learning differs from the other two, we still find that the embeddings of frames from different sections of the 371 movie are clearly distinct in the latent space. To quantitatively evaluate these embeddings, we measured 372 normalised pairwise distances between frame embeddings produced by each model. As a reference, we also 373 computed normalised distances between movie frames (Figure 4E, Methods). We find that in each case, the 374 frame embeddings produced by SPARKS are close in space for frames that are close in time. Remarkably, a 375 structure that can be likened to that of the movie correlations emerges not only for the supervised model, but 376 also for the one trained via unsupervised learning (Figure 4F). Conversely, despite enforcing very strong 377 structure, the embeddings produced by the benchmark model closely position in the latent space frames that 378 are distant in time or correspond to different movie scenes. We were not able to recover such structure from 379 either movie or neural correlations (Figure 4F, Figure S 4D), which implies that previous methods that ignore 380 the sequential nature of neural signals do not allow for their faithful representation. 381 382 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint Learning Across Sessions and Animals 383 384 385 Figure 5 SPARKS provides stable embeddings across sessions and animals. (A) 1s frame decoding accuracy for 386 prediction of frame IDs with SPARKS for 5 example sessions. Orange lines represent the median over 5 runs, and each 387 dot an individual run. (B) Schematic illustrating procedure for training over different recordings with SPARKS. The 388 multisession model is obtained by learning a different Hebbian attention layer for each session while keeping the rest of 389 the network shared across sessions. (C) Decoding accuracy of models trained on all sessions and tested on individual 390 sessions versus models trained and tested on single sessions. Each value represents the mean over 5 individual runs. (D) 391 Two dimensional embeddings corresponding to different sessions obtained from the multi-session model. (E) Frame 392 decoding accuracy as a function of model size. Orange lines represent the median over 5 runs, and each dot an individual 393 run. (F) Frame decoding accuracy as a function of finetuning epochs. Green trace corresponds to the pre-trained 394 dataset. Blue trace shows accuracy for a newly introduced dataset to an existing model. Shaded areas represent standard 395 deviation over 5 pre-trained models. 396 397 Despite recent developments [5], [41], obtaining stable chronic electrophysiological recordings in mice 398 remains very challenging due to technical limitations such as tissue inflammation and positional changes of 399 the electrodes over time. As a result, researchers often rely on acute recordings. Combining such recordings 400 is common practice and particularly relevant when the brain region of interest allows for the recording of 401 only a few neurons at a time [35], [42]. This approach, however, has the additional confounding effect of 402 masking differences in animal behaviour and state across trials and sessions, even when experimental 403 conditions and stimulation protocols are carefully controlled [43], [44]. Here, we illustrate how SPARKS can 404 tackle this challenge in the task of predicting frames IDs from the Allen Visual dataset. We first trained 405 models using all the neurons from 5 randomly selected sessions, each recorded within a single block. We 406 found a high variance in the predictive accuracy across sessions (Figure 5A). This effect is not explained by 407 the number of neurons in each recording (Figure S 4A, small negative correlation 𝑟 = −0.64, 𝑝 > 0.1), 408 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint reinforcing the idea that not all recordings from the same brain region are equally informative. Interestingly, 409 despite the high variance in the predictive power across sessions, the embeddings produced by the model 410 remain highly consistent, as measured by the predictive 𝑅' obtained after fitting a linear prediction model 411 between the embeddings produced from different recordings (𝑅' > 99% across all pairs, Figure S 4E). To 412 demonstrate robust biological findings, it is necessary to compare embeddings across sessions within the 413 same latent space, which may require extensive work when training the model on single sessions. As 414 discussed, stacking recordings to create pseudo-recordings with hundreds or thousands of units can mitigate 415 this issue. However, it is not always practical to do so, especially for tasks where trials vary in length due to 416 the varying response times of the animals. Combining recordings also ignores varying behavioural and 417 cognitive animal states, such as level of arousal. 418 419 To overcome this issue, SPARKS can be employed to encode data from several sessions within the same 420 latent space. This is achieved by learning a different Hebbian attention layer for each session while keeping 421 the rest of the network shared across sessions (Figure 5B). When applying this procedure to the same 5 422 sessions from the Allen visual dataset, we find that the resulting model maintains predictive accuracy similar 423 to that obtained when training on single sessions, with a very high positive correlation between the two 424 (Figure 5C, 𝑟 = 0.94, 𝑝 < 0.01). Furthermore, we find that the resulting embeddings from each session are 425 closely related (Figure 5D). This demonstrates the capability of our method to find a common latent space 426 which reliably represents the encoding of this natural stimulus within the primary visual cortex. Our results 427 are scalable across model dimensions, which remain small compared to modern LMM architectures (Figure 428 5E, [45], [46], [47]), hence offering the possibility to train our model without access to the latest GPU 429 architectures. 430 431 Finally, analysing new sessions from a previously trained model is straightforward via addition of a new 432 Hebbian attention layer and finetuning (Figure 5F). We find that the model reaches its final accuracy on the 433 new session within ~10 finetuning epochs, with no loss of accuracy when tested on the sessions it was pre-434 trained on, and reaches similar accuracy to that of a model trained on the single additional session (Figure S 435 4G). Overall, SPARKS is seen to offer an efficient, powerful and scalable solution to the analysis of neural 436 data across sessions and animals by bringing within the same latent space representations from different 437 datasets. This can then be used to explore how, for example, behavioural variables change the embeddings 438 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint across mice. 439 440 Uncovering the Functional Hierarchy of Mouse Visual Cortical 441 Areas 442 443 Figure 6 SPARKS recovers the hierarchy of the visual cortex. (A) Neuropixels recordings in passive mice viewing static 444 gratings. (B) Organisation of mice visual cortical areas. (C) Evolution of the average spatial frequency prediction error 445 over time for each visual cortical area. Lines represent average over 5 individual runs. (D) Average prediction error for 446 each visual cortical area and spatial frequency. (E) Evolution of the prediction error over time for each visual cortical 447 area and spatial frequency. Lines represent average over 5 individual runs, shaded areas represent standard deviations. 448 (F) Illustration of the computation of projection scores between visual cortical areas. (G) Projection scores between all 449 visual cortical areas. (H) Functional hierarchy obtained with SPARKS, calculated by computing the difference between 450 their incoming and outgoing projections. 451 452 An important capability of our model is its ability to leverage the temporal dynamics of neuronal data [48] to 453 reveal biological insights that can extend to large neuronal networks. To illustrate this, we studied multi-454 region interactions from Neuropixels recordings obtained from the mouse primary visual cortex (VISp) and 455 five higher-order visual cortical areas (HVAs), latero-medial area (LM), antero-lateral area (AL), rostro-456 lateral area (RL), postero-medial area (PM) and antero-medial (AM) from the Allen visual coding dataset 457 (Figure 6A-B) [49]. We trained separate models using 700 neurons recorded in each HVA from awake, head-458 fixed mice passively presented with static gratings with various spatial frequencies. We trained SPARKS to 459 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint predict the spatial frequency of the grating presented to the animals at each time-step with a temporal 460 resolution of 1 millisecond. We focused on spatial frequency because the time course of responses in the 461 primary visual cortex of rodents and primates has been linked to spatial frequency preference, consistent with 462 the coarse-to-fine theory of visual processing [50], [51], [52]. We reasoned that the temporal encoding of 463 spatial frequency across HVAs would be informative of the hierarchical organization of the visual system. As 464 seen in Figure 6C, the models trained using data from all HVAs attain high accuracy across spatial 465 frequencies, with a final average error lower than 0.05 cycles/degree. All of the models start their prediction 466 approximately 50ms from stimulus presentation, but show different dynamics, with VISp reaching its highest 467 accuracy first. We found that models trained on data from visual areas that are higher in the functional 468 hierarchy [49], [53] demonstrate slower dynamics but attain higher accuracies. This general picture hides 469 differences in the areal response to various spatial frequencies: for instance, VISp achieves the lowest error of 470 all areas only for the lowest spatial frequency (Figure 6D). There are also large differences in how quickly 471 gratings are encoded across areas and spatial frequencies; for example, area VISrl provides a very fast but 472 poor estimation of high spatial frequencies, while being more accurate but slower for medium spatial 473 frequencies (Figure 6E). 474 475 To uncover dynamic interactions between the different HVAs during bottom-up visual stimulation, we 476 formulated the following hypothesis: an area that generates an accurate representation of a given spatial 477 frequency will selectively transmit this information to the other areas it projects to, provided that these areas 478 have a higher prediction error for the same spatial frequency. In order to remain agnostic and avoid 479 assumptions about connectivity between these areas, we assume that all HVAs are densely connected to each 480 other, and for each pair of areas (𝑖, 𝑗), we computed a projection score from area 𝑖 to area 𝑗 as the differential 481 integral between the temporal prediction between HVAs 𝑖 and 𝑗 (Figure 6E). Considering only positive 482 interactions, we averaged across spatial frequencies and obtained a matrix of projection scores (Figure 6F, 483 Methods). Similarly to previous works [49], [54], we found that VISp is the area with most outgoing 484 projections, while area VISam receives the most inputs from other regions. We then computed hierarchy 485 scores for each area, defined as the difference between incoming projections to the area and outgoing 486 projections from that area. The functional hierarchy we obtained from the temporal encoding of spatial 487 frequencies (Figure 6H) matches the anatomical and functional organization of mouse cortical visual areas 488 previously described [49], [53], demonstrating SPARKS’ capability to account for the temporal dynamics of 489 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint neuronal signals to reveal large-scale network interactions. 490 491

Discussion

492 Motivated by the advancement of large-scale neuronal recordings and generative machine learning 493 techniques, we introduced SPARKS, a biologically inspired attention model for the analysis of neuronal 494 signals. This new technique can be used to obtain low dimensional embeddings that meaningfully represent 495 the information content from high dimensional data with minimum pre-processing. SPARKS addresses 496 previous limitations by catering to the sequential nature of neuronal signals, and processing information at the 497 single spike level with no need for data processing beyond spike sorting. The high predictive power and 498 latent embeddings obtained with our model allowed us to extract meaningful biological insights from existing 499 data, as validated by revealing temporal dynamics across brain regions, such as the MEC, somatosensory and 500 visual cortex during passive and active behaviours and across different recording techniques. 501 502 The core innovation in SPARKS’ architecture lies within the Hebbian attention layer, which adapts the highly 503 effective attention mechanism to process spiking signals. Implemented within Transformer architectures, 504 attention is at the centre of recent AI successes, with the development of highly performing large language 505 models (LLMs) and large multimodal models (LMMs) [45], [46], [47], [18]. Attention mechanisms 506 selectively focus on relevant input data, and by assigning weights to relevant inputs, emphasise important 507 features while suppressing irrelevant ones. This enables models to process information more efficiently and 508 effectively, improving performance in various tasks. Conventional attention has been proposed as a model of 509 sparse distributed memory, highlighting the biological relevance of the approach [55]. By taking the inverse 510 approach to derive an attention mechanism from a biological principle, we achieved state-of-the-art 511 performance on various benchmarks, demonstrating the relevance of using the synergies between AI systems 512 development and our biological understanding of intelligence. This is reinforced by the proposed learning 513

Objective

using predictive causally conditioned distributions, inspired by predictive coding in the brain, 514 which is key to obtaining highly reliable latent embeddings. This is in line with recent work highlighting the 515 benefits for LLMs to simultaneously predict several tokens [56]. 516 517 Our approach is further justified from the perspective of the Bayesian brain hypothesis, modelling the 518 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint autoencoder architecture as a Bayesian encoding-decoding machine [57]. The neural signal measured and 519 forming the inputs 𝒙 ∼ 𝑝(𝒙|𝒛, 𝒓) to our model depends on both the animal’s internal generative model of the 520 world 𝑝(𝒛|𝒓), and its sensory inputs 𝒓. In our model, the encoder performs the approximate posterior 521 inference 𝑝, (𝒛|𝒙, 𝒓) given the neural response to the current sensory inputs and parametric weights 𝒘. Due 522 to the noise inherent to neural coding and the probabilistic nature of the encoding 𝒛, the decoding network is 523 a Bayesian decoder tasked to estimate 𝑞, (𝒓|𝒛, 𝒙). By training encoding and decoding networks together, we 524 obtain latent embeddings that are highly consistent across animals and sessions. This aligns with the Bayesian 525 brain hypothesis that the brain’s internal model is calibrated through evolution and exposure to natural 526 stimuli, and evolves slowly during adulthood. We suggest that our model provides an estimate of the animal’s 527 latent generative model of the world, providing a new milestone towards a foundation model for neuroscience 528 [58]. 529 530 Although the memory requirements of the Hebbian attention layer scale quadratically with the number of 531 neurons in the recording, we demonstrated results considering up to 1,000 neurons trained on a single mid-532 range GPU. By approximating the costly backpropagation-through-time algorithm (Methods), our method is 533 agnostic to the recording duration in terms of memory requirements. As a result, SPARKS will be suitable for 534 analysing electrophysiological data as the scale of these recordings increases with technological 535 advancements. However, the technique cannot yet scale to the latest imaging datasets, comprising up to 536 1,000,000 neurons recorded simultaneously. A limitation of our approach is the computation of attention 537 coefficients between each neuron in the recording, which is not in line with the sparse connectivity measured 538 in the brain. To tackle the challenge of analysing such data, future research directions notably aim to 539 incorporate connectivity data into the model, and to use recent sparse approximations of traditional attention 540 mechanisms [59]. 541 542

References

543 544 [1] N. A. Steinmetz et al., ‘Neuropixels 2.0: A miniaturized high-density probe for stable, long-term brain 545 recordings’, Science, vol. 372, no. 6539, p. eabf4588, Apr. 2021, doi: 10.1126/science.abf4588. 546 [2] C.-H. Yu, J. N. Stirman, Y. Yu, R. Hira, and S. L. Smith, ‘Diesel2p mesoscope with dual independent 547 scan engines for flexible capture of dynamics in distributed neural circuitry’, Nat. Commun., vol. 12, no. 548 1, p. 6639, Nov. 2021, doi: 10.1038/s41467-021-26736-4. 549 [3] J. Demas et al., ‘High-speed, cortex-wide volumetric recording of neuroactivity at cellular resolution 550 using light beads microscopy’, Nat. Methods, vol. 18, no. 9, pp. 1103–1111, Sep. 2021, doi: 551 10.1038/s41592-021-01239-8. 552 [4] T. Nöbauer, Y. Zhang, H. Kim, and A. Vaziri, ‘Mesoscale volumetric light-field (MesoLF) imaging of 553 neuroactivity across cortical areas at 18 Hz’, Nat. Methods, vol. 20, no. 4, pp. 600–609, Apr. 2023, doi: 554 10.1038/s41592-023-01789-z. 555 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint [5] A. Yuan, J. Colonell, A. Lebedeva, M. Okun, A. Charles, and T. Harris, ‘Multi-day Neuron Tracking in 556 High Density Electrophysiology Recordings using EMD’, eLife, vol. 12, Dec. 2023, doi: 557 10.7554/eLife.92495.1. 558 [6] A. E. Urai, B. Doiron, A. M. Leifer, and A. K. Churchland, ‘Large-scale neural recordings call for new 559 insights to link brain and behavior’, Nat. Neurosci., vol. 25, no. 1, pp. 11–19, Jan. 2022, doi: 560 10.1038/s41593-021-00980-9. 561 [7] M. Safaie et al., ‘Preserved neural dynamics across animals performing similar behaviour’, Nature, vol. 562 623, no. 7988, pp. 765–771, Nov. 2023, doi: 10.1038/s41586-023-06714-0. 563 [8] J. A. Gallego, M. G. Perich, S. N. Naufel, C. Ethier, S. A. Solla, and L. E. Miller, ‘Cortical population 564 activity within a preserved neural manifold underlies multiple motor behaviors’, Nat. Commun., vol. 9, 565 no. 1, p. 4233, Oct. 2018, doi: 10.1038/s41467-018-06560-z. 566 [9] B. Agüera y Arcas, A. L. Fairhall, and W. Bialek, ‘Computation in a single neuron: Hodgkin and Huxley 567 revisited’, Neural Comput., vol. 15, no. 8, pp. 1715–1749, Aug. 2003, doi: 568 10.1162/08997660360675017. 569 [10] B. A. Moser and M. Lunglmayr, ‘Spiking Neural Networks in the Alexiewicz Topology: A New 570 Perspective on Analysis and Error Bounds’, May 09, 2023, arXiv: arXiv:2305.05772. doi: 571 10.48550/arXiv.2305.05772. 572 [11] C. Stringer, M. Pachitariu, N. Steinmetz, C. B. Reddy, M. Carandini, and K. D. Harris, ‘Spontaneous 573 behaviors drive multidimensional, brainwide activity’, Science, vol. 364, no. 6437, p. eaav7893, Apr. 574 2019, doi: 10.1126/science.aav7893. 575 [12] J. C. Marques, M. Li, D. Schaak, D. N. Robson, and J. M. Li, ‘Internal state dynamics shape brainwide 576 activity and foraging behaviour’, Nature, vol. 577, no. 7789, pp. 239–243, Jan. 2020, doi: 577 10.1038/s41586-019-1858-z. 578 [13] M. R. Cohen and J. H. R. Maunsell, ‘A Neuronal Population Measure of Attention Predicts Behavioral 579 Performance on Individual Trials’, J. Neurosci., vol. 30, no. 45, pp. 15241–15253, Nov. 2010, doi: 580 10.1523/JNEUROSCI.2171-10.2010. 581 [14] M. M. Churchland et al., ‘Neural population dynamics during reaching’, Nature, vol. 487, no. 7405, pp. 582 51–56, Jul. 2012, doi: 10.1038/nature11129. 583 [15] C. K. Machens, R. Romo, and C. D. Brody, ‘Functional, But Not Anatomical, Separation of “What” and 584 “When” in Prefrontal Cortex’, J. Neurosci., vol. 30, no. 1, pp. 350–360, Jan. 2010, doi: 585 10.1523/JNEUROSCI.3276-09.2010. 586 [16] S. Schneider, J. H. Lee, and M. W. Mathis, ‘Learnable latent embeddings for joint behavioral and neural 587 analysis’, Oct. 05, 2022, arXiv: arXiv:2204.00673. doi: 10.48550/arXiv.2204.00673. 588 [17] A. Vaswani et al., ‘Attention is All you Need’. 589 [18] S. Yin et al., ‘A Survey on Multimodal Large Language Models’, Apr. 01, 2024, arXiv: 590 arXiv:2306.13549. Accessed: Jul. 09, 2024. [Online]. Available: http://arxiv.org/abs/2306.13549 591 [19] M. Azabou et al., ‘A Unified, Scalable Framework for Neural Population Decoding’, Oct. 24, 2023, 592 arXiv: arXiv:2310.16046. doi: 10.48550/arXiv.2310.16046. 593 [20] D. P. Kingma and M. Welling, ‘Auto-Encoding Variational Bayes’, Dec. 10, 2022, arXiv: 594 arXiv:1312.6114. doi: 10.48550/arXiv.1312.6114. 595 [21] A. Schulz et al., ‘Modeling conditional distributions of neural and behavioral data with masked 596 variational autoencoders’, Apr. 25, 2024, bioRxiv. doi: 10.1101/2024.04.19.590082. 597 [22] G. Kramer, ‘Directed information for channels with feedback’, Doctoral Thesis, ETH Zurich, 1998. doi: 598 10.3929/ethz-a-001988524. 599 [23] N. Skatchkovsky, O. Simeone, and H. Jang, ‘Learning to Time-Decode in Spiking Neural Networks 600 Through the Information Bottleneck’, in Advances in Neural Information Processing Systems, Curran 601 Associates, Inc., 2021, pp. 17049–17059. Accessed: Dec. 04, 2023. [Online]. Available: 602 https://proceedings.neurips.cc/paper/2021/hash/8da57fac3313174128cc5f13328d4573-Abstract.html 603 [24] M. Yao et al., ‘Attention Spiking Neural Networks’, Sep. 28, 2022, arXiv: arXiv:2209.13929. Accessed: 604 Jul. 05, 2023. [Online]. Available: http://arxiv.org/abs/2209.13929 605 [25] Z. Zhou et al., ‘SPIKFORMER: WHEN SPIKING NEURAL NETWORK MEETS TRANSFORMER’, 606 2023. 607 [26] R.-J. Zhu, Q. Zhao, G. Li, and J. K. Eshraghian, ‘SpikeGPT: Generative Pre-trained Language Model 608 with Spiking Neural Networks’, Jun. 26, 2023, arXiv: arXiv:2302.13939. Accessed: Jul. 05, 2023. 609 [Online]. Available: http://arxiv.org/abs/2302.13939 610 [27] D. O. Hebb, The organization of behavior; a neuropsychological theory. in The organization of 611 behavior; a neuropsychological theory. Oxford, England: Wiley, 1949, pp. xix, 335. 612 [28] R. C. Malenka and and R. A. Nicoll, ‘Long-Term Potentiation--A Decade of Progress?’, Science, vol. 613 285, no. 5435, pp. 1870–1874, Sep. 1999, doi: 10.1126/science.285.5435.1870. 614 [29] G. Q. Bi and M. M. Poo, ‘Synaptic modifications in cultured hippocampal neurons: dependence on spike 615 timing, synaptic strength, and postsynaptic cell type’, J. Neurosci. Off. J. Soc. Neurosci., vol. 18, no. 24, 616 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint pp. 10464–10472, Dec. 1998, doi: 10.1523/JNEUROSCI.18-24-10464.1998. 617 [30] G. Buzsáki and E. I. Moser, ‘Memory, navigation and theta rhythm in the hippocampal-entorhinal 618 system’, Nat. Neurosci., vol. 16, no. 2, pp. 130–138, Feb. 2013, doi: 10.1038/nn.3304. 619 [31] S. Gonzalo Cogno et al., ‘Minute-scale oscillatory sequences in medial entorhinal cortex’, Nature, vol. 620 625, no. 7994, pp. 338–344, Jan. 2024, doi: 10.1038/s41586-023-06864-1. 621 [32] R. H. Chowdhury, J. I. Glaser, and L. E. Miller, ‘Area 2 of primary somatosensory cortex encodes 622 kinematics of the whole arm’, eLife, vol. 9, p. e48198, Jan. 2020, doi: 10.7554/eLife.48198. 623 [33] F. Pei et al., ‘Neural Latents Benchmark ’21: Evaluating latent variable models of neural population 624 activity’, Jan. 17, 2022, arXiv: arXiv:2109.04463. doi: 10.48550/arXiv.2109.04463. 625 [34] J. Manley et al., ‘Simultaneous, cortex-wide dynamics of up to 1 million neurons reveal unbounded 626 scaling of dimensionality with neuron number’, Neuron, vol. 112, no. 10, pp. 1694-1709.e5, May 2024, 627 doi: 10.1016/j.neuron.2024.02.011. 628 [35] S. E. J. de Vries et al., ‘A large-scale standardized physiological survey reveals functional organization 629 of the mouse visual cortex’, Nat. Neurosci., vol. 23, no. 1, pp. 138–151, Jan. 2020, doi: 10.1038/s41593-630 019-0550-9. 631 [36] H. Lütcke, D. J. Margolis, and F. Helmchen, ‘Steady or changing? Long-term monitoring of neuronal 632 population activity’, Trends Neurosci., vol. 36, no. 7, pp. 375–384, Jul. 2013, doi: 633 10.1016/j.tins.2013.03.008. 634 [37] T. D. Marks and M. J. Goard, ‘Stimulus-dependent representational drift in primary visual cortex’, Nat. 635 Commun., vol. 12, no. 1, p. 5169, Aug. 2021, doi: 10.1038/s41467-021-25436-3. 636 [38] L. N. Driscoll, N. L. Pettit, M. Minderer, S. N. Chettih, and C. D. Harvey, ‘Dynamic Reorganization of 637 Neuronal Activity Patterns in Parietal Cortex’, Cell, vol. 170, no. 5, pp. 986-999.e16, Aug. 2017, doi: 638 10.1016/j.cell.2017.07.021. 639 [39] S. Sadeh and C. Clopath, ‘Contribution of behavioural variability to representational drift’, eLife. 640 Accessed: Jul. 08, 2024. [Online]. Available: https://elifesciences.org/articles/77907 641 [40] D. Deitch, A. Rubin, and Y. Ziv, ‘Representational drift in the mouse visual cortex’, Curr. Biol., vol. 31, 642 no. 19, pp. 4327-4339.e6, Oct. 2021, doi: 10.1016/j.cub.2021.07.062. 643 [41] E. H. van Beest et al., ‘Tracking neurons across days with high-density probes’, Oct. 16, 2023, bioRxiv. 644 doi: 10.1101/2023.10.12.562040. 645 [42] N. A. Steinmetz, P. Zatka-Haas, M. Carandini, and K. D. Harris, ‘Distributed coding of choice, action 646 and engagement across the mouse brain’, Nature, vol. 576, no. 7786, pp. 266–273, Dec. 2019, doi: 647 10.1038/s41586-019-1787-x. 648 [43] A. Lak et al., ‘Reinforcement biases subsequent perceptual decisions when confidence is low, a 649 widespread behavioral phenomenon’, eLife, vol. 9, p. e49834, Apr. 2020, doi: 10.7554/eLife.49834. 650 [44] S. Musall, M. T. Kaufman, A. L. Juavinett, S. Gluf, and A. K. Churchland, ‘Single-trial neural dynamics 651 are dominated by richly varied movements’, Nat. Neurosci., vol. 22, no. 10, pp. 1677–1686, Oct. 2019, 652 doi: 10.1038/s41593-019-0502-4. 653 [45] H. Touvron et al., ‘LLaMA: Open and Efficient Foundation Language Models’, Feb. 27, 2023, arXiv: 654 arXiv:2302.13971. doi: 10.48550/arXiv.2302.13971. 655 [46] M. Enis and M. Hopkins, ‘From LLM to NMT: Advancing Low-Resource Machine Translation with 656 Claude’, Apr. 21, 2024, arXiv: arXiv:2404.13813. doi: 10.48550/arXiv.2404.13813. 657 [47] OpenAI et al., ‘GPT-4 Technical Report’, Mar. 04, 2024, arXiv: arXiv:2303.08774. doi: 658 10.48550/arXiv.2303.08774. 659 [48] W. Truccolo, L. R. Hochberg, and J. P. Donoghue, ‘Collective dynamics in human and monkey 660 sensorimotor cortex: predicting single neuron spikes’, Nat. Neurosci., vol. 13, no. 1, pp. 105–111, Jan. 661 2010, doi: 10.1038/nn.2455. 662 [49] J. H. Siegle et al., ‘Survey of spiking in the mouse visual system reveals functional hierarchy’, Nature, 663 vol. 592, no. 7852, pp. 86–92, Apr. 2021, doi: 10.1038/s41586-020-03171-x. 664 [50] J. Hegdé, ‘Time course of visual perception: Coarse-to-fine processing and beyond’, Prog. Neurobiol., 665 vol. 84, no. 4, pp. 405–439, Apr. 2008, doi: 10.1016/j.pneurobio.2007.09.001. 666 [51] P. R. L. Parker et al., ‘A dynamic sequence of visual processing initiated by gaze shifts’, Nat. Neurosci., 667 vol. 26, no. 12, pp. 2192–2202, Dec. 2023, doi: 10.1038/s41593-023-01481-7. 668 [52] S. Vreysen, B. Zhang, Y. M. Chino, L. Arckens, and G. Van den Bergh, ‘Dynamics of spatial frequency 669 tuning in mouse visual cortex’, J. Neurophysiol., vol. 107, no. 11, pp. 2937–2949, Jun. 2012, doi: 670 10.1152/jn.00022.2012. 671 [53] J. A. Harris et al., ‘Hierarchical organization of cortical and thalamic connectivity’, Nature, vol. 575, no. 672 7781, pp. 195–202, Nov. 2019, doi: 10.1038/s41586-019-1716-z. 673 [54] Q. Wang, O. Sporns, and A. Burkhalter, ‘Network analysis of corticocortical connections reveals ventral 674 and dorsal processing streams in mouse visual cortex’, J. Neurosci. Off. J. Soc. Neurosci., vol. 32, no. 675 13, pp. 4386–4399, Mar. 2012, doi: 10.1523/JNEUROSCI.6063-11.2012. 676 [55] T. Bricken and C. Pehlevan, ‘Attention Approximates Sparse Distributed Memory’, Jan. 17, 2022, 677 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint arXiv: arXiv:2111.05498. doi: 10.48550/arXiv.2111.05498. 678 [56] F. Gloeckle, B. Y. Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve, ‘Better & Faster Large Language 679 Models via Multi-token Prediction’, Apr. 30, 2024, arXiv: arXiv:2404.19737. doi: 680 10.48550/arXiv.2404.19737. 681 [57] R. D. Lange, S. Shivkumar, A. Chattoraj, and R. M. Haefner, ‘Bayesian encoding and decoding as 682 distinct perspectives on neural coding’, Nat. Neurosci., vol. 26, no. 12, Art. no. 12, Dec. 2023, doi: 683 10.1038/s41593-023-01458-6. 684 [58] Y. Zhang et al., ‘Towards a “universal translator” for neural dynamics at single-cell, single-spike 685 resolution’, Jul. 23, 2024, arXiv: arXiv:2407.14668. doi: 10.48550/arXiv.2407.14668. 686 [59] M. Zaheer et al., ‘Big Bird: Transformers for Longer Sequences’, Jan. 08, 2021, arXiv: 687 arXiv:2007.14062. doi: 10.48550/arXiv.2007.14062. 688 [60] O. Simeone, ‘Machine Learning for Engineers’, Higher Education from Cambridge University Press. 689 Accessed: Jul. 11, 2024. [Online]. Available: 690 https://www.cambridge.org/highereducation/books/machine-learning-for-691 engineers/7FD8622836CAFCF5EDB169E7DC8A1ED4 692 [61] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, ‘Deep Variational Information Bottleneck’, Oct. 693 23, 2019, arXiv: arXiv:1612.00410. doi: 10.48550/arXiv.1612.00410. 694 [62] H. Jang, O. Simeone, B. Gardner, and A. Gruning, ‘An Introduction to Probabilistic Spiking Neural 695 Networks: Probabilistic Models, Learning Rules, and Applications’, IEEE Signal Process. Mag., vol. 36, 696 no. 6, pp. 64–77, Nov. 2019, doi: 10.1109/MSP.2019.2935234. 697 [63] D. P. Kingma and J. Ba, ‘Adam: A Method for Stochastic Optimization’, Jan. 29, 2017, arXiv: 698 arXiv:1412.6980. doi: 10.48550/arXiv.1412.6980. 699 700 701

Methods

702 Notations 703 For any two jointly distributed random vectors 𝒂 = (𝒂) , … , 𝒂-) and 𝑏 = (𝒃) , … , 𝒃-) from time 1 to 𝑇, at 704 every time-instant 𝑡, we introduce the predictive causally conditioned (PCC) distribution of 𝒂 given 𝒃 as 705 𝑝(/!,/")(𝒂! || 𝒃!) = 𝑝W𝒂! !2/" X𝒂!( ) , 𝒃!( /! ! Y. 706 (1) 707 This distribution captures the causal dependence of the 𝜏' future samples of RV 𝒂! !2/" = (𝒂!, … , 𝒂!2/" ) on 708 the past 𝜏) samples of RV 𝒃!( /" ! = (𝒃!( /" , … , 𝒃!) . If 𝜏) = 𝑇 and 𝜏' = 1, this recovers the causally 709 conditioned distribution [22]. 710 711 Problem definition 712 We study the autoencoder architecture in Figure 1A with the aim of training an encoding network to produce 713 a representation 𝒛 of the neural input 𝒙 that is informative about a reference signal 𝒓. The input 𝒙 is a neural 714 signal obtained using e.g. electrophysiology or imaging techniques. The reference signal 𝒓 represents the 715 corresponding target, such as the label corresponding to the behaviour of interest; a set of behavioural 716 variables; or the input itself for unsupervised learning. 717 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint 718 To simultaneously optimise encoder and decoder, we consider a data set comprising pairs (𝒙, 𝒓) of neural 719 inputs 𝒙 and reference signals 𝒓, jointly generated from an unknown population distribution 𝑝(𝒙, 𝒓). The 720 inputs are in the form of a discrete collection of vectors over time 𝒙 = (𝒙) , … , 𝒙-) with 𝒙! ∈ ℝ", and the 721

Reference

signals 𝒓 = (𝒓) , … , 𝒓-), with 𝒓! ∈ ℝ"#, define general target signals. Without loss of generality, 722 we assume that all signals are uniformly sampled with a fixed sampling rate 𝑑𝑡. 723 724 As seen in Figure 1A, the latent representation 𝒛 is obtained by processing the input signal 𝒙 through the 725 Hebbian encoder introduced in the main text and for which we provide mathematical details in the following. 726 Given 𝒛, the system is trained to predict signal 𝒓 using a decoding network, implemented as an ANN. 727 Following ideas from predictive coding, we maximise the causally conditioned distribution of 𝜏+ future 728 samples from the reference signal 𝒓! !2/$ = (𝒓!, … , 𝒓!2/$ ) given past input spikes 𝒙! = (𝒙) , … , 𝒙!), averaged 729 over time-steps 𝑡 = 1, … , 𝑇 as 730 1 𝑇 3 log 𝑝(𝒓𝑡 𝑡+𝜏𝑓 ||𝒙𝑡) % &'( = 1 𝑇 3 log 𝛦𝒛! 𝑝(𝒓& &*+" |𝒛&, 𝒙&, 𝒓&, ( ) . % &'( 731 (2) 732 Note that the evaluation of objective (2) is intractable for problems of practical size (see, e.g., [60]) and, as 733 we detail in the following sections, we optimise over the model weights by considering a variational 734 approximation. 735 736 Encoding Network 737 Given the architecture proposed in Figure 1A, the encoder implements a causal stochastic mapping between 738 input 𝒙 and latent signal 𝒛 that depends on a vector of parameters 𝒘7 as 739 𝑝𝒘& (𝒛||𝒙) = Z 𝑝𝒘& (𝒛! ||𝒙!) - !9) , 740 where the distribution 𝑝𝒘𝒆 (𝒛𝒕||𝒙𝒕) is obtained through the reparametrisation trick as 741 𝑝𝒘𝒆 (𝒛𝒕||𝒙) ∼ 𝒩 \𝝁W𝒇𝒘& (𝒙!)Y, 𝝈'W𝒇𝒘& (𝒙!)Y`, 742 with 𝒇𝒘𝒆 (⋅) being the output of the encoding network. Samples are obtained as 𝒛 = 𝝁W𝒇𝒘& (𝒙)Y +743 𝝈'W𝒇𝒘& (𝒙)Y ⊙ 𝛜 with 𝛜 ∼ 𝒩(0, 1). The latent signal 𝒛 is fed to a decoder, whose role is to predict the 744

Reference

signal 𝒓. Both blocks 𝜇(⋅) and 𝜎(⋅) are implemented as a single linear layer. 745 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint 746 Maximum likelihood optimisation 747 In order to address the maximisation of likelihood (2), we adopt a variational formulation that relies on the 748

Introduction

of a decoding network that implements at every time-step 𝑡 = 1, … , 𝑇 the PCC distribution 749 𝑞-# (+$,+") (𝒓& || 𝒛&) = 𝑞𝒘# @𝒓& & *+" |𝒓& , ( , 𝒛&, +$ & A 750 (3) 751 between latent representation 𝒛 and reference signal 𝒓 [23], [61]. The decoder (3) is causal, with memory 752 given by integer 𝜏* > 1 and is parametrised by a vector 𝒘;. 753 In practice, we implement the decoder as a model 𝑞𝒘( \𝒓! ! 2/$ g𝒓! – ) , 𝒛!( /) ! ` : = 𝑞𝒘( (𝒓! !2/$ | 𝒛!( /) ! ), where 754 𝑞𝒘( (𝒓! !2/$ | 𝒛!( /) ! ) is implemented by the output layer of an ANN with inputs given by a window 𝒛!( /) ! of 755 samples from the latent representation 𝒛. Alternatively, one can use an RNN or a probabilistic SNN [62] to 756 directly model the kernel 𝑞𝒘( \𝒓! ! 2/$ g𝒓! – ) , 𝒛!( /) ! `. 757 With such a network, we use a standard variational inequality [60] to lower bound the decoder log-likelihood 758 at each step 𝑡 = 1, … , 𝑇 759 log 𝑝W𝒓= /$ X|𝒙=Y ≥ 𝔼*𝒘& >𝒛+,-) + | 𝒙+B[log 𝑞𝒘( (/),/$) (𝒓! ||𝒛!)] + KL(𝑝𝒘& (𝒛! | 𝒙!) || 𝑝(𝒛)). 760 Averaging over time-steps, we obtain the variational predictive likelihood 761 ℒ𝒘(𝒙, 𝒓) = 𝔼*𝒘&(𝒛||𝒙) s1 𝑇 t log 𝑞𝒘( (/),/$) (𝒓!||𝒛! ) - !9) u− 1 𝑇 KL(𝑝𝒘& (𝒛||𝒓) || 𝑝(𝒛)) 762 (4) 763 where 𝒘 = (𝒘7, 𝒘𝒅), that we maximise via stochastic gradient descent (SGD) and the Adam optimiser [63] 764 using Monte Carlo gradients. 765 766 The gradient with respect to the encoder and decoder weights can be obtained via the stochastic gradient 767 variational Bayes (SGVB) technique [20] as 768 ∇𝒘ℒ𝒘(𝒙, 𝒓) = 1 𝑇 t 𝔼*(𝒙+,𝒓𝒕/𝝉𝒇)𝔼*(𝝐) w∇𝒘log 𝑞𝒘( \𝒓𝒕 𝒕2𝝉𝒇 g𝝁W𝒇𝒘𝒆 (𝒙𝒕)Y + 𝝈'W𝒇𝒘𝒆 (𝒙𝒕)Y ⊙ 𝛜 `x - !9) . 769 (5) 770 Unbiased estimates of the gradients can be evaluated by using mini-batches ℬ ⊆ 𝒟 from the data set 𝒟, 771 along with random samples 𝛜(G) ∼ 𝒩(0, 𝐼"2 ) for 1 ≤ 𝑙 ≤ 𝐿 with 𝐿 ≥ 1. The gradients ∇𝒘 w.r.t. the 772 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint encoder and decoder weights are computed using backpropagation and automatic differentiation tools. 773 774 Online optimisation 775 The computation of gradients (5) is expensive as it requires to apply the backpropagation-through-time 776 algorithm, which scales quadratically in terms of memory consumption with the number of time-steps. To 777 alleviate memory issues during optimisation when the recording duration is large, we propose to use the 778 following approximation to perform SGVB updates at every time-step 779 ∇𝒘ℒ𝒘 \𝒙!, 𝒓𝒕 𝒕2𝝉𝒇 ` ≈ 𝔼 *H𝒙+, 𝒓𝒕 𝒕/𝝉𝒇 I 𝔼*(𝝐) w∇𝒘log 𝑞𝒘( \𝒓𝒕 𝒕2𝝉𝒇 g𝝁W𝒇𝒘𝒆 (𝒙!)Y + 𝝈'W𝒇𝒘𝒆(𝒙!)Y ⊙ 𝛜 `x, 780 which ignores the contribution of derivatives from previous time-steps 𝑡J < 𝑡. 781 782 Hebbian Self-Attention 783 We propose to apply online STDP at every time-step 𝑡 to compute a matrix 𝑨! ∈ ℝ"3× "3 in which each 784 element 𝑎KL,! defines an attention coefficient between sequences 𝑖 and 𝑗. We consider an online STDP kernel, 785 whereby coefficients for each synapse (𝑖 → 𝑗)) MK,LM" are obtained by maintaining over time a pair of pre- and 786 post-synaptic eligibility traces 𝑎KL,! NOP and 𝑎KL,! NQR= where 𝑡 ∈ {1, … , 𝑇} and N is the number of recorded neurons, 787 Without loss of generality, we assume 𝑇 to be the temporal horizon of the task. When neurons on either side 788 of the synapse fire, the corresponding traces are updated by an amount determined by learnable weights 𝒘KL,! NOP 789 and 𝒘KL,! NQR=, and exponentially decay in the absence of a spike by a value determined by scalars 𝜏KL NOP and 𝜏KL NQR=. 790 The STDP coefficient 𝑎KL,! is potentiated (resp. decreased) by the value of the pre- (resp. post-) synaptic trace 791 when the post- (resp. pre-) synaptic neuron fires. Mathematically, we obtain the coefficient 𝑎KL,! as 792 𝑎KL,! ∶= 𝑎KL,!( ) + 𝑠K,! ⋅ 𝑎KL,! NQR= − 𝑠L,! ⋅ 𝑎KL,! NOP, 793 where 𝑠K,! ∈ {0, 1} denotes the state of neuron i at time t, and 𝑎KL,! NOP and 𝑎KL,! NQR= are pre- and post-synaptic traces 794 defined as 795 𝑎KL,! NOP ∶= 𝑎KL,!( ) NOP ⋅ 1 − 𝑑𝑡 𝜏KL NOP + 𝑤KL NOP ⋅ 𝑑𝑡 ⋅ 𝑠K,! 796 and 797 798 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint 𝑎KL,! NQR= ∶= 𝑎KL,!( ) NQR= ⋅ 1 − 𝑑𝑡 𝜏KL NQR= + 𝑤KL NQR= ⋅ 𝑑𝑡 ⋅ 𝑠L,!. 799 800 All values 𝜏KL NOP, 𝜏KL NQR=, 𝑤KL NOP and 𝑤KL NQR= are learnable parameters for each synapse 𝑖 → 𝑗. 801 802 Finally, the output of the attention layer is obtained as 803 HebbianAttention(𝑨!, 𝑽) = 𝑨! . 𝑽, 804 where 𝑽 ∈ ℝ"3 × ;4 is a learnable weight matrix defined in analogy to values in conventional dot-product 805 attention [17]. 806 807 The attention mechanism can further be expanded to account for several versions of the parameter vectors 808 𝚻NOP,(S) ≔ \𝜏KL NOP,(T) ` KL , 𝚻NQR=,(S) , 𝑾NOP,(S), 𝑾NQR=,(S), and 𝑽(S) for 𝑘 = 1, … , 𝐾, with 𝐾 being the number of 809 attention heads. The resulting multi-headed attention module allows the model to jointly attend to 810 information from different representation subspaces at different positions [17], and can be obtained as 811 MultiHeadedHebbianAttention! 812 = 𝑾𝒐 - Concat(HebbianAttention(𝑨! () ), 𝑽() )), … , HebbianAttention(𝑨! (V), 𝑽(V)), 813 where 𝑾𝒐 ∈ ℝV;4× V;4 is a learnable parameter and Concat(⋅) denotes the vertical concatenation of 814 matrices. 815 816 Calcium recordings 817 When considering calcium recording data, we propose to directly use the calcium traces of each neuron as 818 eligibility traces, which removes the heavy processing that comes with spike deconvolution. Mathematically, 819 given inputs 𝒙, we have 820 𝑎KL,! NOP ∶= 𝒙K,!, 821 𝑎KL,! NQR= ∶= 𝒙L ,! 822 and 823 𝑎KL,! ∶= 𝑎KL,!( ) + 𝑎KL,! NQR= − 𝑎KL,! NOP. 824 825 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint Mouse MEC dataset 826 Briefly, activity from MEC cells of head-fixed mice allowed to run on a rotating wheel in darkness was 827 recorded using either two-photon calcium imaging or Neuropixels probes [1]. We consider a single example 828 recording provided by the authors at https://github.com/soledadgcogno/Ultraslow-oscillatory-829 sequences/tree/main, which included 464 neurons recorded over 3600.51 seconds at a rate of 7.73 Hz. 830 Following the authors’ methodology [31], we downsampled the recording by a factor of 4 down to 1.93 Hz. 831 For the prediction task shown in Figure 2, we trained both SPARKS and another autoencoder with the same 832 architecture but implementing the alternative symmetric Hebbian rule on four segments lasting 500 seconds 833 each, the first starting at 600 seconds in the recording, and leaving one out for testing. Results are evaluated 834 numerically through the binary cross-entropy 835 BCE = t 𝑟W,! ⋅ log 𝑞𝒘( (𝑟! | 𝒛!( /) ! ) + (1 − 𝑟W,!) ⋅ log(1 − 𝑞𝒘( (𝑟! | 𝒛!( /) ! )) W,! . 836 We further evaluated the quality of the reconstructions as per the oscillatory score proposed in [31]. We show 837 in Figure S 2 the PSD of reconstructions, evaluated using Welch’s method with Hamming windows of 66 s 838 (128 bins of 516 ms in each window) and 50% of overlap between consecutive windows. The horizontal lines 839 correspond to the threshold used to evaluate the presence of oscillations [31], which amounts to nine times 840 the average value of the tail of the PSD after the first peak. 841 Next, we trained SPARKS on the whole recording, starting from 600 seconds in the session as we could not 842 find oscillatory sequences prior to that. From the 2-dimensional latent embeddings hence obtained, we 843 recovered a phase signal by computing at every time-step 𝑡 the angle between a segment joining the current 844 embedding 𝒛𝒕 to the centre of the ring topology, and another joining the embedding 𝒛! to a fixed direction 845 (here, the horizontal, Figure S 2). The phase signal we hence obtain is shown in Figure S 2. 846 We sorted neurons by ordering them according to their correlation to the phase signal. 847 Finally, we evaluate the quality of each sorting by population decoding of the sorted signals (Figure S 2) and 848 plotting their PSDs (Figure S 2). We obtain the population decoding 𝒙 of spiking signal 𝒙 as 849 𝑥! = 1 𝑁 t 𝒙K,!. K 850 Similarly to the oscillatory score proposed in [31], we evaluate the quality of each sorting by measuring the 851 height of the peak above 0 in the PSD. 852 853 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint Macaque reaching task dataset 854 The dataset corresponds to a single session extracted from [32]. Briefly, electrophysiological recordings were 855 taken from Area 2 of the somatosensory cortex (S1) of a rhesus macaque during an eight-direction center-out 856 reaching task. We consider a session obtained from the Neural Latents Benchmark [33] to train our model. 857 We use 1 ms bin times, and consider recordings from −100 ms to 500 ms after movement onset. Similarly to 858 [33], we consider a 40 ms lag between neural data and movement to align trials. We train SPARKS by 859 considering an encoder with no conventional attention layers, and a 64-dimensional embedding. The decoder 860 is chosen as an MLP with layer dimensions 𝑑 × 𝜏* − ; × /)2 ' × /$ ' − 𝑁X × 𝜏+, where 𝑁X corresponds to the 861 dimension of the reference signal 𝒓. Models were trained for 500 epochs. Results presented for unsupervised 862 learning were obtained by choosing 𝜏+ = 10. 863 The alternative model using conventional scaled dot-product attention operates using firing rates smoothed 864 over a 200 ms window, which was chosen after hyperparameter search. 865

Results

from [16] were shared by the authors at https://github.com/AdaptiveMotorControlLab/cebra-figures. 866 867 Allen visual coding dataset 868 The Allen visual coding dataset comprises two-photon calcium imaging and Neuropixels data recorded from 869 the primary visual cortex of mice during presentation of video scenes, and is available to download from the 870 Allen Institute [35]. We consider units sampled at random from all sessions in the Allen visual coding 871 dataset, and stack the raw spike trains to perform movie reconstruction, frame prediction or unsupervised 872 learning. 873 Each presentation of the excerpt from the movie Touch of Evil, performed at a rate of 30 kHz and lasting 874 30 s, were downsampled with a bin size of 6 ms to speed up training. Calcium recordings were kept to their 875 initial rate of 30 Hz, matching the film frame rate. For frame reconstruction, movie frames 𝐹 ∈ ℝYZZ × [ × , 876 were spatially downsampled to obtain film tensor 𝐹 ∈ ℝYZZ × [5× , 5 to reduce the computational load of the 877 technique. 878 The 1-second frame prediction accuracy was computed as 879 1 𝑇 t 1(argmaxW𝑞(𝑟! |𝑧!, 𝑥!)Y ∈ [𝑡 − 6, 𝑡 + 6]), - !9) 880 Where 1(⋅) is the indicator function that outputs 1 if its argument is true, and 0 otherwise; and the argmax 881 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint function outputs the largest index of its vector argument. 882 Distances between movie frames were computed as 𝑑𝑖𝑠𝑡W𝑓K, 𝑓LY = 1 − 𝑐𝑜𝑟𝑟(𝑓K, 𝑓L) for every pair of frames 883 W𝑓K, 𝑓L Y) MK,LMYZZ where 𝑐𝑜𝑟𝑟(⋅,⋅) is the 2-dimensional correlation normalised to [0, 1]. 884 Distances between the neural activity was computed from the tensor 𝑆 ∈ ℝ)Z × YZZ × )ZZZ × \\ of spikes over 885 the 10 movie repetitions, for the 900 frames and subsampled at 1kHz, corresponding to 33 time-steps per 886 frame. 887 The average activity per frame is computed as 𝑆+ = ∑ 𝑆W+S!W,! . Instantaneous distances between frames are 888 then computed as the 𝐿2 vector distance 𝑑𝑖𝑠𝑡 \𝑆+6 , 𝑆+] ` = ©∑ \𝑆+6S − 𝑆+7S ` ' S and normalised to [0, 1]. 889 890 Functional Hierarchy of Mouse Visual Cortical Areas 891 We consider static grating presentations from the Allen visual coding dataset. Gratings were presented for 892 250 ms in a random order, with spatial frequencies equal to 0.02, 0.04, 0.08, 0.16 and 0.32 cycles per 893 degree. We consider all presentations with a fixed orientation of 150 degrees, and all spatial phases, resulting 894 in 30 training examples and 10 test examples per spatial frequency. Examples were downsampled with a bin 895 size of 1 ms. For each repetition, we used 700 single unit recordings from each visual cortical area (VISp, 896 VISpm, VISam, VISl, VISal and VISrl), which corresponds to the lowest number of neurons collected in a 897 single area across recordings. 898 We train SPARKS via supervised learning to predict the spatial frequency of the gratings at every time-step, 899 i.e., for all 𝑡 = 1, … , 250, we set 𝑟! ∈ {0.02, 0.04, 0.08, 0.16, 0.32}. Error at time 𝑡 for spatial frequency 𝑓 900 and cortical area 𝑖 was computed as the L1 distance 901 𝑒K,! + = |𝑓 − 𝑞𝒘( (𝑟! | 𝒛!( /) ! )|. 902 Projections from cortical area 𝑖 to cortical area 𝑗 were computed as 903 projK→L = t 1W𝑒K,! + < 𝑒L,! + Y ⋅ |𝑒K,! + − 𝑒L,! + | +,! . 904 Code availability 905 Code: https://www.github.com/FrancisCrickInstitute/sparks (coming soon). 906 907 908 909 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint

Acknowledgements

910 Primary funding for this project was provided by the Biotechnology and Biological Sciences Research 911 Council (BBSRC, award ref. BB/X013472/1) and the Francis Crick Institute (award ref. CC2118). The 912 authors wish to thank O. Simeone, A. Schaefer, S. Tootoonian and M. Kollo for their discussions and 913 comments on the manuscript. 914 915 Author contributions 916 F.I. and S.S. co-supervised the work. N.S. developed the algorithm, wrote the code, performed the 917 simulations and wrote the manuscript. N.G. refined the algorithm for calcium data. F.I. and S.S. revised the 918 manuscript. 919 920

Materials

& Correspondence 921 Requests should be addressed to N. Skatchkovsky. 922 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted August 16, 2024. ; https://doi.org/10.1101/2024.08.13.607787doi: bioRxiv preprint 923 Figure S 1(A) Detailed illustration of the computation of the Hebbian attention coefficients via STDP. (B) Schematic of 924 the conventional dot-product attention layer, adapted from [17]. 925 926 !!",%&'(=0!!",%&)*+=0!!",%=0!!",,&'(=!!",%&'(⋅1−,-! +0⋅)!"&'( !!",,&)*+=!!",%&)*+⋅1−,-! +*⋅)!"&)*+ !!",,=!!",%−*⋅!!",,&'(+0⋅!!",,&)*+ !!",.&'(=0+*⋅)!"&'( !!",.&)*+=)!"&)*+⋅1−,-! +0 !!",.=0−0⋅!!",.&'(+*⋅!!",.&)*+ !!",/&'(=)!"&'(⋅1−,-! !!",/&)*+=)!"&)*+⋅1−,-! .+*⋅)!"&)*+ !!",/=!!",.−*⋅!!",/&'(+0⋅!!",/&)*+ ‘pre-syn.’ neuron +,-=1)!"&'(&&',4*+,&&',4*-./‘post-syn.’ neuron j--&&',4 -)!"&)*+AInputsInput embedding+%+%3%4MatMulMatrix multiplicationScale &SoftMaxPositional encodingBSTDP coefficient 927 Figure S 2 (A) Power spectral density (PSD) of the phase of the oscillation of the signal, as defined in [31]. (B) Latent 928 embeddings obtained with the alternative symmetric rule. (C) Latent embeddings obtained with non-learnable parameters 929 𝜏34!56,𝜏34!7$&,𝑤34!56 and 𝑤34!7$& for all synapses 𝑖→𝑗. (D) Illustration of the angle used to recover the phase of the 930 oscillations from the latent embeddings. (E) Phase signal recovered from the latent embeddings. (F) Oscillatory signals 931 recovered by population decoding of the spike trains with different sortings of the neurons. (G) Corresponding PSDs. 932 933 934 935 936 937 Figure S 3 (A) Example latent embeddings obtained from training models to decode the position of the animal’s hand. 938 From left to right: Alternative model with conventional dot-product attention, VAE with a traditional ANN encoder, VAE 939 with a recurrent neural network (RNN) encoder. (B) Comparison of 𝑅# scores for prediction of hand position. Orange 940 lines represent the median over 5 runs, and each dot an individual run. From left to right: SPARKS with a 3-dimensional 941 latent space, alternative model with conventional dot-product attention for a 3-dimensional latent space, CEBRA with a 942 16-dimensional latent space, VAE with a traditional ANN encoder for a 3-dimensional latent space, VAE with a recurrent 943 neural network (RNN) encoder for a 3-dimensional latent space. (D) Prediction 𝑅# scores for unsupervised learning 944 across prediction window length 𝜏". Shaded areas represent standard deviation over 5 individual runs. (E) Similarity 𝑅# 945 scores for supervised learning across different reference signals. (F) Consistency vs. prediction 𝑅# scores for the same 946 target variables. Each value represents the mean over 5 individual runs. (G) Consistency of latent embeddings for 947 supervised learning across hyperparameter values. r: Pearson correlation coefficient. p: p-value, t-test. 948 949 950 Figure S 4 (A) 1s decoding accuracy as a function of number of neurons in a session. Error bars represent standard 951 deviation across 5 repetitions. r: Pearson correlation coefficient. (B) Latent embeddings averaged across examples from 952 blocks 1 and 2. (C) 1s decoding accuracy as a function of number of randomly sampled neurons. Error bars represent 953 standard deviation across 5 repetitions. (D) Normalised Euclidian distance between the average firing rate per frame in 954 block 1 across 1,000 neurons. (E) Similarity 𝑅# scores between the latent embeddings obtained for different example 955 sessions, averaged over 5 repetitions and 10 movie presentations from block 1. (F) Evolution of the 1s decoding accuracy 956 during fine-tuning on a single session, normalised to the decoding accuracy of a model trained on this session only. (G) 957 Latent embeddings averaged across examples from blocks 1 for 5 example sessions. 958 AB C Block 1 framesBlock 2D Session 0Session 1Session 2Session 3Session 4 E GF

Text is read by the "Ask this paper" AI Q&A widget below. Extraction quality varies by source — PMC NXML preserves structure cleanly, OA-HTML may include some navigation residue, and OA-PDF can have broken hyphenation. The publisher copy (via DOI) is the canonical version.

My notes (saved in your browser only)

⚙ Ask this paper AI returns verbatim quotes from the full text · source: oa-pdf ⓘ

Answers must be backed by verbatim quotes from this paper's full text. Hallucinated quotes are dropped automatically; if no verbatim passage answers the question, we say so. How this works

Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2024) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00
unpaywall: last seen: 2026-05-21T05:10:58.409756+00:00

License: CC-BY-NC-ND-4.0