Abstract

Humans readily extract statistical regularities from experience, yet natural environments require
flexible adaptation when associative structures shift across changing contexts, often without
warning. Across two experiments, we show that humans can incidentally learn overlapping and
conflicting visual associations even when contexts dynamically alternate and remain unsignaled
or only minimally cued. To probe the computational mechanisms supporting this adaptive
capacity, we trained recurrent neural networks with gated recurrent units on the same statistical
learning task without providing any explicit context information. These models spontaneously
developed distributed internal representations that robustly separated conflicting associations and
supported rapid adaptation to latent context shifts. Critically, we show that these distributed
representations, strongly shaped by the model’s initial weight configuration, played a key role in
preventing catastrophic interference between contexts. Together, these behavioral and
computational results significantly advance our understanding of how humans and artificial
systems can successfully learn and flexibly retrieve context-dependent associations under
challenging conditions.

Introduction

Many everyday experiences unfold in structured, predictable ways, with events that recur over
time in stable patterns. Internalizing these regularities allows anticipation of future occurrences,
facilitating efficient information gathering, decision-making, and behavioral adaptation. It follows
that the human brain is fundamentally oriented toward predicting the upcoming future based on
recent events.1–3 This predictive ability helps conserve cognitive resources by reducing the need
for continuous, effortful learning once patterns have been identified.4 However, the world is rarely
static: associations often vary across contexts.5 To support adaptive behavior, the brain is
thought to engage in context-dependent learning of these regularities and associations for flexible
predictions as environmental conditions shift.6 For example, navigating a daily commute relies on
learning the timing and location of traffic congestion, and expectations for social interaction may
differ when a friend is encountered at work versus at a party. In both cases, prior experience
supports the formation of context-bound predictions that guide perception and behavior.

Humans have an innate ability for statistical learning, allowing them to spontaneously discover
regularities and associations. This process extracts spatial and temporal regularities from sensory
input through passive exposure, without explicit instruction or external rewards.7 Statistical
learning is proposed to support a wide range of cognitive functions, including language
acquisition, visual perception, object recognition, and social cognition.9,10 Empirical studies
demonstrate that individuals can detect regular patterns in continuous streams of stimuli across
visual,11 auditory,12 and tactile13 modalities, in the absence of explicit transition cues and
instructions.

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted March 18, 2026. ; https://doi.org/10.64898/2026.03.17.712206doi: bioRxiv preprint
Despite the rich literature on statistical learning, most research has focused on simple, highly
reliable associations, such as detecting short sequences of objects or sounds. However, in
natural environments, context plays a critical and often unobserved role in shaping how
associations are formed. In animal learning research, for example, association-based behavior is
known to be highly context-specific: extinguished fear responses return when animals are tested
outside the extinction setting.14 Moreover, following extinction or reversal learning, animals
reacquire original contingencies more rapidly than during initial learning,15,16 suggesting that prior
contingencies are retained as latent knowledge in memory rather than being overwritten by new
learning. Cognitive control processes are thought to underpin the behavioral flexibility afforded by
suppressing previously useful but no longer relevant responses, allowing learners to pivot
between contexts and contingencies as the environment demands.17 Notably, most studies in
the animal learning literature involve explicit reinforcement (e.g., reward or punishment), whereas
statistical learning occurs incidentally without feedback, instruction, or overt motivation.

Although a few studies have explored the statistical learning of regularities that depend on a
latent context or environment (e.g., 18–22), it remains unclear whether individuals can incidentally
learn and retrieve context-dependent temporal associations without explicit perceptual context
cues, reinforcement, or instruction. Analogous mechanisms have been proposed in sensorimotor
learning, an instance of implicit learning where the brain is thought to infer context shifts and
partition experience into distinct memories.23 Here, we test whether people can acquire two
distinct sets of temporal associations instantiated with an overlapping pool of visual objects,
where most associations are in direct conflict between contexts. For example, in Context A,
Object X is followed by Object Y, whereas in Context B, the same Object X is followed by Object
Z. Successful learning requires participants to flexibly update their expectations according to the
active context inferred from recent sequence history. We examine how well human learners can
discover these context-dependent associations without any external context cue – where context
is embedded only in the pattern of transitions – using both offline testing and online learning
measures.

To explore how these context-dependent representations might emerge from experience, we
trained neural network models on the same behavioral task. We then identified the model that
best matched human performance across the experimental conditions and analyzed its
hidden-layer activations to generate testable hypotheses about analogous representations in the human
brain. Deep neural networks have proven effective at capturing lower-level sensory processing,24
and recent perspectives advocate for extending these approaches to the study of higher-order
cognition, including the representation of abstract knowledge.25 However, a common limitation of
these modeling efforts is that these networks are typically trained on far more data than human
learners (see 26 for a review), limiting the validity of direct comparisons. Additionally, prior
modeling work frequently incorporates strong inductive biases that render context artificially
explicit, either by feeding an unambiguous context signal into the input22,27 or by augmenting
network architecture with designated units or computation modules.28,29 These modifications,
while effective, constrain opportunities to observe how context discovery might emerge
spontaneously. Inspired by Elman’s finding that simple recurrent neural networks can capture
both short- and long-range dependencies30 and echoing recent calls to avoid hard-wiring
solutions in cognitive modeling,31 we used minimally structured architectures that omitted context
signaling and specialized inference modules. This design allowed us to examine how networks
discover and represent latent task structure through sequence exposure alone. Finally, given
evidence that weight initialization scale can influence learning trajectories in neural
networks,27,32,33 we systematically varied the initial weight magnitudes of the networks to assess
how this factor affects their ability to learn and distinguish context-dependent associations.

The goal of the neural network modeling is to generate hypotheses about how the brain might
represent context in statistical learning. The hippocampus uses two dominant neural
representation strategies to support memory of individual experiences and to extract regularities
across experiences34,35: sparse and distributed coding. Sparse codes, observed in the dentate
gyrus and CA2/3 subregions, involve highly selective activation of a small subset of units in
response to a given input.36 Distributed representations, observed in the CA1 subregion, encode
inputs across overlapping patterns of activity spanning the neural population.37 We specifically
seek to find evidence of each of these strategies in the hidden layer activations of neural
networks that successfully represent context-dependent associations.

Overall, this study aims to advance our understanding of context-dependent statistical learning by
examining whether humans can learn and retrieve multiple conflicting statistical structures within
highly overlapping stimulus sets. By manipulating the presence of visual contextual cues, we
assess whether explicit signals of context shifts facilitate learning and whether individuals can still
learn context-dependent associations in their absence. In parallel, we use neural network models
trained on the same task to test whether artificial systems can account for human-like learning
dynamics, offering insight into the computational mechanisms that may support flexible,
context-sensitive learning in the brain.

Results

Participants performed a context-dependent statistical learning task in which they viewed a
continuous stream of 1,600 object images (Fig. 1A). Their only task was to indicate whether an
“×” or “+” was embedded on each object (Fig. 1C), a perceptual judgment designed to maintain
attention and allow tracking of online learning via reaction times (RTs). Unbeknownst to
participants, the image stream was structured into object pairs specific to one of two distinct
contexts. Although they were told that parts of the sequence might become familiar over time,
they received no information about the underlying structure or the existence of multiple contexts.
Each context defined a unique set of temporal associations between a largely overlapping object
set, such that the probability of one object following another depended on the active context (Fig.
1B).

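The stream structure described above and in Fig. 1 (11 objects, five pairs per context, a context switch every 50 pairs, 800 pairs in total) can be illustrated with a minimal sketch. The object labels, the uniform random sampling of pairs within a context, and the function name `build_stream` are assumptions made for illustration; the source does not specify pair-ordering constraints.

```python
import random

# Hypothetical labels for the 11 objects (first item -> second item).
# Row 1: context-independent pair; rows 2-4: same objects, second item
# reassigned across contexts; row 5: context-specific second item.
CONTEXT_A = {"I1": "I2", "X1": "Y1", "X2": "Y2", "X3": "Y3", "X4": "UA"}
CONTEXT_B = {"I1": "I2", "X1": "Y2", "X2": "Y3", "X3": "Y1", "X4": "UB"}


def build_stream(n_pairs=800, pairs_per_run=50, seed=0):
    """Return (objects, context labels) for a stream of n_pairs pairs."""
    rng = random.Random(seed)
    stream, contexts = [], []
    for p in range(n_pairs):
        # Contexts alternate A, B, A, ... every `pairs_per_run` pairs.
        name = "A" if (p // pairs_per_run) % 2 == 0 else "B"
        mapping = CONTEXT_A if name == "A" else CONTEXT_B
        # Assumption: each pair is drawn uniformly at random, so the
        # transition between pairs is unpredictable.
        first = rng.choice(sorted(mapping))
        stream.extend([first, mapping[first]])
        contexts.extend([name, name])
    return stream, contexts


if __name__ == "__main__":
    objects, labels = build_stream()
    print(len(objects), objects[:6], labels[:6])
```

In this sketch the context label is returned only for bookkeeping; a learner, human or model, would see the object stream alone.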
Following the learning phase, participants completed a two-alternative forced choice (2AFC) test.
Because objects appeared in both contexts, the correct association on a given trial depends on
the active context. Accordingly, each test trial began with a six-object sequence composed of
three object pairs from a single, consistent context, followed by the first item of a test pair (Fig.
1E). Participants were then tasked with choosing which of two objects should come next (Fig. 1F).
Context-independent trials assessed knowledge of the context-independent pair.
Context-dependent trials consisted of two types: in direct-conflict trials, the lure was the object paired with
the test cue in the other context; in indirect-conflict trials, the lure was an object not paired with
the test cue in either context. After each choice, participants rated their confidence on a 1-4 scale
(Fig. 1G).

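The two context-dependent trial types can be made concrete with a short sketch. The object labels, the sampling of lead-in pairs, and the pool from which indirect-conflict lures are drawn are all assumptions for illustration, not details given in the text.

```python
import random

# Hypothetical pair tables (first item -> second item), 11 objects in total.
CONTEXT_A = {"I1": "I2", "X1": "Y1", "X2": "Y2", "X3": "Y3", "X4": "UA"}
CONTEXT_B = {"I1": "I2", "X1": "Y2", "X2": "Y3", "X3": "Y1", "X4": "UB"}


def make_2afc_trial(context, cue, kind, rng):
    """Build one trial: three lead-in pairs, a test cue, a target and a lure."""
    own, other = (CONTEXT_A, CONTEXT_B) if context == "A" else (CONTEXT_B, CONTEXT_A)
    target = own[cue]
    if kind == "direct":
        # Lure is the object that follows the cue in the other context.
        lure = other[cue]
        if lure == target:
            raise ValueError("cue has the same successor in both contexts")
    elif kind == "indirect":
        # Lure is a second-position object not paired with the cue in
        # either context (assumption: drawn from second-position objects).
        seconds = set(own.values()) | set(other.values())
        lure = rng.choice(sorted(seconds - {own[cue], other[cue]}))
    else:
        raise ValueError("kind must be 'direct' or 'indirect'")
    # Assumption: lead-in pairs come from the tested context and exclude
    # the test pair itself.
    firsts = rng.sample([f for f in sorted(own) if f != cue], 3)
    lead_in = [obj for f in firsts for obj in (f, own[f])]
    return {"sequence": lead_in + [cue], "target": target, "lure": lure}


if __name__ == "__main__":
    print(make_2afc_trial("A", "X1", "direct", random.Random(0)))
```

Only the lead-in sequence distinguishes a Context A trial from a Context B trial for the same cue, which is why direct-conflict trials are diagnostic of context use.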
In Experiment 1 (Unsignaled), n = 50 participants completed the task without any explicit perceptual
context cue. In Experiment 2 (Signaled), a separate group of n = 50 participants completed the
same task but with a visual context cue: a colored border (white or black) surrounding each object,
corresponding to the active context (Fig. 1D); this border was present during both the learning phase
and the 2AFC test. Accuracy across the learning phase for the perceptual task was 92.2% for Expt.
1 and 92.7% for Expt. 2, indicating that participants attended to the stimuli during learning.

Figure 1. Experimental overview. (A) Visualization of the learning phase. Participants viewed a
uniformly paced sequence of objects separated by brief fixation periods. Each object appeared for
1200 ms with a 450 ms interstimulus interval. The sequence was organized according to the
temporal pair structure dictated by one of two contexts (Context A and Context B), which switched
every 50 pairs. The orange and green backdrops are shown for illustrative purposes only.
Participants performed four blocks of 200 pairs each, separated by short breaks. (B) Sample object
assignments to context pair structures comprising 11 unique objects. The context-independent pair
is the same for both contexts, as shown in the first row; three of the context-dependent pairs consist
of the same object set, with the assignment of the second pair position differing between contexts,
as shown in rows 2-4; and one context-dependent pair has a context-specific object in the
second pair position, as shown in the last row. (C) Example of an object embedded with “+” or “×”.
Participants were tasked with making a button-press response to indicate which symbol each object
contained; object-symbol mapping was held constant throughout the experiment. (D) Differentiation
of the two experiments: In Expt. 1 (Unsignaled), no context cues were shown and thus context
switches were entirely latent (left); in Expt. 2 (Signaled), context was indicated with a white or black
border around the object (right). (E) 2AFC test procedure: Example of the 6-item (3-pair) sequence
leading up to the test cue of a 2AFC trial. (F) Immediately following the test cue, participants chose
which of two candidate objects comes next in the sequence. This example is a direct-conflict
context-dependent trial, in which the lure corresponds to the object paired with the test cue in the
other context. (G) After each choice, participants made a confidence rating.

Behavioral evidence of context-dependent statistical learning
We observed evidence of context-dependent statistical learning, with significantly above-chance 2AFC
performance for both contexts (one-sample t-tests, Holm-Bonferroni corrected for three tests, all p < 0.001) (Fig.
2). When considering direct- and indirect-conflict 2AFC trials separately, we found above-chance
accuracy for both trial types (all one-sample t-tests p < 0.05; SI Appendix, Fig. S1). A mixed-design
ANOVA was conducted to examine the effects of experiment (Unsignaled vs. Signaled context,
between-subjects) and context-dependence (context-independent vs. context-dependent trials,
within-subjects) on 2AFC accuracy. There was a significant main effect of context-dependence
(F(1, 98) = 17.52, p < 0.001), reflecting higher performance on context-independent than
context-dependent trials. The main effect of experiment was not significant (F(1, 98) = 1.93, p = 0.17), nor
was the interaction between experiment and context-dependence (F(1, 98) = 3.07, p = 0.08). We
used Bayesian estimation to assess equivalence of context-dependent 2AFC accuracy between
experiments. The posterior distribution of the mean difference was centered near zero (mean =
0.38%, 95% HDI [-4.1, 4.6]). Approximately 96.5% of the posterior mass fell within the predefined
region of practical equivalence (ROPE) of [-5%, 5%], providing evidence that the two experiments
yielded equivalent performance. This equivalence suggests that the border cue may have been too
subtle to boost context-dependent learning or that explicit contextual cues are unnecessary to
foster context-dependent learning beyond the contextual information that can be ascertained from
recent sequence history in this paradigm. Additional analyses of confidence ratings and
performance on remaining test tasks are reported in SI Appendix, section S1. These analyses
reveal that most participants showed no explicit awareness of the temporal pair structure.

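The ROPE summary reported above reduces to two quantities computed from posterior samples of the mean difference: the share of samples inside the ROPE and the narrowest interval holding 95% of them. The sketch below assumes such samples are already available (how the posterior was estimated is not described here) and the function name `rope_summary` is hypothetical.

```python
import math


def rope_summary(samples, rope=(-5.0, 5.0), mass=0.95):
    """Return (share of samples inside the ROPE, highest-density interval).

    `samples` are posterior draws of the between-experiment difference in
    percentage points. The HDI is taken as the narrowest window that
    contains `mass` of the sorted draws.
    """
    xs = sorted(samples)
    n = len(xs)
    in_rope = sum(rope[0] <= x <= rope[1] for x in xs) / n
    k = math.ceil(mass * n)  # number of draws the interval must contain
    # Slide a window of k draws and keep the narrowest one.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return in_rope, (xs[start], xs[start + k - 1])


if __name__ == "__main__":
    # Synthetic draws for illustration only, not the study's posterior.
    demo = [x / 10 for x in range(-60, 61)]
    print(rope_summary(demo))
```

A large share of posterior mass inside the ROPE supports practical equivalence, whereas a non-significant ANOVA effect alone would not.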
Figure 2. 2AFC test performance. Bar height reflects group average 2AFC accuracy (% correct)
for Context A questions (left bar, orange), Context B questions (middle bar, green), and
context-independent questions (right bar, gray). Note that Context A and Context B correspond to the first
and second contexts, respectively, used during the learning phase. Each dot reflects the accuracy
for one participant, with lines connecting a participant’s performance across the two contexts.
Results are plotted separately for Unsignaled and Signaled conditions on the left and right, respectively.
Asterisks indicate significant deviation from chance performance (50%; horizontal line). ***p<0.001.

We also found evidence of an online learning effect using participants’ reaction times during the
learning phase when they judged whether each object contained an “×” or “+” (Fig. 1C). Because
these markers were consistently associated with their corresponding objects across learning, faster
responses could reflect memory-based predictions about the identity of upcoming objects,
consistent with rapid adaptation to temporal statistics in the sequence. We expected that over the
course of learning, knowledge of the temporal pair structure would facilitate faster, anticipatory
responses to the second item of each pair than to the first item of each pair, which follows a random
transition between pairs. Because RTs generally improve throughout an experiment,38 we
measured online learning as the second-item RT subtracted from the first-item RT, where a positive
value indicates an anticipation effect, and a negative value reflects possible interference from
context switches. Mean reaction times for each pair position are reported in SI Appendix, Table S1.

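The online learning measure is a simple difference score per block, and a linear trend across the four blocks can be tested on per-participant contrast scores. The sketch below is one way to compute both; the contrast weights (-3, -1, 1, 3) and the one-sample t-test on contrast scores are assumptions, chosen because they match the single-group degrees of freedom reported for these trends.

```python
from math import sqrt
from statistics import mean, stdev

# Assumption: standard linear contrast weights for four equally spaced blocks.
LINEAR_WEIGHTS = (-3, -1, 1, 3)


def rt_difference_by_block(trials):
    """trials: iterable of (block, pair_position, rt_ms), position in {1, 2}.

    Returns first-item RT minus second-item RT for each block, so positive
    values mean faster (anticipatory) responses to the second item.
    """
    cells = {}
    for block, position, rt in trials:
        cells.setdefault((block, position), []).append(rt)
    blocks = sorted({block for block, _ in cells})
    return [mean(cells[(b, 1)]) - mean(cells[(b, 2)]) for b in blocks]


def linear_trend_t(per_participant_diffs):
    """One-sample t-test of per-participant linear contrast scores vs. zero."""
    scores = [sum(w * d for w, d in zip(LINEAR_WEIGHTS, diffs))
              for diffs in per_participant_diffs]
    n = len(scores)
    t_value = mean(scores) / (stdev(scores) / sqrt(n))
    return t_value, n - 1  # t statistic and degrees of freedom


if __name__ == "__main__":
    toy = [(1, 1, 500), (1, 2, 520), (2, 1, 500), (2, 2, 480)]  # toy RTs
    print(rt_difference_by_block(toy))
```

Using a within-block difference rather than raw RT controls for the general speed-up that occurs with practice.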
Figure 3. Reaction time differences reveal trajectory of online learning. Reaction time (RT)
difference between responses to objects in the first (item 1) and second (item 2) pair position. A
positive value on the y-axis indicates an anticipation effect, plotted for each block of the learning phase
(x-axis). Average RT difference with standard error of the mean (shaded) for context-independent
pairs in gray and for context-dependent pairs in blue. Linear trend significance is indicated in the same
color scheme: ***p<0.001; **p<0.01; *p<0.05.

For context-independent pairs (Fig. 3; gray), we found a significant linear trend of RT differences
across blocks in the Unsignaled experiment (t(49) = 2.70, p = 0.009), suggesting increasing
anticipatory learning over time. However, no such trend was observed in the Signaled experiment
(t(49) = 0.70, p = 0.49), where RT differences appeared to stabilize after the first block. For
context-dependent pairs, both experiments showed a significant linear increase in RT difference
across blocks (Unsignaled: t(49) = 4.36, p < 0.001; Signaled: t(49) = 2.59, p = 0.013). However,
unlike the context-independent pairs, RTs for the predictable item 2 objects in the Unsignaled
experiment were initially slower than those for the first, unpredictable items (negative RT effect) and
approached equivalence by the final block. This slowing earlier in the experiment may reflect
interference from frequent context switches: participants had to suppress the prediction under the
previously active context, which would be especially demanding during the early blocks of
training. This effect was slightly ameliorated in the Signaled experiment, suggesting that participants
may have been able to integrate the border contextual cue to facilitate online context-dependent
learning. Despite this initial disadvantage for second item responses, the online learning measure
increased over time, reaching its highest average in the final block. A mixed-design ANOVA on
the RT difference score with experiment as a between-subjects factor and block as a
within-subjects factor showed no main effect of experiment (F(1, 98) = 1.48, p = 0.23).

Neural network weight initialization influences context-dependent learning
Having established that humans can spontaneously learn context-dependent associations from
exposure alone, we next turned to a computational account of this behavior using artificial neural
network models. Our goals were to test whether these models could similarly discover the task’s
latent structure without context cues and, critically, to characterize the nature of the emergent
representations that give rise to the context-dependent gating of associative predictions,
examining how specific model parameters shape this capacity.

First, we determined that recurrent neural networks with gated recurrent units (GRUs) learned the
task more effectively than other network architectures, including feedforward networks and
recurrent networks without gated units (see model comparison details in SI Appendix, section
S2). Next, we trained GRU models on the same amount of sequence exposure as human
participants. Models featured a 150-node hidden layer and were trained to predict the next item in
the sequence using one-hot encoded object representations for both inputs and outputs (Fig. 4A).
Critically, models received no explicit context information, requiring them to discover the latent
structure from sequence statistics to make accurate predictions. As with humans, learning was
evaluated with a 2AFC test. Model weights were frozen after training, and each 2AFC trial
presented the model with a series of seven objects (i.e., a three-pair sequence and a test cue).
The model then “selected” the next object in the sequence between two options, with its choice
determined by the object with the higher predicted probability.

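The model's input coding and 2AFC decision rule can be sketched without the network itself. Below, `output_logits` stands in for the 11 output activations produced after the seventh object; the softmax step and the handling of ties are assumptions, since the text states only that the option with the higher predicted probability is chosen.

```python
import math

N_OBJECTS = 11  # one input and one output unit per object


def one_hot(index, size=N_OBJECTS):
    """One-hot vector for the object at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def softmax(logits):
    """Numerically stable softmax over the output units (assumed readout)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def choose_2afc(output_logits, target_idx, lure_idx):
    """Pick whichever of the two options has the higher predicted probability.

    Returns (chosen index, whether the choice was correct). Assumption: a
    tie is scored as choosing the lure.
    """
    probs = softmax(output_logits)
    choice = target_idx if probs[target_idx] > probs[lure_idx] else lure_idx
    return choice, choice == target_idx


if __name__ == "__main__":
    logits = [0.0] * N_OBJECTS
    logits[3], logits[5] = 2.0, 1.0  # toy activations
    print(choose_2afc(logits, target_idx=3, lure_idx=5))
```

Because only the two candidate outputs are compared, the rule mirrors the forced choice given to participants rather than scoring the model's top prediction over all 11 objects.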
Figure 4. GRU model’s 2AFC performance by weight variance. (A) Visualization of neural
network architecture comprised of 11 input units, a single GRU layer with 150 units, and 11
output units. (B-D) 2AFC accuracy (y-axis) on context-dependent test trials for GRU models with
weights initialized with increasing variance along the x-axis, color-coded by question category. (B)
2AFC performance on Context A (orange), Context B (green), overall context-dependent (blue)
and context-independent (red). Chance performance (50%) indicated with gray horizontal line.
(C-D) 2AFC performance on individual contexts, visualized for overall as well as direct-conflict (dark
coloring) and indirect-conflict (light coloring) trial subsets. Significant one-sample t-tests from
chance (Bonferroni-corrected for eight comparisons) indicated with horizontal lines at top of plot,
color-coded in the same way. Mean human performance on direct-conflict trials of each context
indicated by the dashed horizontal black line. (C) Context A. (D) Context B. (E) Absolute
difference in direct-conflict 2AFC performance between human group average and model group
average for each weight initialization configuration. Bar height reflects the summed absolute
difference in direct-conflict 2AFC performance for Context A (dark orange) and Context B (dark green).

We systematically varied the bounds of the uniform distribution used to initialize model weights to
evaluate whether greater initial weight variance would accelerate convergence, motivated by prior
findings that initialization in neural networks can strongly influence learning dynamics.39,40
Low-variance initialization is commonly used as the default in neural networks. However, it remains
unclear whether this default choice affects a model’s capacity to learn latent structures in the
data. To address this, we tested a wide range of initialization variances. For each weight variance
initialization condition, we trained and tested 50 independent models and report the average
performance.

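The link between the bounds of a uniform initializer and the resulting weight variance can be checked directly: weights drawn from U(-b, b) have variance b²/3. The sketch below assumes symmetric bounds and a hypothetical 150 × 150 weight matrix; whether the condition values reported in this section denote the bound or the variance is not stated here, so the value 0.6 is used only as an example bound.

```python
import random
from statistics import pvariance


def init_uniform(n_rows, n_cols, bound, rng):
    """Weight matrix with entries drawn from U(-bound, bound)."""
    return [[rng.uniform(-bound, bound) for _ in range(n_cols)]
            for _ in range(n_rows)]


def uniform_variance(bound):
    """Analytic variance of U(-bound, bound): (2 * bound) ** 2 / 12."""
    return bound ** 2 / 3.0


if __name__ == "__main__":
    rng = random.Random(0)
    weights = init_uniform(150, 150, bound=0.6, rng=rng)  # hypothetical shape
    flat = [w for row in weights for w in row]
    print(pvariance(flat), uniform_variance(0.6))
```

Widening the bound therefore increases the initial weight variance quadratically, which is worth keeping in mind when comparing conditions on a linear axis.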
Across weight initialization conditions, models with low to moderate initialized weight variance
achieve perfect accuracy on context-independent trials, demonstrating their ability to learn stable,
non-contextual associations (Fig. 4B). However, as variance of initial weights increases,
performance steadily declines, highlighting how excessive initial weight variance introduces
noise, disrupting the model’s ability to extract consistent patterns from the sequence.

The models’ learning of context-dependent associations – where context-specific conflicts must
be resolved – reveals more complex dynamics. The relationship between initialized weight
variance and context-dependent accuracy is non-monotonic, with the highest performance
demonstrated by models with moderate weight initialization variance in the range 0.4-0.6
(Fig. 4B). Low-variance models (0.08-0.2) demonstrate around 90% accuracy on Context B, the
context which the model was most recently processing at the end of training before model
weights were frozen, compared to below 60% accuracy for Context A, the context which was
previously learned and conflicted with the more recently learned associations. Models in the
high-variance range (1.0-1.4) show a steady decline in accuracy for both contexts as initialized
weight variance increases, indicating that their representations may be too diffuse or unstable,
preventing them from effectively differentiating between contexts. Notably, this non-monotonic
relationship between 2AFC accuracy and initialized weight variance was preserved when models
were trained using input representations that reflected the perceptual features of the objects (as
opposed to one-hot vectors), derived from computer vision models such as AlexNet (SI Appendix,
section S3). The same pattern held when the context-dependent pair that included a
context-unique second item was excluded, such that all pairs comprised items that appeared in both
contexts (SI Appendix, section S4). This result helps rule out the possibility that recent exposure
to particular items served as a context cue, strengthening the interpretation that context is
inferred from recent sequence history.

Breaking down performance into direct-conflict and indirect-conflict trials reveals notable
differences in model learning. Direct-conflict trials (where the lure object is the correct answer for
the other context) are the most diagnostic test of context-dependent learning as they place
associations from different contexts in direct competition, making accurate performance
dependent on the use of contextual information to disambiguate the correct response. Only
models initialized with a weight variance of 0.6 achieved above-chance performance on Context
A direct-conflict trials (t(49) = 2.81, p = 0.004), although this came at the expense of reduced,
but still significant, accuracy on Context B direct-conflict trials (Fig. 4D). Low-variance models
perform near floor on Context A direct-conflict trials, rendering their high overall
context-dependent accuracy misleading as it reflects only strong performance on indirect-conflict Context
A questions and mastery of Context B. In contrast, high-variance models show no advantage for
Context B direct-conflict trials, with performance on direct-conflict questions around chance for
both contexts.

To identify which weight initialization variance best approximated human-like behavior, model
performance on Context A and Context B direct-conflict trials was compared to human data.
Human accuracy was averaged across the Unsignaled and Signaled experiments as no
significant difference was observed between them. Fig. 4E visualizes the absolute difference in
2AFC performance between models and humans on direct-conflict trials for both contexts (human
performance indicated by the dashed black line in Fig. 4C, 4D). Initialized weight variance of 0.6
produced the clearest match to human performance, forming an elbow in the plot and achieving
above-chance accuracy across all question sets.

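The comparison plotted in Fig. 4E amounts to summing, over the two contexts, the absolute gap between model and human direct-conflict accuracy. The following sketch computes that score for each configuration. All accuracies in the demo are toy numbers with generic labels, not values from this study, and picking the minimum is a simplification of the visual elbow criterion described above.

```python
def summed_abs_difference(model_acc, human_acc):
    """Sum over Context A and B of |model accuracy - human accuracy|."""
    return sum(abs(model_acc[ctx] - human_acc[ctx]) for ctx in ("A", "B"))


def closest_configuration(models, human_acc):
    """Return the configuration with the smallest summed gap, plus all scores."""
    scores = {name: summed_abs_difference(acc, human_acc)
              for name, acc in models.items()}
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    # Toy numbers for illustration only; NOT accuracies from this study.
    toy_human = {"A": 0.60, "B": 0.70}
    toy_models = {
        "config_1": {"A": 0.40, "B": 0.90},
        "config_2": {"A": 0.55, "B": 0.75},
    }
    print(closest_configuration(toy_models, toy_human))
```

Summing the two gaps penalizes a configuration that matches humans in one context only, such as a model that masters the most recent context and forgets the earlier one.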
Distributed hidden layer representation strategy facilitates context-dependent learning 342
Having observed that models with initialized weight variance in the moderate range (such as 0.6) 343
successfully learned context-dependent associations without any explicit context signal as input, 344
we next examined how context information is encoded in the hidden layer activations of the 345
models. Context encoding could manifest as either a sparse representation (carried by a few 346
units) or a distributed representation (spread across many units). Therefore, we quantified two 347
complementary properties of the activations: the extent to which context sensitivity was localized 348
to a small subset of units (akin to individual “context cell” neurons that code for which context is 349
currently active), and the degree to which the currently active context was expressed a as 350
distinctive pattern of activity across many units. These analyses were conceptually motivated by 351
prior work distinguishing sparse and distributed coding strategies in hippocampal and
connectionist models.35,41 The analysis was conducted using the hidden-layer activations during
the final block of training.

The sparse representation index measures the proportion of hidden layer units that do not show
significant activation differences between contexts. A higher sparse representation index
therefore indicates that fewer units selectively encode a specific context, while a larger proportion
of units are context-insensitive (Fig. 5A). This measure of context sensitivity for each unit was
calculated with a one-way ANOVA comparing activations during exposure to Context A versus
Context B in the final quarter of training (block 4). Fewer nodes with significant context sensitivity
indicate that a limited number of hidden-layer units support the context-specific representations.
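
The computation described above can be illustrated with a short Python sketch that computes a per-unit one-way ANOVA F-statistic for Context A versus Context B activations and then the proportion of units falling below a significance cutoff. This is a minimal sketch under stated assumptions, not the analysis code used in the study: the function names, the toy activations, and the use of a tabulated critical F value in place of a p-value threshold are hypothetical choices made for the example.

```python
from statistics import mean


def f_statistic(group_a, group_b):
    """One-way ANOVA F-statistic for one unit's activations in two contexts."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = mean(group_a), mean(group_b)
    grand_mean = (sum(group_a) + sum(group_b)) / (n_a + n_b)
    # Between-group and within-group sums of squares.
    ss_between = n_a * (mean_a - grand_mean) ** 2 + n_b * (mean_b - grand_mean) ** 2
    ss_within = (sum((x - mean_a) ** 2 for x in group_a)
                 + sum((x - mean_b) ** 2 for x in group_b))
    df_within = n_a + n_b - 2  # df_between is 1 for two groups
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return ss_between / (ss_within / df_within)


def sparse_representation_index(acts_a, acts_b, f_critical):
    """Proportion of units whose context effect falls below the cutoff.

    acts_a, acts_b: one list of activations per hidden unit, for each context.
    f_critical: assumed cutoff on F (hypothetical stand-in for a p-value test).
    """
    f_values = [f_statistic(a, b) for a, b in zip(acts_a, acts_b)]
    n_insensitive = sum(f < f_critical for f in f_values)
    return n_insensitive / len(f_values), f_values


# Toy data (hypothetical): unit 0 separates the contexts, unit 1 does not.
acts_a = [[0.0, 0.1, 0.0, 0.1], [0.5, 0.6, 0.4, 0.5]]
acts_b = [[1.0, 1.1, 1.0, 1.1], [0.5, 0.4, 0.6, 0.5]]
index, f_values = sparse_representation_index(acts_a, acts_b, f_critical=5.99)
```

In this toy case the cutoff of 5.99 is the tabulated 5% critical value for F(1, 6), and half of the units are classified as context-insensitive. The same per-unit F values could also serve as a ranking of context sensitivity.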
363
The distributed representation index was derived from a representational similarity analysis of
hidden layer activations for the first item of each pair, the item that carries the context-dependent
association. The index compares the geometric distance (dissimilarity) of these representations
within a context versus across contexts, with normalization based on within-context consistency
so that more stable representations are given greater influence (Fig. 5A, right plot). Higher values
indicate that distinct context representations are distributed across the hidden layer.
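
One way such a within-versus-across comparison might be computed is sketched below in Python. The exact formula is not given in the text, so everything here is an assumption made for illustration: Euclidean distance is used as the dissimilarity measure, and the normalization simply divides the across-minus-within difference by the mean within-context distance, so that tighter (more consistent) within-context patterns yield larger values. The function name and the toy vectors are hypothetical.

```python
from itertools import combinations
from math import dist
from statistics import mean


def distributed_representation_index(patterns_a, patterns_b):
    """Contrast across-context and within-context pattern distances.

    patterns_a, patterns_b: hidden-state vectors for the same first item,
    recorded on repeated presentations in Context A and in Context B.
    Requires at least two patterns per context and nonzero within-context spread.
    """
    # Distances between repeated presentations inside the same context.
    within = [dist(p, q)
              for group in (patterns_a, patterns_b)
              for p, q in combinations(group, 2)]
    # Distances between presentations drawn from different contexts.
    across = [dist(p, q) for p in patterns_a for q in patterns_b]
    mean_within = mean(within)
    # Assumed normalization: scale by within-context consistency.
    return (mean(across) - mean_within) / mean_within


# Toy patterns (hypothetical): contexts separated along the first dimension.
separated = distributed_representation_index(
    [[0.0, 0.0], [0.0, 1.0]], [[3.0, 0.0], [3.0, 1.0]])
# Identical pattern sets in both contexts: no context information.
overlapping = distributed_representation_index(
    [[0.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 1.0]])
```

Under these assumptions, a positive value means the same item is represented more differently across contexts than across repetitions within a context.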
370
We found that the low-variance models exhibit the sparsest representations, and the
moderate-variance models exhibit the most distributed representations (Fig. 5A, middle plot). High-variance
models do not show strong evidence of either representation strategy. The 0.6-initialized model,
which was the only model to demonstrate significant direct-conflict 2AFC accuracy for both
contexts (Fig. 5B-C), exhibited the strongest evidence for the distributed over the sparse
representation index.

Figure 5. Neural network hidden layer task representation strategies. (A) Visualization of
computation of sparse representation index (left) and distributed representation index (right)
plotted for each weight variance configuration (middle; x-axis) in light green and blue,
respectively. (B) Significance of beta coefficients (y-axis) for multivariate regression analyses
using sparse (green) and distributed (blue) representation indices to predict each 2AFC question
category (x-axis). ***p<0.001; **p<0.01; *p<0.05. (C-D): Lesion analysis results. (C) 2AFC
accuracy of Context A (left) and Context B (right) questions (y-axis) as an increasing number of
hidden layer nodes are lesioned in descending rank order of context sensitivity (x-axis) for each
weight initialization configuration (rainbow coloring). (D) 2AFC accuracy performance difference
from no intervention for lesion analysis. (E) Context switch latency (y-axis) visualized across
learning phase blocks (x-axis) for each weight initialization configuration (rainbow coloring).
Failure to reflect a context switch is noted with a value of 51 (i.e., greater than the duration of a
context exposure).

To understand how these representation strategies supported learning, we regressed 2AFC
accuracy on the z-scored sparse and distributed representation indices (averaged across the 50
models initialized for each weight configuration; Fig. 5B). Including both predictors in the same
model allows us to evaluate their unique contributions to 2AFC task performance. Both indices
significantly predicted context-independent accuracy (Sparse: β = 0.080, p = 0.016; Distributed: β
= 0.081, p = 0.015). For context-dependent learning, the sparse representation index predicted
performance for Context B (β = 0.17, p < 0.001) but not for Context A (β = 0.007, p = 0.71),
consistent with its prominence in the low-variance models that disproportionately learned Context
B. In contrast, the distributed representation index predicted accuracy for both Context A (β =
0.069, p = 0.012) and Context B (β = 0.11, p < 0.001), reinforcing that this strategy more
effectively supports context-dependent learning, where successful learning requires retaining
knowledge of both contexts.

Efficient context switching facilitates expression of context-dependent knowledge
We next carried out a lesion simulation analysis to understand how the moderate-variance
models provide a better account of human behavior than the low-variance models, whose
initialization is often used as the default in neural networks. For each model, all 150 hidden layer units
were ranked according to their context sensitivity index (the F-statistic of activity difference
between Context A and Context B). We then progressively lesioned the most context-sensitive
units by setting their activations to zero and re-evaluated 2AFC accuracy after each lesion step.
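
The lesion procedure described above can be sketched in Python as follows. This is an illustrative outline only, not the authors' implementation: `lesion_curve`, `apply_lesion`, and the `evaluate` callback are hypothetical names, and the toy evaluation function merely stands in for re-running the 2AFC test with the selected units silenced.

```python
def apply_lesion(hidden, lesioned):
    """Return a copy of a hidden-state vector with lesioned units set to zero."""
    return [0.0 if i in lesioned else h for i, h in enumerate(hidden)]


def lesion_curve(f_values, evaluate, step=1):
    """Progressively lesion units in descending order of context sensitivity.

    f_values: per-unit context sensitivity (e.g., ANOVA F-statistics).
    evaluate: hypothetical callback that takes a set of lesioned unit indices
              and returns accuracy with those units' activations zeroed.
    Returns a list of (n_lesioned, accuracy) tuples, starting with no lesion.
    """
    order = sorted(range(len(f_values)), key=lambda i: f_values[i], reverse=True)
    curve = [(0, evaluate(set()))]
    for n in range(step, len(order) + 1, step):
        curve.append((n, evaluate(set(order[:n]))))
    return curve


# Toy stand-in for a 2AFC evaluation (hypothetical): the fraction of a
# four-unit hidden state that survives the lesion.
def toy_evaluate(lesioned):
    return sum(apply_lesion([1.0, 1.0, 1.0, 1.0], lesioned)) / 4


curve = lesion_curve([0.2, 9.0, 3.5, 0.1], toy_evaluate)
```

Plotting the second element of each tuple against the first would give an accuracy curve of the kind shown for each weight configuration, with the most context-sensitive units removed first.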
413
The moderate-variance models showed a steady decline in both Context A and Context B accuracy
as more nodes were lesioned (Fig. 5C). This result further indicates a distributed representation
strategy where many units contribute uniquely to the representation of the current context. In
contrast, the low-variance models showed evidence of a redundant coding strategy: accuracy,
particularly for Context B, remained largely unchanged until around half of the hidden layer was
lesioned (Fig. 5C). This delayed performance decline complements the earlier finding of a sparse
representation strategy, in which very few nodes showed significant context sensitivity,
suggesting that most units carried only shallow, overlapping context signals. Then, when Context
B performance began to decline, Context A performance actually increased (Fig. 5D), with
performance eventually reaching a level comparable to the moderate-variance models (Fig. 5C). This
result indicates that the Context A representations are present in the knowledge base of the
network. However, this knowledge is not accessible during the 2AFC test, given that the
low-variance models show excellent Context B accuracy (the context on which they were more
recently trained) but extremely poor Context A accuracy (the previous context) (Fig. 4A).

To explain the discrepancy in 2AFC accuracy in light of this evidence of Context A knowledge
preserved in both low- and moderate-variance models, we examined how efficiently models
adapted to context switches. We derived a context switch latency metric operationalized as the
number of pairs the model processed after a context switch before perfectly predicting the paired
associate of all remaining pairs in each 50-pair context exposure. We found that, by the final
block of training, the moderate-variance models exhibited faster switch latencies whereas the
low-variance models adapted more slowly (Fig. 5E). This inefficiency likely prevented the
low-variance models from shifting away from their end-of-training state during the brief six-item
exposure provided in each 2AFC trial, leaving them biased toward Context B despite evidence of
retaining earlier Context A associations. Overall, this evidence indicates that the
moderate-variance models achieve the best context-dependent learning because they retain associative
knowledge across contexts and quickly adapt to context switches with support from a distributed
representation of context across the hidden layer.
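
The context switch latency metric defined above can be written out as a short Python function. This is a sketch under stated assumptions: the function name is hypothetical, the input is assumed to be one correct/incorrect flag per pair in a single context exposure, and latency is counted from zero (a model that is correct from the first pair onward receives a latency of 0). The failure value of 51 follows the convention noted in the Figure 5 caption.

```python
def context_switch_latency(correct, failure_value=51):
    """Pairs processed after a switch before predictions stay perfect.

    correct: list of booleans, one per pair in a single context exposure
             (in presentation order), True when the model predicted that
             pair's second item.
    Returns the index of the first pair from which every remaining pair is
    predicted correctly, or failure_value if the final pair is still wrong.
    """
    latency = len(correct)
    # Walk backwards to find where the unbroken run of correct pairs begins.
    for i in range(len(correct) - 1, -1, -1):
        if not correct[i]:
            break
        latency = i
    if latency == len(correct):
        return failure_value
    return latency


# Toy exposure (hypothetical): two errors after the switch, then perfect.
example = context_switch_latency([False, False] + [True] * 48)
```

Note that an isolated correct prediction followed by a later error does not count as adaptation under this definition, because the criterion requires all remaining pairs in the exposure to be predicted correctly.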
442
Discussion

Given that statistical learning enables associative strengths to be incrementally updated over
many exposures, it has been unclear whether it affords sufficient flexibility to adapt to changing
associative contingencies in different contexts. The present results extend our understanding of
when incidental learning of temporal regularities is possible by demonstrating
context-dependent statistical learning under circumstances where contexts dynamically alternate,
associations directly conflict, and no explicit instructions are provided. We found evidence for this
in above-chance performance on the final 2AFC test as well as progressive RT speeding for
predictable objects compared to unpredictable objects over the course of learning, which
suggests that implicit learning mechanisms facilitated anticipatory behavior.42

Notably, explicitly signaling context with a colored border (Expt. 2) did not enhance
context-dependent learning compared to when context was fully latent (Expt. 1). This may reflect a greater
influence of local temporal context (e.g., recent sequence history, which was available during
both learning and retrieval) than of environmental cues, or it may reflect disruption of implicit
learning mechanisms if the cue promoted a more deliberate strategy. Indeed, recent work suggests that states of reduced
executive control, such as mind wandering, can enhance statistical learning relative to focused
on-task states,43 implying that exogenously focusing attention via explicit cues may be
counterproductive for this type of incidental learning. However, given that the border cue changed
color only every few minutes and its relevance was not explicitly conveyed, it is also possible that
this visual cueing of context was too subtle to provide a performance advantage; a
more salient context signal might have produced different effects.

Prior efforts to demonstrate context-dependent statistical learning with auditory stimuli have been
unable to find learning of both contexts unless participants were provided with explicit instructions
or salient context cues.18,19 Our success in the visual perceptual domain supports accounts
suggesting that statistical learning mechanisms may be modality-specific rather than fully
domain-general,44 with visual statistical learning potentially more robust to context-based interference
under implicit learning conditions. Siegelman et al.20 reported some evidence of
context-dependent learning in the visual domain using associative structures built from an overlapping set
of stimuli. However, their paradigm involved a single consecutive exposure to each of the two
contexts (rather than repeated interleaved context switching), self-paced stimulus sequence
exposure, and explicit instructions to look for patterns in which shapes tended to follow each
other. These design choices differ from those of the present study, which used shorter,
fixed-duration stimulus presentations to minimize possibilities for strategic encoding and support the
passive, implicit learning that is thought to characterize statistical learning.45

The present findings build on prior work on second-order conditional (SOC) sequence learning,
which has demonstrated that learners can extract higher-order temporal dependencies in which
predictability depends on combinations of preceding elements rather than simple pairwise
transitions.42,46 Recent work further suggests that exposure to SOC structure can shape
subsequent performance and subjective sensitivity to sequence regularities even when explicit
knowledge is limited.47 Although the surface structure of these tasks differs from the present
paradigm, both lines of work underscore how context-sensitive behavior can emerge from the
integration of temporal regularities over experience, without requiring explicit contextual signals.

We used a neural network modeling approach to inform hypotheses of how the human brain
might support such learning. These models were optimized to predict the next object in the
sequence. While our human participants engaged in a cover task requiring simple ×/+ perceptual
judgments, we assume that they were implicitly forming predictions about upcoming stimuli.
Therefore, the models’ predictive framework captures a core computational goal that the human
learners pursue implicitly: anticipating future input based on recent experiences.1
496
The GRU’s gating architecture may support its successful context-dependent learning by
enabling the model to manage conflicting associations, retaining relevant information
while filtering out noise. This computational function parallels how the human brain manages
interference between new and old memories.48 Although GRU models are not intended as
models of biological mechanisms, the update and reset gates bear resemblance to the dynamic
interplay between the hippocampus and neocortex that supports stability for long-term memory
storage34 as well as to neuromodulatory systems where prediction error signals (i.e., dopamine
release) prompt a reassessment of context and switch in behavioral strategy.49 Such mechanisms
have been hypothesized to facilitate segmentation of continuous experiences and recalibration of
predictions,50 which may be relevant for context-dependent temporal associative learning and
motivate hypotheses for future work examining parallels between biological systems and
computational models.

Prior neural network modeling of context-dependent learning has imbued neural networks with
specialized architecture to facilitate latent cause inference28,29 or has explicitly provided
unambiguous context information in model input.22,27 For example, Smith and colleagues51
effectively demonstrated that recurrent networks can track temporal structure across multiple
timescales within explicitly signaled contexts in a statistical learning paradigm instantiated as
games that share response choices. However, these studies bypass the question of how a sense
of context might emerge organically from exposure alone to disambiguate overlapping task
structure. Additionally, they introduce assumptions that are arguably biologically implausible, such
as constant context monitoring and perfectly reliable context cues.5 Here, we more directly focus
on latent context discovery by exploring how weight initialization affects learning dynamics. Since
network weights are adjusted throughout training to minimize loss, their initial configuration acts
as a key driver of convergence.52 Prior work suggests that higher initial weight magnitudes bias
models toward “lazy” solutions, involving rapid solution convergence with unstructured
representations, while smaller magnitudes support “rich” solutions that exhibit more structured
learning albeit at a slower pace.27,33,40

Indeed, increasing the variance of the uniform distribution used to initialize model weights to a
moderate range facilitated successful context-dependent 2AFC performance. This improvement
was associated with a high-dimensional, distributed code in the hidden layer that significantly
predicted 2AFC accuracy in both contexts. This is consistent with studies suggesting that
high-dimensional codes afforded by mixed selectivity in prefrontal cortex neurons allow for more
flexibility and rapid adaptation to new tasks.41,53 The successful distributed context coding
strategy, in which identical model input is represented differently when processed in different
contexts, is consistent with reports of the hippocampus integrating contextual information into
stimulus representations.54,55 Furthermore, the hippocampus supports the rapid learning of
temporal associations.37,56 Taken together, these parallels suggest that the moderate-variance
GRU models are capturing both higher-level contextual encoding and lower-level temporal
associations, consistent with core functions of the hippocampus.
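
To make the manipulation of initialization variance concrete, the following Python sketch shows how a target variance maps onto the bounds of a uniform distribution. It assumes a zero-mean, symmetric distribution U(-a, a), for which the variance is a^2 / 3, so that a = sqrt(3 * variance); the text does not state whether the distribution was symmetric or which weight matrices it was applied to, so this mapping, the function names, and the 150 x 150 example shape are illustrative assumptions only.

```python
import math
import random


def uniform_bound_for_variance(variance):
    """Half-width a of U(-a, a) with the requested variance (Var = a**2 / 3)."""
    return math.sqrt(3.0 * variance)


def init_weights(n_rows, n_cols, variance, seed=0):
    """Draw a weight matrix from a zero-mean uniform distribution.

    Assumption: symmetric bounds around zero; the seed is only for
    reproducibility of this example.
    """
    rng = random.Random(seed)
    a = uniform_bound_for_variance(variance)
    return [[rng.uniform(-a, a) for _ in range(n_cols)] for _ in range(n_rows)]


# Example: a variance of 0.6 corresponds to bounds of about +/- 1.34.
bound = uniform_bound_for_variance(0.6)
weights = init_weights(150, 150, variance=0.6)
flat = [w for row in weights for w in row]
sample_mean = sum(flat) / len(flat)
sample_variance = sum((w - sample_mean) ** 2 for w in flat) / len(flat)
```

Under this assumption, larger variances simply widen the range from which initial weights are drawn, which is the sense in which initial weight magnitude increases with variance.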
538
The variance of weight initialization may be interpreted as shaping the GRU’s inductive bias: the
assumptions the model makes about the structure of the environment, particularly regarding the
presence and separability of underlying contexts. Low initial weight variance appeared to bias the
model towards rigid representations that emphasized recently experienced associations and
failed to recover earlier learned patterns following context shifts. On the other end, high initial
weight variance produced overly flexible representations that failed to consolidate stable
structure. Our analyses suggest an optimal intermediate range of initial weight magnitudes, where
models were sufficiently flexible to distinguish between contexts yet structured enough to
preserve associations within each context and avoid catastrophic interference. Accordingly, these
effects are best understood as emergent inductive biases shaped by properties of training
dynamics and initialization, which may provide insight into how learning systems come to
represent and segregate latent contexts, whether biological or artificial. Future work could assess
the extent to which similar learning dynamics arise across architectures and task demands and
whether manipulating network hyperparameters, such as learning rate and number of hidden
layers, consistently shapes the balance between plasticity and stability in context-dependent learning.

To better understand why the moderate-variance models succeeded, it is informative to examine
the limitations of the low-variance models. These models performed poorly on Context A test
trials, a pattern that might initially suggest catastrophic interference (i.e., that previously learned
associations of Context A were overwritten by more recent Context B experience). However, the
lesion analysis revealed that Context A knowledge remained in the networks but was not
accessible until over half the hidden layer was removed. One likely explanation for this
inaccessibility is the slower context switching in low-variance models: compared to the
moderate-variance models, which successfully expressed knowledge of both contexts, the low-variance
models were slower to accommodate context switches. As a result, the brief context exposure
sequences preceding each 2AFC decision may not have provided sufficient evidence to pull them
out of their Context B-oriented state at test, which persisted simply because Context B
was the last context encountered during training. This phenomenon parallels findings from the
fear extinction literature, where extinguished fear responses can re-emerge in a different context,
indicating that underlying knowledge is retained but not manifested in behavior when irrelevant to
the current setting.14

Another key limitation of the low-variance models that emerged from the lesion results was a
constraint on how knowledge was represented in the hidden layer units. Before Context A
performance recovered, these models showed little to no change in 2AFC accuracy for either
context until roughly half the hidden layer was lesioned, in contrast to the steady performance
decline observed in moderate- and high-variance models. This suggests highly redundant coding
within the hidden layer. Redundant neural coding is theorized to enhance robustness in noisy
environments by duplicating information across neural populations,53,57,58 a potentially
advantageous feature for the present task, where many associations directly conflict and half of
the training samples are unreliable (e.g., between-pair transitions). Such redundancy could
plausibly account for why the low-variance models achieved the strongest accuracy on Context B.
However, although this redundant coding strategy may help stabilize performance within a single
context amidst overall environmental instability, it ultimately proved ineffective because it limited
the rapid adaptability needed to operate in a dynamic environment with multiple
context-dependent structures, resulting in a failure to express knowledge of both contexts.

Mirroring the diversity of these computational profiles, humans also exhibited considerable
variability. Although performance on context-dependent trials was significantly above chance at
the group level, some participants exhibited little or no learning (akin to the high-variance models)
while others showed stronger learning of one of the contexts (similar to the low-variance models).
Just as some GRU models required more exposure to learn both sets of associations, certain
individuals may also need more input to reach stable learning. A promising future direction is to
identify model parameters that reflect these individual differences and predict how quickly a
learner converges on context-dependent associations, potentially linking such parameters to
developmental changes in learning efficiency.59

Taken together, our findings demonstrate that humans can spontaneously resolve conflicting,
context-dependent associations from passive exposure alone, even in the absence of explicit
instructions, self-pacing, feedback, or contextual cues. The finding that explicit signaling offered
no advantage over entirely latent context exposure further highlights the robustness of this
incidental learning mechanism, suggesting that temporal statistics alone are sufficient to drive
contextual inference. Our neural network modeling provides a mechanistic account for this
capacity, showing that successful adaptation relies on the emergence of distributed
representations that are influenced by weight initialization parameters, which we believe are a
reasonable proxy for humans’ inductive biases. This representational strategy not only maintains
information from multiple contexts, even when the associative structure directly conflicts across
contexts, but also quickly accommodates context changes. These findings suggest that the
human brain may rely on similar mechanisms to flexibly manage latent contextual shifts and
support adaptive prediction in dynamic environments.

Materials and methods

Participants
Participants were recruited via the UCLA Psychology Department subject pool and completed the
experiment in person for course credit. All participants provided informed consent in accordance
with protocols approved by the UCLA Institutional Review Board (IRB#22-001719). Inclusion
criteria (age between 18 and 40 years, native English speaker, and normal or corrected-to-normal
vision with contacts, not glasses) were confirmed before commencing data collection. Our goal
was to obtain useable data from 50 participants for each of the two experiments (Expt. 1:
Unsignaled context; Expt. 2: Signaled context), so recruitment continued until 50 participants per
experiment met our data quality thresholds of 90% of trials responded to and 85% accuracy on
trials during the learning phase. These thresholds were enforced to ensure that data analyses focused on
participants who were engaged during the learning phase of the experiment. Our final sample
included 50 participants for Expt. 1 (33 F / 17 M; mean age = 20.5 years) and 50 different
participants for Expt. 2 (40 F / 8 M / 2 Non-Binary; mean age = 20.0 years).

Materials
The experiment was coded and run with PsychoPy (version 2024.2.4)60 on a Mac Mini. Stimuli
were displayed on a DELL P2422HE monitor with 1920 by 1080 pixel resolution and screen size
of 23.8 inches, which participants viewed from a fixed distance with their head stabilized with a
forehead and chin rest. An EyeLink 1000 eye tracker (SR Research) captured gaze location while
participants completed the experiment, but eye tracking data are not reported here. Experiment
stimuli were drawn from a set of objects created using Blender (version 2.48).61,62 The stimuli were visually
distinct in terms of shape and color and were novel to participants. Images were resized to be
350 pixels wide. A small “×” or “+” symbol was subtly embedded onto each object using slight
color contrast such that the mark was visible but did not obstruct recognition of object shape.

Learning phase
In the first phase of the experiment, participants were exposed to a sequence of objects
presented individually. Each object was presented at one of four locations on the screen, at a
width of 350 pixels and centered 300 pixels above, below, right, or left of the center of a gray
screen. At each object presentation, the three positions not occupied by the current object were
filled with phase-scrambled versions of other objects cropped into circles with a diameter of 300
pixels (visualized in SI Appendix, Figure S2). The experimental manipulation of object location
was included to enable potential analyses of spatial location-based learning as indexed by
anticipatory eye movements. However, because the eye tracking data did not yield clear or
interpretable effects, we focus all analyses on object identity and omit spatial position from further
consideration, as well as from the task depiction in Fig. 1. Before beginning the learning phase,
participants were instructed that parts of the sequence might become familiar over time and that
they would later be asked questions about the objects they had seen.

Unbeknownst to the participants, the objects were organized into two sets (or contexts) of 5 pairs
of objects. The same object set was used for all participants, but objects were randomly assigned to
pair positions, and each object maintained either first-of-pair (item 1) or second-of-pair (item 2)
position in the pair across contexts. One pair was context-independent, meaning the same two
objects were paired in both contexts. The other four pairs were context-dependent. Three of
these pairs consisted of the same set of six objects across both contexts, but the second item
associated with each first item was dependent on context. For example, Object X is paired with
Object Y in Context A but with Object Z in Context B. The last context-dependent pair shared the
same first object across both contexts but the paired second object was specific to each context
(i.e., only appeared in that context and not the other). In this way, these four context-dependent
pairs shared a first item across contexts but the second item of each pair was dependent on
context. In total, a set of 11 unique items was used to instantiate the five pairs in each context.
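
The pair structure described above can be written out explicitly as a small Python data structure. This is an illustrative sketch only: the item labels are hypothetical placeholders rather than the study's stimuli, and the particular way the three shared-object pairs are re-paired in Context B (a rotation of second items) is an assumption, since the text specifies only that the second item paired with each first item differed by context.

```python
# Each context maps a first-of-pair item to its second-of-pair item.
# All labels are hypothetical placeholders.
CONTEXT_A = {
    "F0": "S0",   # context-independent pair (identical in both contexts)
    "F1": "S1",   # three pairs built from six objects shared across contexts
    "F2": "S2",
    "F3": "S3",
    "F4": "S4a",  # second item appears only in Context A
}
CONTEXT_B = {
    "F0": "S0",
    "F1": "S2",   # same six objects, re-paired (assumed rotation)
    "F2": "S3",
    "F3": "S1",
    "F4": "S4b",  # second item appears only in Context B
}


def unique_items(*contexts):
    """All distinct objects used across the given contexts."""
    items = set()
    for pairs in contexts:
        items.update(pairs.keys())
        items.update(pairs.values())
    return items


def context_dependent_firsts(ctx_a, ctx_b):
    """First items whose paired second item differs between contexts."""
    return sorted(first for first in ctx_a if ctx_a[first] != ctx_b[first])


n_items = len(unique_items(CONTEXT_A, CONTEXT_B))
dependent = context_dependent_firsts(CONTEXT_A, CONTEXT_B)
```

Counting the distinct labels recovers the 11 unique items (five first items, four second items shared across contexts, and two context-specific second items), and four of the five first items have context-dependent associates.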
663
Throughout this learning phase, participants were tasked with responding to whether the object
onscreen was marked with an “×” or “+”. Therefore, reaction time to the perceptual question could
be evaluated as an online measure of pair structure learning. Objects were presented for 1200 ms
with a 450 ms interstimulus interval.

The two experiments differed with respect to whether context was Unsignaled (Expt. 1) or
Signaled (Expt. 2) with a border around the objects that was white or black depending on the
context.

Two-alternative forced choice (2AFC) task
In the first of three test tasks immediately following this learning phase, participants completed a
two-alternative forced choice (2AFC) task. Because the object associations were dependent on
active context for all but the one context-independent pair, on each 2AFC trial participants were
presented with a sequence of seven objects (consisting of three pairs from one of the contexts
and the first item of the test pair) before being presented with two side-by-side alternatives as to
which object they thought should come next (one was the correct paired associate of the test pair
and the other was a lure). Objects in the sequence were presented with the same timing as used
during the learning phase, and participants were given unlimited time to make a choice
between target and lure. Participants completed a total of 54 questions: 6 of these questions
evaluated the context-independent pair, while 48 questions evaluated context-dependent
associations. The 48 questions probing context-dependent associations could either feature a
lure object that was the correct paired associate in the other context (direct-conflict; 16
questions), or a lure that was any other item (indirect-conflict; 32 questions). The “×” and “+”
markings were removed from the objects to make clear that participants no longer were required
to respond to the perceptual question. For the Unsignaled experiment, no explicit context cues
were provided; for the Signaled experiment, the border around the objects was colored white or
black on each trial to cue contexts. After making each 2AFC judgment, participants were
prompted to rate their confidence in their decision on a scale from 1 to 4.

Structure knowledge probe
After completing all 2AFC trials, participants were prompted to answer some questions about
what they learned during the experiment. First, they were asked to respond yes or no to whether
they observed any predictable patterns in the experiment. Second, they were asked to describe
any patterns they observed in the sequence. Third, they were asked to describe any rules that
governed which object would come next in the sequence. These increasingly direct questions
were intended to progressively prompt participants to report any knowledge of the pair structure
underlying the sequence they observed, giving an indication of how much of that knowledge was
explicit.

Pair reconstruction task
The last task allowed participants to demonstrate explicit knowledge of pairs. Participants were
presented with a bank of all 11 objects at the top of the screen and provided with 20 sets of two
empty squares side-by-side presented in 4 rows of 5 columns. Participants were instructed to
organize the objects into related pairs by placing one object in each square of a pair, with the left
and right positions corresponding to the first and second items in the pair. Participants were told
that each item could be used more than once and that they did not have to fill out all of the pairs.

Data cleaning
Exclusion criteria ensured that participants engaged with the experiment during the
learning phase. Participants were retained only if they had a response rate of more
than 90% and an accuracy of more than 85% on all responses throughout the learning phase.
Data collection continued until 50 usable participants were obtained for each experiment.
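The two retention criteria amount to a simple filter. The sketch below is illustrative only: the function and argument names are hypothetical, and computing accuracy over the trials that received a response is an assumption about how "accuracy on all responses" was scored.

```python
def is_usable(n_trials, n_responses, n_correct,
              min_response_rate=0.90, min_accuracy=0.85):
    """Return True if a participant passes both learning-phase criteria.

    Assumption: accuracy is computed over the trials that received a response.
    """
    if n_trials == 0 or n_responses == 0:
        return False
    response_rate = n_responses / n_trials
    accuracy = n_correct / n_responses
    # Both thresholds are strict ("more than"), as stated in the text.
    return response_rate > min_response_rate and accuracy > min_accuracy
```

A participant who sits exactly at a threshold (for example, a 90% response rate) is excluded under this strict reading.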
716
Neural network architecture 717
Recurrent neural network models with gated recurrent units (GRUs) were implemented with
PyTorch v2.0.1 (63). Such models have previously been used to explore context-dependent
associative learning from sequences.22,27 Each model had the same architecture: an
11-dimensional input layer (one unit per object included in the study), a hidden layer of 150
nodes (GRU performance with different hidden layer sizes is presented in SI Appendix, section S5),
and an 11-dimensional output layer matching the dimensionality of the one-hot object vectors.
The learning rate was held constant at 0.001, and model weights were updated after each training
sample using the Adam optimizer with a cross-entropy loss. Default parameters were used
unless otherwise noted.
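A minimal PyTorch sketch of an architecture matching this description is given here for orientation. The class name, the single-layer GRU, and the linear readout are assumptions; the text specifies only the layer sizes, the optimizer, the learning rate, and the loss.

```python
import torch
import torch.nn as nn

N_OBJECTS = 11   # one-hot input/output dimensionality (from the text)
N_HIDDEN = 150   # hidden layer size (from the text)


class NextItemGRU(nn.Module):
    """GRU that maps a one-hot object to logits over the next object."""

    def __init__(self, n_objects=N_OBJECTS, n_hidden=N_HIDDEN):
        super().__init__()
        # Assumption: a single GRU layer followed by a linear readout.
        self.gru = nn.GRU(input_size=n_objects, hidden_size=n_hidden,
                          batch_first=True)
        self.readout = nn.Linear(n_hidden, n_objects)

    def forward(self, x, h=None):
        # x: (batch, time, n_objects); h: (1, batch, n_hidden) or None (zeros)
        out, h = self.gru(x, h)
        return self.readout(out), h


model = NextItemGRU()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()  # expects raw logits and integer targets
```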
727
Training 728
A unique 1600-object sequence was generated for each model in the same way as for human 729
participants. Each neural network received one object at a time and was trained to predict the 730
identity of the next object in the sequence. Although the sequence was constructed using 731
embedded object pairs, models received no information about this underlying structure. That is, 732
the model made predictions at every time step (1599 samples for the 1600-object sequence) and 733
had no awareness of pair boundaries. The same sequence was used for all epochs of training,
with the hidden state reset at the start of each epoch and between blocks (every 400
samples) in recurrent models to emulate the breaks taken by human participants.
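The training procedure can be sketched as a loop with one weight update per sample and a hidden-state reset every 400 samples. This is a sketch under stated assumptions, not the authors' code: the function name is hypothetical, and detaching the hidden state after each update (so that gradients do not flow back through earlier samples) is an assumption the text does not address.

```python
import torch
import torch.nn as nn


def train_one_epoch(gru, readout, optimizer, sequence, n_objects=11,
                    block_len=400):
    """One pass over an object sequence with a weight update per sample.

    sequence: list of integer object indices (length 1600 in the text).
    Assumption: the hidden state is detached after each update.
    """
    loss_fn = nn.CrossEntropyLoss()
    h = None  # None makes nn.GRU start from a zero hidden state
    n_correct = 0
    for t in range(len(sequence) - 1):       # 1599 samples for 1600 objects
        if t % block_len == 0:
            h = None                         # reset at epoch start and between blocks
        x = torch.zeros(1, 1, n_objects)
        x[0, 0, sequence[t]] = 1.0           # one-hot encoding of the current object
        target = torch.tensor([sequence[t + 1]])
        out, h = gru(x, h)
        logits = readout(out[:, -1, :])      # shape (1, n_objects)
        loss = loss_fn(logits, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        h = h.detach()
        n_correct += int(logits.argmax(dim=1).item() == sequence[t + 1])
    return n_correct / (len(sequence) - 1)


gru = nn.GRU(11, 150, batch_first=True)
readout = nn.Linear(150, 11)
params = list(gru.parameters()) + list(readout.parameters())
optimizer = torch.optim.Adam(params, lr=0.001)
```

Calling `train_one_epoch(gru, readout, optimizer, sequence)` repeatedly on the same sequence corresponds to multiple epochs of training.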
737
2AFC task 738
After each epoch of learning, model weights were frozen, and the models were evaluated using a 739
2AFC test designed to mirror the testing procedure of the human participants. A unique set of 740
2AFC test questions was generated for each model in the same way as for human participants. 741
Before each trial, hidden layer activity was reset to zero. Then, a sequence of three pairs from 742
one of the contexts was presented as the hidden state evolved, allowing the model to infer the 743
active context based on the sequence. Finally, the first object of the test pair was presented, and
the model’s prediction of the ensuing item was evaluated. A trial was scored as correct if
the probability assigned to the correct paired associate was higher than that assigned to the lure.
In most analyses, accuracy is evaluated separately for the context-independent, indirect-conflict
context-dependent, and direct-conflict context-dependent question sets to capture how well the
models handle conflicting information across contexts and maintain knowledge of stable,
context-independent relationships.
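The scoring rule for a model 2AFC trial can be sketched as follows. The function names and the tuple layout of a trial are hypothetical; the only rule taken from the text is that a trial counts as correct when the target receives a higher probability than the lure.

```python
import math


def softmax(logits):
    """Convert a list of logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_2afc_trial(logits, target, lure):
    """Return True if the target receives a higher probability than the lure."""
    probs = softmax(logits)
    return probs[target] > probs[lure]


def accuracy_by_question_set(trials):
    """trials: iterable of (question_set, logits, target, lure) tuples."""
    hits, counts = {}, {}
    for question_set, logits, target, lure in trials:
        counts[question_set] = counts.get(question_set, 0) + 1
        hits[question_set] = hits.get(question_set, 0) + int(
            score_2afc_trial(logits, target, lure))
    return {name: hits[name] / counts[name] for name in counts}
```

Because softmax is monotonic, comparing probabilities is equivalent to comparing the two logits directly; the probabilities are computed here only to mirror the description.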
751
Single epoch analyses 752
We tested the GRU’s ability to learn the task as the variance of the uniform distribution used to 753
initialize the hidden layer’s weights was increased. The uniform distribution was centered at zero 754
with positive and negative bounds of 0.08 (default for PyTorch with 150 nodes), 0.2, 0.4, 0.6, 0.8, 755
1.0, 1.2, and 1.4. For each configuration, fifty independent GRU models with different random
weight initializations were trained and tested, and learning measures were averaged across these
models to obtain robust performance estimates for each weight initialization condition.
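As a worked check on the bounds, PyTorch's default initialization for `nn.GRU` draws from U(-b, b) with b = 1/sqrt(hidden_size), which is about 0.08 for 150 units, and a uniform distribution on [-b, b] has variance b²/3. The snippet below tabulates the variance implied by each bound listed in the text; the helper names are illustrative. In PyTorch, custom bounds of this kind could be applied with `torch.nn.init.uniform_`, although which parameters were re-initialized is not specified here.

```python
import math

HIDDEN_SIZE = 150
BOUNDS = [0.08, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4]  # from the text


def pytorch_default_bound(hidden_size):
    """Bound b of PyTorch's default U(-b, b) initialization for nn.GRU."""
    return 1.0 / math.sqrt(hidden_size)


def uniform_variance(bound):
    """Variance of a uniform distribution on [-bound, bound]."""
    return bound ** 2 / 3.0


print(f"default bound for {HIDDEN_SIZE} units: "
      f"{pytorch_default_bound(HIDDEN_SIZE):.4f}")
for b in BOUNDS:
    print(f"bound = {b:.2f} -> variance = {uniform_variance(b):.4f}")
```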
759
Learning trajectory analysis: Context switch latency 760
To assess how quickly the neural network models adapted to a context change, we developed a 761
switch latency measure. We devised a stringent operationalization of switch latency as the
number of first-of-pair (item 1) items (i.e., the items whose model output captures the within-pair
transition prediction) a model processed after a context switch before achieving perfect accuracy
on all remaining item 1 samples in that 50-pair context exposure. Because only the model outputs
of item 1 training samples are predictable and thus learnable, they served as a measure of 766
adaptation to a new context. Switch latency was calculated for all 16 context exposures (8 per 767
context) and averaged within each block (4 context exposures), yielding a single switch latency 768
value per block. We averaged this measure across all 50 trained models for each weight 769
initialization condition. 770
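One way to read this definition is that switch latency equals the position of the last item 1 error within a context exposure. The sketch below implements that reading; the function names and the conventions for the edge cases (0 when every sample is correct, the full exposure length when the final sample is wrong) are assumptions.

```python
def switch_latency(item1_correct):
    """Number of item 1 samples processed before accuracy is perfect on all
    remaining item 1 samples of the exposure.

    item1_correct: list of booleans, one per item 1 sample, in order.
    Returns 0 if every sample is correct and len(item1_correct) if the final
    sample is wrong (assumed convention).
    """
    latency = 0
    for position, correct in enumerate(item1_correct):
        if not correct:
            latency = position + 1  # any later error pushes the switch point back
    return latency


def block_latencies(exposure_latencies, exposures_per_block=4):
    """Average latencies over consecutive groups of context exposures."""
    return [
        sum(exposure_latencies[i:i + exposures_per_block]) / exposures_per_block
        for i in range(0, len(exposure_latencies), exposures_per_block)
    ]
```

With 16 exposures per model, `block_latencies` returns four values, one per block.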
771
Hidden layer analyses: Context representation strategies 772
To understand how context information was represented across the hidden layer, we quantified 773
two complementary properties of the activations: the extent to which context sensitivity was 774
localized to a small subset of units (akin to individual “context cell” neurons that code for which 775
context is currently active), and the degree to which the currently active context was expressed
as a distinctive pattern of activity across many units. These analyses were conceptually motivated
by prior work distinguishing sparse and distributed coding strategies in hippocampal and 778
connectionist models.35,41 Our goal was to determine whether these representational properties 779
could explain 2AFC task performance across all individual GRU model instances of the weight 780
variance configurations. 781
782
A sparse representation describes when context sensitivity is confined to a relatively small subset 783
of hidden layer units, while the vast majority remain inactive or insensitive. To investigate sparse 784
context representations in the GRU’s hidden layer, we first used a one-way ANOVA to estimate 785
the difference in activation when processing inputs from Context A and Context B during the final 786
quarter of training (block 4) for each of the 150 hidden layer nodes. We then counted the number 787
of nodes that showed a significant activation difference. We applied a Bonferroni correction within
the analysis of each model to control for Type I errors across the 150 comparisons performed. The
corrected significance threshold was computed by dividing the original alpha level (0.05) by the 790
number of comparisons (150), yielding an adjusted significance level of p < 0.00033. Based on 791
this threshold, we determined that Fcrit(1,398) = 12.75 and calculated the sparse representation 792
index as the proportion of hidden layer nodes that did not show a significant difference in 793
activation between contexts, such that a larger value reflects a sparser context representation. 794
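The sparse representation index can be sketched with NumPy as follows. The function names and array layout are hypothetical, and the F statistic is computed by hand for the two-group case; the critical value is the one reported in the text.

```python
import numpy as np

F_CRIT = 12.75  # Bonferroni-corrected critical value reported in the text


def one_way_f(a, b):
    """One-way ANOVA F statistic per node for two groups.

    a, b: arrays of shape (n_samples, n_nodes) holding hidden activations
    recorded in Context A and Context B, respectively.
    """
    n_a, n_b = a.shape[0], b.shape[0]
    mean_a, mean_b = a.mean(axis=0), b.mean(axis=0)
    grand = np.concatenate([a, b]).mean(axis=0)
    ss_between = n_a * (mean_a - grand) ** 2 + n_b * (mean_b - grand) ** 2
    ss_within = ((a - mean_a) ** 2).sum(axis=0) + ((b - mean_b) ** 2).sum(axis=0)
    df_within = n_a + n_b - 2
    # df_between is 1 for two groups, so MS_between equals SS_between.
    return ss_between / (ss_within / df_within)


def sparse_index(act_a, act_b, f_crit=F_CRIT):
    """Proportion of nodes WITHOUT a significant context difference."""
    f_stats = one_way_f(act_a, act_b)
    return float(np.mean(f_stats <= f_crit))
```

With 200 samples per context, the within-group degrees of freedom are 398, matching the Fcrit(1,398) given above.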
795
A distributed representation was computed using a representational similarity analysis (RSA; 64) 796
focused on the hidden layer activations after processing the first item of each pair (capturing the 797
context-dependent prediction) during the final quarter of training (block 4). This included a total of 798
200 hidden state samples (100 per context). These activations were divided into two split-halves, 799
each containing 10 samples for each of the five pairs per context. These samples in each split-800
half were evenly divided into those drawn from the first half of a context exposure and those from 801
the second half, controlling for any strengthening of context representation over time. We 802
averaged the hidden state activation within each node for each object within each context. We 803
then computed the Pearson correlation coefficient for all pairwise comparisons of objects within 804
and across contexts, producing an RSA matrix. This matrix compared the split-half object
representations, with one half plotted along the x-axis and the other along the y-axis; each axis
had 10 cells, one for each of the five pairs viewed in each context. The upper-left and
lower-right quadrants contained correlations between the five pairs from the same context. The
upper-right and lower-left quadrants contained correlations between the same objects when
viewed in opposing contexts. To quantify the distributed representation, we calculated the
difference between the average within-context and between-context correlations for each object
and divided it by one minus the average within-context correlation. This normalization
penalized models with lower within-context stability: lower within-context similarity produces a
larger denominator and therefore a smaller distributed representation index,
ensuring that observed differences in between-context representation were not artifacts of noisy
or unstable object representations.
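Read literally, the index is the within-minus-between difference divided by one minus the within-context correlation. The sketch below implements that reading. The function name is hypothetical, and treating the inputs as lists of per-object split-half correlations that are averaged before the ratio is taken is an assumption about the order of operations.

```python
def distributed_index(within_corrs, between_corrs):
    """(mean within - mean between) / (1 - mean within).

    within_corrs: split-half correlations for objects in the same context.
    between_corrs: split-half correlations for the same objects across contexts.
    """
    within = sum(within_corrs) / len(within_corrs)
    between = sum(between_corrs) / len(between_corrs)
    if within >= 1.0:
        raise ValueError("within-context correlation must be below 1")
    return (within - between) / (1.0 - within)


# Same raw difference (0.4), different within-context stability:
stable = distributed_index([0.9], [0.5])    # about 4.0
unstable = distributed_index([0.6], [0.2])  # about 1.0
```

The two calls illustrate the penalty described above: an identical within-minus-between difference yields a smaller index when within-context similarity is lower.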
817
Hidden layer analyses: Lesion analysis 818
To further understand how the GRU model’s hidden layer representations supported successful 819
context-dependent learning, we conducted an intervention analysis. For all weight configurations, 820
we trained a new set of 50 models with the same single-epoch training procedure and then 821
systematically tested their performance on the 2AFC task while “lesioning” (zeroing out) subsets 822
of hidden layer nodes. Importantly, these nodes were active during training on the 1600-object 823
sequence, and the lesioning intervention was applied only immediately prior to the 2AFC testing 824
phase. 825
826
We first calculated the absolute context sensitivity of each hidden layer node using a one-way 827
ANOVA of activations between Context A and Context B during the final quarter of training (block 828
4), in the same way as computing the sparse representation index. Nodes were then ranked from 829
the most context-sensitive (largest F-statistic) to least sensitive. During the 2AFC task, subsets of 830
hidden layer nodes were progressively lesioned, beginning with the most context-sensitive nodes. 831
We evaluated models under the following lesioning conditions: 0 (no nodes lesioned to obtain 832
baseline performance estimate), 1, 5, 10, 25, 50, 75, 100, 125, 130, 135, 140, 145, and 150 (all 833
nodes lesioned with expectation of chance performance). We report the average 2AFC 834
performance on Context A, Context B, and context-independent question sets, expressed as both
obtained accuracy and the change in performance relative to the no-lesion baseline (i.e., when
no intervention is applied).
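The ranking and lesioning steps can be sketched as a mask over hidden units. The function names are hypothetical, and applying the mask to the hidden state at every time step of a test trial is an assumption about how the zeroing was carried out.

```python
LESION_SIZES = [0, 1, 5, 10, 25, 50, 75, 100, 125, 130, 135, 140, 145, 150]


def lesion_mask(f_stats, n_lesioned):
    """Return a 0/1 mask that zeroes the n most context-sensitive nodes.

    f_stats: one F statistic per hidden node (larger = more context-sensitive).
    """
    ranked = sorted(range(len(f_stats)), key=lambda i: f_stats[i], reverse=True)
    lesioned = set(ranked[:n_lesioned])
    return [0.0 if i in lesioned else 1.0 for i in range(len(f_stats))]


def apply_lesion(hidden, mask):
    """Element-wise product of a hidden state vector and the lesion mask."""
    return [h * m for h, m in zip(hidden, mask)]
```

Looping over `LESION_SIZES` and re-running the 2AFC evaluation with each mask would give the accuracy curve relative to the no-lesion baseline.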
838
Statistical Analysis 839
2AFC task 840
2AFC task performance was assessed by evaluating accuracy and average confidence rating on 841
subsets of 2AFC questions. Group-level accuracy was tested against 50% chance using a one-842
sample Student’s t-test with Holm-Bonferroni correction applied for three comparisons (Context A, 843
Context B, and context-independent trials) within each experiment. 844
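For reference, the Holm-Bonferroni step-down adjustment for a family of three tests can be written in a few lines. This is a generic implementation of the standard procedure, not the authors' code, and the p-values in the usage line are hypothetical.

```python
def holm_bonferroni(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


# Hypothetical p-values for Context A, Context B, and context-independent trials.
adjusted = holm_bonferroni([0.01, 0.04, 0.03])
```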
845
We conducted a mixed-design ANOVA to examine the effects of experiment (between-subjects 846
factor: Expt. 1 versus Expt. 2) and context-dependence (within-subject factor: context-dependent 847
versus context-independent trials) on 2AFC accuracy using the Python pingouin package. To 848
assess whether context-dependent 2AFC accuracy was statistically equivalent between 849
experiments, we used Bayesian estimation with a region of practical equivalence (ROPE) 850
approach. We computed the posterior distribution of the mean difference in accuracy between 851
Expt. 1 and Expt. 2 and quantified the proportion of the posterior mass falling within a predefined 852
ROPE of [-5%, 5%]. This ROPE was selected to reflect the smallest effect size of interest,
consistent with typical variability in task accuracy in the statistical learning literature. Posterior
distributions were estimated using the PyMC package.
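Once posterior samples of the mean difference are available, the ROPE summary reduces to a proportion. The sketch below assumes accuracy is expressed as a proportion (so the ROPE is ±0.05) and that the bounds are inclusive; the function name and the sample values are hypothetical.

```python
def proportion_in_rope(posterior_samples, rope=(-0.05, 0.05)):
    """Fraction of posterior samples of the mean difference inside the ROPE."""
    low, high = rope
    inside = sum(1 for s in posterior_samples if low <= s <= high)
    return inside / len(posterior_samples)


# Hypothetical posterior draws of (Expt. 1 accuracy - Expt. 2 accuracy).
example = proportion_in_rope([-0.06, -0.02, 0.0, 0.03, 0.08])
```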
856
Online learning assessment 857
We quantified online learning using participants’ RTs during the learning phase, in which they 858
indicated whether each object contained an “×” or “+”. For each block, we computed an 859
anticipation score as the average RT to the second item of each pair subtracted from the average 860
RT to the first item of each pair. This metric captured facilitation for predictable second items 861
while controlling for overall RT drift throughout the session, as first items follow unpredictable 862
transitions. Positive values indicate faster responses to second items relative to first items. 863
864
To assess changes in online learning across blocks, we applied a linear contrast with weights
[-3, -1, 1, 3] to the blockwise anticipation scores for each participant and tested the group-level
difference from zero using a two-tailed one-sample t-test for each experiment.
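The anticipation score and linear contrast can be illustrated with the group-mean RTs for Expt. 1 reported in Table S1. This is only an arithmetic illustration: the reported analysis applies the contrast to each participant's blockwise scores, not to group means, and the function names are hypothetical.

```python
CONTRAST_WEIGHTS = [-3, -1, 1, 3]


def anticipation_score(rt_item1, rt_item2):
    """Mean RT to first items minus mean RT to second items (ms)."""
    return sum(rt_item1) / len(rt_item1) - sum(rt_item2) / len(rt_item2)


def linear_contrast(block_scores, weights=CONTRAST_WEIGHTS):
    """Weighted sum of blockwise scores; positive values indicate an increase."""
    return sum(w * s for w, s in zip(weights, block_scores))


# Group-mean RTs (ms) for Expt. 1 from Table S1, used only as an illustration.
item1_means = [712, 668, 650, 639]
item2_means = [727, 679, 655, 639]
scores = [a - b for a, b in zip(item1_means, item2_means)]
print(scores)                   # [-15, -11, -5, 0]
print(linear_contrast(scores))  # 51
```

A positive contrast value indicates that anticipation scores rose across blocks, here from -15 ms in block 1 to 0 ms in block 4.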
868
Multivariate regression analysis of representation strategies 869
The two measures of hidden layer activity – sparse representation index (proportion of hidden 870
layer nodes that do not show significant activation difference by context) and distributed 871
representation index (within- versus between-context correlation differences) – were used as 872
predictors in multivariate regression models that aimed to explain the variance in 2AFC
context-dependent accuracy, context-independent accuracy, and the accuracy difference between contexts.
These indices were first computed for each of the 50 models instantiated with each of the eight 875
weight initialization configurations, and then were averaged within each configuration. The 876
resulting eight values for each predictor were z-scored to enable comparison of beta coefficients 877
across predictors. By including both metrics in the same regression models, we assessed their 878
unique contributions to task performance. This allowed us to evaluate whether sparse or 879
distributed representations were more predictive of learning outcomes, providing insight into the 880
mechanisms underlying the GRU model’s ability to process and adapt to context-dependent 881
associations. 882
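A minimal sketch of a regression with two z-scored predictors follows. It assumes ordinary least squares with an intercept and a population standard deviation for z-scoring, neither of which is specified in the text; the function names are hypothetical.

```python
import numpy as np


def zscore(values):
    """Standardize a 1-D array to zero mean and unit standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


def fit_two_predictor_regression(sparse_idx, distributed_idx, accuracy):
    """Ordinary least squares with z-scored predictors and an intercept.

    Returns (intercept, beta_sparse, beta_distributed).
    """
    design = np.column_stack([
        np.ones(len(accuracy)),
        zscore(sparse_idx),
        zscore(distributed_idx),
    ])
    target = np.asarray(accuracy, dtype=float)
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return tuple(coefs)
```

Because both predictors are on the same standardized scale, the two returned betas can be compared directly, which is the purpose of the z-scoring step described above.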
883
Acknowledgements
884
885
F.P. was supported by the National Science Foundation Graduate Research Fellowship Program 886
under Grant Nos. DGE-2034835 and DGE-2444110. 887
888
Author Contributions 889
890
Conceptualization: FCP, JR, HL; Methodology: FCP, JR, HL; Software: FCP; Formal analysis: 891
FCP; Visualization: FCP; Supervision: JR, HL; Writing – original draft: FCP; Writing – review & 892
editing: FCP, HL, JR. 893
894
Declaration of interests 895
896
The authors declare no competing interests. 897
898
References
899
900
1. Friston, K. (2010). The free-energy principle: a unified brain theory? Nat. Rev. Neurosci. 11, 901
127–138. https://doi.org/10.1038/nrn2787. 902
2. Bar, M. (2009). The proactive brain: memory for predictions. Philos. Trans. R. Soc. Lond. B. 903
Biol. Sci. 364, 1235–1243. https://doi.org/10.1098/rstb.2008.0310. 904
3. Clark, A. (2013). Whatever next? Predictive brains, situated agents, and the future of 905
cognitive science. Behav. Brain Sci. 36, 181–204. 906
https://doi.org/10.1017/S0140525X12000477. 907
4. Summerfield, C., and de Lange, F.P. (2014). Expectation in perceptual decision making: 908
neural and computational mechanisms. Nat. Rev. Neurosci. 15, 745–756. 909
https://doi.org/10.1038/nrn3838. 910
5. Heald, J.B., Lengyel, M., and Wolpert, D.M. (2023). Contextual inference in learning and 911
memory. Trends Cogn. Sci. 27, 43–64. https://doi.org/10.1016/j.tics.2022.10.004. 912
6. Heald, J.B., Wolpert, D.M., and Lengyel, M. (2023). The Computational and Neural Bases of 913
Context-Dependent Learning. Annu. Rev. Neurosci. 46, 233–258. 914
https://doi.org/10.1146/annurev-neuro-092322-100402. 915
7. Statistical Learning (2015). 501–506. https://doi.org/10.1016/B978-0-12-397025-1.00276-1. 916
8. Statistical Learning (2015). In Brain Mapping (Elsevier), pp. 501–506. 917
https://doi.org/10.1016/b978-0-12-397025-1.00276-1. 918
9. Sherman, B.E., Graves, K.N., and Turk-Browne, N.B. (2020). The prevalence and importance 919
of statistical learning in human cognition and behavior. Curr. Opin. Behav. Sci. 32, 15–20. 920
https://doi.org/10.1016/j.cobeha.2020.01.015. 921
10. Saffran, J.R., and Kirkham, N.Z. (2018). Infant Statistical Learning. Annu. Rev. Psychol. 69, 922
181–203. https://doi.org/10.1146/annurev-psych-122216-011805. 923
11. Fiser, J., and Aslin, R.N. (2001). Unsupervised Statistical Learning of Higher-Order Spatial
Structures from Visual Scenes. Psychol. Sci. 12, 499–504. https://doi.org/10.1111/1467-9280.00392.
12. Saffran, J.R., Aslin, R.N., and Newport, E.L. (1996). Statistical Learning by 8-Month-Old 927
Infants. Science 274, 1926–1928. 928
13. Conway, C.M., and Christiansen, M.H. (2005). Modality-Constrained Statistical Learning of 929
Tactile, Visual, and Auditory Sequences. J. Exp. Psychol. Learn. Mem. Cogn. 31, 24–39. 930
https://doi.org/10.1037/0278-7393.31.1.24. 931
14. Bouton, M.E. (1993). Context, time, and memory retrieval in the interference paradigms of 932
Pavlovian learning. Psychol. Bull. 114, 80–99. https://doi.org/10.1037/0033-2909.114.1.80. 933
15. McAllister, D.E., and McAllister, W.R. (1994). Extinction and Reconditioning of Classically 934
Conditioned Fear before and after Instrumental Learning: Effects of Depth of Fear Extinction. 935
Learn. Motiv. 25, 339–367. https://doi.org/10.1006/lmot.1994.1018. 936
16. Bouton, M.E. (2004). Context and Behavioral Processes in Extinction. Learn. Mem. 11, 485–494.
https://doi.org/10.1101/lm.78804.
17. Izquierdo, A., and Jentsch, J.D. (2012). Reversal learning as a measure of impulsive and 939
compulsive behavior in addictions. Psychopharmacology (Berl.) 219, 607–620. 940
https://doi.org/10.1007/s00213-011-2579-7. 941
18. Weiss, D.J., Gerfen, C., and Mitchel, A.D. (2009). Speech Segmentation in a Simulated 942
Bilingual Environment: A Challenge for Statistical Learning? Lang. Learn. Dev. 5, 30–49. 943
https://doi.org/10.1080/15475440802340101. 944
19. Gebhart, A.L., Aslin, R.N., and Newport, E.L. (2009). Changing Structures in Midstream: 945
Learning Along the Statistical Garden Path. Cogn. Sci. 33, 1087–1116. 946
https://doi.org/10.1111/j.1551-6709.2009.01041.x. 947
20. Siegelman, N., Bogaerts, L., Kronenfeld, O., and Frost, R. (2018). Redefining “Learning” in 948
Statistical Learning: What Does an Online Measure Reveal About the Assimilation of Visual 949
Regularities? Cogn. Sci. 42, 692–727. https://doi.org/10.1111/cogs.12556. 950
21. Qian, T., Jaeger, T.F., and Aslin, R.N. (2016). Incremental implicit learning of bundles of 951
statistical patterns. Cognition 157, 156–173. https://doi.org/10.1016/j.cognition.2016.09.002. 952
22. Smith, C.M., Thompson-Schill, S.L., and Schapiro, A.C. (2024). Rapid Learning of Temporal 953
Dependencies at Multiple Timescales. J. Cogn. Neurosci. 36, 2343–2356. 954
https://doi.org/10.1162/jocn_a_02232. 955
23. Heald, J.B., Lengyel, M., and Wolpert, D.M. (2021). Contextual inference underlies the
learning of sensorimotor repertoires. Nature 600, 489–493.
https://doi.org/10.1038/s41586-021-04129-3.
24. Yamins, D.L.K., and DiCarlo, J.J. (2016). Using goal-driven deep learning models to 959
understand sensory cortex. Nat. Neurosci. 19, 356–365. https://doi.org/10.1038/nn.4244. 960
25. Saxe, A., Nelli, S., and Summerfield, C. (2021). If deep learning is the answer, what is the 961
question? Nat. Rev. Neurosci. 22, 55–67. https://doi.org/10.1038/s41583-020-00395-8. 962
26. Alamia, A., Gauducheau, V., Paisios, D., and VanRullen, R. (2020). Comparing feedforward 963
and recurrent neural network architectures with human behavior in artificial grammar 964
learning. Sci. Rep. 10, 22172. https://doi.org/10.1038/s41598-020-79127-y. 965
27. Flesch, T., Juechems, K., Dumbalska, T., Saxe, A., and Summerfield, C. (2022). Orthogonal 966
representations for robust context-dependent task performance in brains and neural 967
networks. Neuron 110, 1258-1270.e11. https://doi.org/10.1016/j.neuron.2022.01.005. 968
28. Lu, Q., Nguyen, T.T., Zhang, Q., Hasson, U., Griffiths, T.L., Zacks, J.M., Gershman, S.J., and
Norman, K.A. (2024). Reconciling shared versus context-specific information in a neural
network model of latent causes. Sci. Rep. 14, 16782.
https://doi.org/10.1038/s41598-024-64272-5.
29. Franklin, N.T., Norman, K.A., Ranganath, C., Zacks, J.M., and Gershman, S.J. (2020). 973
Structured Event Memory: A neuro-symbolic model of event cognition. Psychol. Rev. 127, 974
327–361. https://doi.org/10.1037/rev0000177. 975
30. Elman, J.L. (1990). Finding Structure in Time. Cogn. Sci. 14, 179–211. 976
https://doi.org/10.1207/s15516709cog1402_1. 977
31. Hasson, U., Nastase, S.A., and Goldstein, A. (2020). Direct Fit to Nature: An Evolutionary 978
Perspective on Biological and Artificial Neural Networks. Neuron 105, 416–434. 979
https://doi.org/10.1016/j.neuron.2019.12.002. 980
32. Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural Tangent Kernel: Convergence and 981
Generalization in Neural Networks. In Advances in Neural Information Processing Systems 982
(Curran Associates, Inc.). 983
33. Chizat, L., Oyallon, E., and Bach, F. (2019). On Lazy Training in Differentiable Programming. 984
In Advances in Neural Information Processing Systems (Curran Associates, Inc.). 985
34. McClelland, J.L., McNaughton, B.L., and O’Reilly, R.C. (1995). Why there are complementary 986
learning systems in the hippocampus and neocortex: Insights from the successes and 987
failures of connectionist models of learning and memory. Psychol. Rev. 102, 419–457. 988
https://doi.org/10.1037/0033-295X.102.3.419. 989
35. Schapiro, A.C., Turk-Browne, N.B., Botvinick, M.M., and Norman, K.A. (2017). 990
Complementary learning systems within the hippocampus: a neural network modelling 991
approach to reconciling episodic memory with statistical learning. Philos. Trans. R. Soc. B 992
Biol. Sci. 372, 20160049. https://doi.org/10.1098/rstb.2016.0049. 993
36. Leutgeb, J.K., Leutgeb, S., Moser, M.-B., and Moser, E.I. (2007). Pattern Separation in the 994
Dentate Gyrus and CA3 of the Hippocampus. Science 315, 961–966. 995
https://doi.org/10.1126/science.1135801. 996
37. Schapiro, A.C., Rogers, T.T., Cordova, N.I., Turk-Browne, N.B., and Botvinick, M.M. (2013). 997
Neural representations of events arise from temporal community structure. Nat. Neurosci. 16, 998
486–492. https://doi.org/10.1038/nn.3331. 999
38. Welford, W.T., Brebner, J.M.T., and Kirby, N. (1980). Reaction Times (Stanford University). 1000
39. Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of initialization 1001
and momentum in deep learning. In Proceedings of the 30th International Conference on 1002
Machine Learning (PMLR), pp. 1139–1147. 1003
40. Dominé, C.C.J., Anguita, N., Proca, A.M., Braun, L., Kunin, D., Mediano, P.A.M., and Saxe,
A.M. (2025). From Lazy to Rich: Exact Learning Dynamics in Deep Linear Networks. Preprint
at arXiv, https://doi.org/10.48550/arXiv.2409.14623.
41. Hinton, G.E. (1986). Learning Distributed Representations of Concepts. Proc. Annu. Meet. 1008
Cogn. Sci. Soc. 8. 1009
42. Destrebecqz, A., and Cleeremans, A. (2001). Can sequence learning be implicit? New 1010
evidence with the process dissociation procedure. Psychon. Bull. Rev. 8, 343–350. 1011
https://doi.org/10.3758/BF03196171. 1012
43. Vékony, T., Farkas, B.C., Brezóczki, B., Mittner, M., Csifcsák, G., Simor, P., and Németh, D. 1013
(2025). Mind wandering enhances statistical learning. iScience 28. 1014
https://doi.org/10.1016/j.isci.2024.111703. 1015
44. Conway, C.M. (2020). How does the brain learn environmental structure? Ten core principles 1016
for understanding the neurocognitive mechanisms of statistical learning. Neurosci. Biobehav. 1017
Rev. 112, 279–299. https://doi.org/10.1016/j.neubiorev.2020.01.032. 1018
45. Perruchet, P., and Pacton, S. (2006). Implicit learning and statistical learning: one 1019
phenomenon, two approaches. Trends Cogn. Sci. 10, 233–238. 1020
https://doi.org/10.1016/j.tics.2006.03.006. 1021
46. Cleeremans, A., and McClelland, J.L. (1991). Learning the structure of event sequences. J. 1022
Exp. Psychol. Gen. 120, 235–253. https://doi.org/10.1037/0096-3445.120.3.235. 1023
47. Chiarella, S.G., Simione, L., D’Angiò, M., Saracini, C., Raffone, A., and Di Pace, E. (2026). 1024
Implicit observational learning of second-order conditional repeated sequences presented in 1025
rapid serial visual presentation. Conscious. Cogn. 137, 103967. 1026
https://doi.org/10.1016/j.concog.2025.103967. 1027
48. O’Reilly, R.C., and Rudy, J.W. (2001). Conjunctive representations in learning and memory: 1028
Principles of cortical and hippocampal function. Psychol. Rev. 108, 311–345. 1029
https://doi.org/10.1037/0033-295X.108.2.311. 1030
49. Glimcher, P.W. (2011). Understanding dopamine and reinforcement learning: The dopamine 1031
reward prediction error hypothesis. Proc. Natl. Acad. Sci. 108, 15647–15654. 1032
https://doi.org/10.1073/pnas.1014269108. 1033
50. Zacks, J.M., Kurby, C.A., Eisenberg, M.L., and Haroutunian, N. (2011). Prediction Error 1034
Associated with the Perceptual Segmentation of Naturalistic Events. J. Cogn. Neurosci. 23, 1035
4057–4066. https://doi.org/10.1162/jocn_a_00078. 1036
51. Smith, C.M., Thompson-Schill, S.L., and Schapiro, A.C. (2024). Rapid Learning of Temporal 1037
Dependencies at Multiple Timescales. J. Cogn. Neurosci. 36, 2343–2356. 1038
https://doi.org/10.1162/jocn_a_02232. 1039
52. Narkhede, M.V., Bartakke, P.P., and Sutaone, M.S. (2022). A review on weight initialization
strategies for neural networks. Artif. Intell. Rev. 55, 291–322.
https://doi.org/10.1007/s10462-021-10033-z.
53. Rigotti, M., Barak, O., Warden, M.R., Wang, X.-J., Daw, N.D., Miller, E.K., and Fusi, S. 1043
(2013). The importance of mixed selectivity in complex cognitive tasks. Nature 497, 585–590. 1044
https://doi.org/10.1038/nature12160. 1045
54. Mızrak, E., Bouffard, N.R., Libby, L.A., Boorman, E.D., and Ranganath, C. (2021). The 1046
hippocampus and orbitofrontal cortex jointly represent task structure during memory-guided 1047
decision making. Cell Rep. 37, 110065. https://doi.org/10.1016/j.celrep.2021.110065. 1048
55. Chanales, A.J.H., Oza, A., Favila, S.E., and Kuhl, B.A. (2017). Overlap among Spatial 1049
Memories Triggers Repulsion of Hippocampal Representations. Curr. Biol. 27, 2307-2317.e5. 1050
https://doi.org/10.1016/j.cub.2017.06.057. 1051
56. Schapiro, A.C., Kustner, L.V., and Turk-Browne, N.B. (2012). Shaping of Object 1052
Representations in the Human Medial Temporal Lobe Based on Temporal Regularities. Curr. 1053
Biol. 22, 1622–1627. https://doi.org/10.1016/j.cub.2012.06.056. 1054
57. Barlow, H. (2001). Redundancy reduction revisited. Netw. Bristol Engl. 12, 241–253. 1055
58. Fusi, S., and Abbott, L.F. (2007). Limits on the memory storage capacity of bounded 1056
synapses. Nat. Neurosci. 10, 485–493. https://doi.org/10.1038/nn1859. 1057
59. Forest, T.A., Schlichting, M.L., Duncan, K.D., and Finn, A.S. (2023). Changes in statistical
learning across development. Nat. Rev. Psychol. 2, 205–219.
https://doi.org/10.1038/s44159-023-00157-0.
60. Peirce, J., Gray, J.R., Simpson, S., MacAskill, M., Höchenberger, R., Sogo, H., Kastman, E.,
and Lindeløv, J.K. (2019). PsychoPy2: Experiments in behavior made easy. Behav. Res.
Methods 51, 195–203. https://doi.org/10.3758/s13428-018-01193-y.
61. Hsu, N.S., Schlichting, M.L., and Thompson-Schill, S.L. (2014). Feature Diagnosticity Affects 1064
Representations of Novel and Familiar Objects. J. Cogn. Neurosci. 26, 2735–2749. 1065
https://doi.org/10.1162/jocn_a_00661. 1066
62. Schlichting, M.L., Mumford, J.A., and Preston, A.R. (2015). Learning-related representational 1067
changes reveal dissociable integration and separation signatures in the hippocampus and 1068
prefrontal cortex. Nat. Commun. 6, 8151. https://doi.org/10.1038/ncomms9151. 1069
63. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z.,
Gimelshein, N., Antiga, L., et al. (2019). PyTorch: An Imperative Style, High-Performance
Deep Learning Library. Preprint at arXiv, https://doi.org/10.48550/arXiv.1912.01703.
64. Kriegeskorte, N. (2011). Pattern-information analysis: From stimulus decoding to 1074
computational-model testing. NeuroImage 56, 411–421. 1075
https://doi.org/10.1016/j.neuroimage.2011.01.061. 1076
1077
Supplementary Information Appendix 1081
1082
1083
1084
Fig. S1. 2AFC test performance by direct and indirect conflict question subsets. 1085
Bar height reflects the group average on 2AFC context-dependent questions, split into
indirect-conflict (light coloring) and direct-conflict (dark coloring) subsets for Context A
(left, orange bars) and Context B (right, green bars). ***p<0.001; **p<0.01; *p<0.05.
1089
1090
1091
Fig. S2. Visualization of stimulus presentation during learning phase. 1092
Each trial of the learning phase featured four stimuli arranged as depicted, with one object of 1093
interest (on which participants needed to make an × /+ judgment) and three circular phase-1094
scrambled objects presented in the remaining positions. A black or white border was present 1095
during Expt. 2. Visualization of objects and border is to scale. 1096
1097
Table S1. Reaction times (mean ± standard deviation) in milliseconds by block for
context-dependent pair objects. Item 1 is the first, unpredictable element of each pair; Item 2 is the
second, predictable element informed by the associative expectation.

           Expt. 1: Unsignaled       Expt. 2: Signaled
           Item 1      Item 2        Item 1      Item 2
Block 1    712 ± 72    727 ± 75      710 ± 67    713 ± 57
Block 2    668 ± 77    679 ± 84      662 ± 76    664 ± 64
Block 3    650 ± 76    655 ± 85      648 ± 77    643 ± 62
Block 4    639 ± 74    639 ± 83      631 ± 71    626 ± 62

S1 Confidence judgments and explicit knowledge assessments

We examined whether participants’ confidence ratings on the 2AFC task were related to their
accuracy. Confidence was significantly higher for accurate than inaccurate 2AFC responses for
context-dependent trials in both experiments (Unsignaled: t(49) = 4.1, p < 0.001; Signaled: t(49) =
3.96, p < 0.001) and on context-independent trials in Expt. 1 (Unsignaled; t(37) = 2.6, p = 0.013)
but not Expt. 2 (t(39) = 0.38, p = 0.7) (Fig. S3A). Participants who responded entirely correctly or
incorrectly were excluded from analysis; all participants had mixed accuracy on
context-dependent trials. Overall, mean confidence ratings were around or below the midpoint of the
scale, indicating generally low subjective certainty during the 2AFC task.

Following the 2AFC task, participants completed two additional assessments designed to
measure explicit knowledge of the temporal structure: a Structure Knowledge Probe and Pair
Reconstruction Task.

For the Structure Knowledge Probe, binary performance was assessed by manually evaluating
whether participants articulated explicit awareness of temporal pair structure in their written
responses. This measure did not evaluate knowledge of the dual context structure (e.g.,
participants did not need to articulate awareness that there were two distinct contexts where the
associative pairings changed). Two independent raters coded all responses with 91% agreement;
discrepancies were resolved by deferring to the more senior grader. Explicit knowledge of the pair
structure was identified in 34.0% of participants in Expt. 1 and 40.0% in Expt. 2.

Performance on the Pair Reconstruction Task varied because participants could report between 1
and 20 pairs. To estimate chance performance, we implemented Monte Carlo simulations in which
1,000 simulations were run for each possible number of reported pairs (k = 1–20). In each
simulation, objects were sampled from the 11 unique objects with replacement between pairs
but without replacement within a pair (i.e., no pair comprised the same object twice). This produced
a null distribution of the proportion of correct entries expected by chance for each k. This empirical
approach matches the analytical solution: there were 9 correct pairs (because one pair was
context-independent and thus correct in both contexts), and the probability of guessing any one
specific pair by chance was 1/110 (11 × 10 ordered choices of 2 of the 11 objects without
replacement). Thus, the probability of guessing one of the 9 correct pairs was 9/110, or 8.2%.
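
The chance calculation above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function names are hypothetical, and `PLACEHOLDER_CORRECT_PAIRS` is an arbitrary stand-in for the 9 correct ordered pairs, whose actual object identities are not given here.

```python
import random

N_OBJECTS = 11   # unique objects (from the text)
N_SIMS = 1000    # simulations per number of reported pairs k (from the text)

# Arbitrary placeholder for the 9 correct ordered pairs (hypothetical identities).
PLACEHOLDER_CORRECT_PAIRS = {
    (0, 1), (2, 3), (4, 5), (6, 7), (8, 9),
    (0, 3), (2, 5), (4, 7), (6, 10),
}

def analytic_chance(n_objects=N_OBJECTS, n_correct=9):
    """Probability that one random ordered pair of distinct objects is correct."""
    return n_correct / (n_objects * (n_objects - 1))

def simulate_chance(k, correct_pairs, n_objects=N_OBJECTS, n_sims=N_SIMS, seed=0):
    """Mean proportion of correct entries when k pairs are guessed at random."""
    rng = random.Random(seed)
    objects = range(n_objects)
    total = 0.0
    for _ in range(n_sims):
        hits = 0
        for _ in range(k):
            # Without replacement within a pair, with replacement between pairs.
            first, second = rng.sample(objects, 2)
            hits += (first, second) in correct_pairs
        total += hits / k
    return total / n_sims

if __name__ == "__main__":
    print("analytic:", round(analytic_chance(), 4))
    for k in (1, 5, 20):
        print(k, round(simulate_chance(k, PLACEHOLDER_CORRECT_PAIRS), 4))
```

Because each guessed pair is an independent draw, the simulated proportion should hover near 9/110 for every k, which is why the empirical and analytical estimates agree.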

Participants reported an average of 7.4 ± 4 pairs in Expt. 1 and 7.9 ± 4 pairs in Expt. 2 (Fig. S3B).
Group-level significance was assessed by testing whether the average number of correct
context-independent pairs was greater than expected by chance over the simulations of all possible
pair entry counts. Context-dependent pair entry performance was non-significant for both experiments (Unsignaled:
mean = 2.06 pairs; p = 0.11; Signaled: mean = 2.08 pairs; p = 0.11; Fig. S3B). Moreover, only
36% of Expt. 1 participants and 34.7% of Expt. 2 participants (i.e., 17 out of 49 participants; one
participant did not complete this portion of the experiment) reported the context-independent pair.

Taken together, these results indicate that most participants had little to no explicit knowledge of
the temporal pair structure: they were generally unable to recall the context-independent pair,
articulate the underlying pair structure, or reconstruct the context-dependent associations. Thus,
the significant 2AFC performance reflecting context-dependent learning is unlikely to have been
driven by explicit awareness.

Fig. S3. Metacognitive awareness assessment results.
(A) Average 2AFC confidence rating for context-dependent (left) and context-independent (right)
questions for accurate (pink) and inaccurate (blue) question responses. Horizontal dashed line
indicates midpoint of the confidence scale, and error bars reflect SEM. (B) Pair Reconstruction
Task performance: bar height reflects group average number of total pairs reported (left) and
correct context-independent pairs reported (right). Horizontal line reflects chance performance
based on Monte Carlo simulations; error bars reflect SEM.

S2 Model architecture comparison

In the main paper, we analyze a GRU model with training that was constrained to a single epoch,
equivalent to the total sequence exposure of each human participant. Here, we justify that
decision with comparison to two simpler models: a feedforward neural network (FFNN) and a
vanilla recurrent neural network (RNN) that lacked gated recurrent units. One learning phase
sequence and one set of 2AFC questions were generated for each model in the same way as for
human participants (1,600 objects), and each epoch of training consisted of updating model
weights to predict the next item in this sequence, followed by an assessment of 2AFC accuracy
with frozen weights. For each model, we continued this process for a total of 50 epochs (i.e., 50
times the sequence exposure given to human participants). All models here used the default
PyTorch weight initialization where weights are drawn from a uniform distribution bounded by plus
or minus the inverse of the square root of the layer size, which was 0.08 for the hidden layer.
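
As a quick arithmetic check of the stated bound, the sketch below computes 1/sqrt(layer size) for the 150-unit hidden layer reported in S5, along with the reduced sizes examined there. The helper name `default_uniform_bound` is hypothetical; it only reproduces the formula described in the text and does not call PyTorch.

```python
import math

def default_uniform_bound(hidden_size: int) -> float:
    """Half-width b of the U(-b, b) range, with b = 1 / sqrt(hidden_size)."""
    return 1.0 / math.sqrt(hidden_size)

if __name__ == "__main__":
    # 150 is the main-text hidden layer size; 50 and 100 are the sizes tested in S5.
    for size in (50, 100, 150):
        print(size, round(default_uniform_bound(size), 3))
```

For 150 units the bound rounds to 0.08, matching the value given above; smaller layers receive a wider initialization range under the same rule.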

The simplest architecture, the FFNN, achieved an overall context-dependent accuracy of almost
75% (Fig. S4A). However, this performance was entirely driven by near-perfect accuracy on
Context B, the most recently trained context, while accuracy on Context A remained near chance.
This indicates that the FFNN retained knowledge only about the most recent associations,
completely overwriting previously learned, conflicting ones, a hallmark of catastrophic
interference.

RNNs improve on feedforward model capabilities by incorporating information from past hidden
states with the current state, enabling them to process sequential input. However, the RNN
showed no improvement in overall context-dependent accuracy compared to the FFNN (Fig.
S4B). While Context A performance did increase over learning, this improvement came at the
expense of Context B performance, suggesting the RNN is also prone to interference.

The RNN with GRUs, an advanced RNN variant, overcomes these limitations by using update and
reset gates to manage long-term dependencies more effectively. Initially, GRU performance was
comparable to the FFNN (Fig. S4C). However, with extended training (approximately 20 epochs;
i.e., 20 times the exposure of human participants), the GRU achieved comparably high accuracy
on both Context A and Context B. This performance likely stems from the GRU’s architectural
advantages. The update gate controls how much new input influences retained memory, allowing
the model to ignore unreliable input (such as noisy between-pair transitions). The reset gate
allows the selective clearing of irrelevant information in response to context changes, thereby
avoiding interference from outdated associations. Given that the GRU model provides the best
account for context-dependent learning in humans, we used the GRU model for all of the
single-epoch modeling analyses reported in the main text.
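
To make the roles of the two gates concrete, the following is a minimal single-unit GRU step in plain Python. It is an illustrative sketch, not the model used here: it follows the standard GRU formulation (in the convention where an update gate near 1 retains the previous state), uses scalar input and hidden state, and all parameter values in `EXAMPLE_PARAMS` are hypothetical.

```python
import math

def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical parameters for a single-unit GRU, chosen only for illustration.
EXAMPLE_PARAMS = {
    "w_ir": 0.5, "w_hr": 0.5, "b_r": 0.0,   # reset gate
    "w_iz": 0.5, "w_hz": 0.5, "b_z": 0.0,   # update gate
    "w_in": 1.0, "w_hn": 1.0, "b_in": 0.0, "b_hn": 0.0,  # candidate state
}

def gru_step(x: float, h_prev: float, p: dict) -> float:
    """One GRU step: returns the new hidden state given input x and state h_prev."""
    r = sigmoid(p["w_ir"] * x + p["w_hr"] * h_prev + p["b_r"])  # reset gate
    z = sigmoid(p["w_iz"] * x + p["w_hz"] * h_prev + p["b_z"])  # update gate
    # The reset gate scales how much of the previous state enters the candidate.
    n = math.tanh(p["w_in"] * x + p["b_in"] + r * (p["w_hn"] * h_prev + p["b_hn"]))
    # The update gate interpolates between the candidate and the previous state.
    return (1.0 - z) * n + z * h_prev

# Update gate saturated near 1: the input is ignored and memory is retained.
keep = dict(EXAMPLE_PARAMS, b_z=50.0)
# Update and reset gates near 0: the previous state is cleared and overwritten.
overwrite = dict(EXAMPLE_PARAMS, b_z=-50.0, b_r=-50.0)

h_keep = gru_step(1.0, 0.3, keep)
h_new = gru_step(1.0, 0.3, overwrite)

if __name__ == "__main__":
    print(h_keep, h_new)
```

In the first case the hidden state stays at its previous value, which corresponds to ignoring unreliable input; in the second it depends only on the current input, which corresponds to clearing outdated information after a context change.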

Fig. S4. 2AFC task performance for three neural network model classes.
(A-C) 2AFC performance accuracy on context-dependent questions averaged for 50 individual
models for each architecture after each of 50 epochs of training; accuracy is plotted separately for
Context A (orange), Context B (green), and both contexts combined (blue) for each model class:
(A) simple feedforward neural network (FFNN), (B) vanilla recurrent neural network (RNN), and
(C) recurrent neural network with gated recurrent units (GRU). All models achieved perfect
accuracy on context-independent questions after the first epoch (not pictured).

S3 GRU performance with perceptual object representations

The main paper used one-hot vector representations for each object in the modeling analysis.
This choice ensured that all objects were represented equally and orthogonally, such that any
structure emerging in the hidden layer reflected purely learned associations rather than
preexisting similarities among the inputs. Here, we present the same analysis using perceptual
object representations that more closely approximate the visual experience of human participants
in the task. Perceptual object representations were generated by inputting each object image
(without the overlaid plus or minus symbol) into AlexNet (1) and then applying PCA to reduce the
dimensionality to 11 dimensions, matching the number of input and output dimensions of the
original model. GRU networks were trained on the same context-dependent sequential prediction
task as in the main text, using cosine similarity as the loss function, and assignment of objects to
specific pairs was randomized for each model in the same way as for human participants.
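
For readers unfamiliar with a cosine-based objective, the sketch below shows one common way to turn cosine similarity into a loss over continuous embedding vectors. The exact form used for these models is not specified above; the `1 - cosine similarity` formulation and the function names are assumptions made for illustration only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_loss(prediction, target):
    """Assumed loss form: 0 when vectors align, 1 when orthogonal, 2 when opposite."""
    return 1.0 - cosine_similarity(prediction, target)

if __name__ == "__main__":
    target = [1.0, 0.0, 0.0]
    for prediction in ([2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]):
        print(prediction, cosine_loss(prediction, target))
```

A loss of this kind depends only on the direction of the predicted vector, not its magnitude, which suits dense PCA-reduced embeddings where, unlike one-hot codes, the target is not a single active dimension.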

As shown in Fig. S5, the relationship between initialized weight variance and 2AFC accuracy
retained the same non-monotonic profile observed in the models that used one-hot input coding
(Fig. 4A). In addition, 2AFC performance was higher for all question subsets. This improvement is
unsurprising: the use of AlexNet embeddings introduces a strong visual prior that allows the
model to exploit shared perceptual features when making predictions, thereby obscuring the
interpretability of how the hidden layer activity represents the task’s temporal associative
structure. For example, the model could leverage arbitrary similarities in dimensions such as
shape and color to bias its 2AFC responses. In contrast, one-hot encodings constrain all
non-active dimensions to zero, ensuring that any hidden layer structure arises exclusively from
learning the task’s associative regularities. Taken together, these results confirm that the core
finding of optimal task performance emerging at moderate weight initialization variance holds
regardless of input representation. We therefore focus analysis on models trained with one-hot
object encodings because they provide a controlled representational space in which hidden layer
structure reflects the learned associative structure of the task rather than preexisting perceptual
relationships among stimuli.

Fig. S5. 2AFC performance with perceptual object embeddings.
2AFC accuracy (y-axis) on context-dependent test trials for GRU models with weights initialized
with increasing variance along the x-axis, color-coded by question category.

S4 GRU performance with a fully overlapping stimulus set (no context-specific objects)

The pair assignment across contexts for both models and human participants included one
possible caveat to our claim of latent context-dependent learning: for one of the
context-dependent pair types, the second (paired) item appeared in only one of the contexts while the first
item occurred in both (visualized in bottom row of Fig. 1B). This structure still required humans
and the models to update their prediction of the second item based on the inferred context
(consistent with all other context-dependent pairs), but also meant that the context-specific
second item could have been used as a context cue independent of recent sequence history
(Expt. 1) and/or border color (Expt. 2). In other words, an encounter with a context-specific object
could be an indicator that the state of the world has changed and thus that one’s associative
predictions should be updated. We note that our decision to include this pair type was motivated
by our intention to collect fMRI data with this paradigm, which will allow us to assess changes in
neural representational geometry when an object’s associative identity remains constant across
contexts, providing a baseline for evaluating relative changes in other pair conditions.

To evaluate whether the presence of context-specific objects influenced model learning, we ran
neural network simulations in which such pairs were removed and replaced with
context-dependent pairs for which both objects could occur in either context. These models were trained
on the same task, but the context-dependent pairs were reconfigured to maintain the overall
object set, with second-item assignments shuffled across contexts. Model input and output
dimensions were therefore reduced to 10, corresponding to the 10 unique object encodings
needed to instantiate this modified pair set. All other training parameters and analysis of 2AFC
performance were identical to those in the main text.

As shown in Fig. S6, model performance across weight initializations closely mirrored the results
of the main analysis (Fig. 4A), indicating that learning dynamics and context-dependent accuracy
were unaffected by the presence or absence of the context-specific object. This indicates that
such objects did not serve as reliable context cues for our models.

Fig. S6. 2AFC performance with no context-specific objects measured by weight variance.
2AFC accuracy (y-axis) on context-dependent test trials for GRU models with weights initialized
with increasing variance along the x-axis, color-coded by question category.

S5 Determination of hidden layer size

The main paper analyzes a GRU model with 150 hidden layer units. To assess whether model
capacity influenced learning performance, we trained GRU models with reduced hidden layer
sizes of 50 and 100 units. As shown in Fig. S7, all models ultimately achieved comparable
performance, converging to approximately 90% accuracy. However, models with fewer hidden
layer units exhibited slower learning trajectories, requiring more training to reach the same level
of performance as the model with 150 units. These results suggest that while increasing the
number of hidden units accelerates learning, overall task performance is largely independent of
model size.

Fig. S7. 2AFC task performance for GRU models with varying hidden layer sizes.
(A-C) 2AFC performance accuracy on context-dependent questions averaged for 50 individual
models for each hidden layer size after each epoch of training for Context A (orange), Context B
(green), and both contexts combined (blue) for GRU models with (A) 50 hidden layer units, (B)
100 hidden layer units, and (C) 150 hidden layer units.

References

1. A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet Classification with Deep
Convolutional Neural Networks in Advances in Neural Information Processing Systems
(Curran Associates, Inc., 2012).