Spontaneous emergence of context-dependent statistical learning in humans and neural networks

doi:10.64898/2026.03.17.712206

2026 · doi:10.64898/2026.03.17.712206

preprint OA: closed CC-BY-4.0


Abstract

Humans readily extract statistical regularities from experience, yet natural environments require flexible adaptation when associative structures shift across changing contexts, often without warning. Across two experiments, we show that humans can incidentally learn overlapping and conflicting visual associations even when contexts dynamically alternate and remain unsignaled or only minimally cued. To probe the computational mechanisms supporting this adaptive capacity, we trained recurrent neural networks with gated recurrent units on the same statistical learning task without providing any explicit context information. These models spontaneously developed distributed internal representations that robustly separated conflicting associations and supported rapid adaptation to latent context shifts. Critically, we show that these distributed representations, strongly shaped by the model’s initial weight configuration, played a key role in preventing catastrophic interference between contexts. Together, these behavioral and computational results significantly advance our understanding of how humans and artificial systems can successfully learn and flexibly retrieve context-dependent associations under challenging conditions.

Introduction

Many everyday experiences unfold in structured, predictable ways, with events that recur over time in stable patterns. Internalizing these regularities allows anticipation of future occurrences, facilitating efficient information gathering, decision-making, and behavioral adaptation. It follows that the human brain is fundamentally oriented toward predicting the upcoming future based on recent events.1–3 This predictive ability helps conserve cognitive resources by reducing the need for continuous, effortful learning once patterns have been identified.4 However, the world is rarely static: associations often vary across contexts.5 To support adaptive behavior, the brain is thought to engage in context-dependent learning of these regularities and associations for flexible predictions as environmental conditions shift.6 For example, navigating a daily commute relies on learning the timing and location of traffic congestion, and expectations for social interaction may differ when a friend is encountered at work versus at a party. In both cases, prior experience supports the formation of context-bound predictions that guide perception and behavior.

Humans have an innate ability for statistical learning, allowing them to spontaneously discover regularities and associations. This process extracts spatial and temporal regularities from sensory input through passive exposure, without explicit instruction or external rewards.7 Statistical learning is proposed to support a wide range of cognitive functions, including language acquisition, visual perception, object recognition, and social cognition.9,10 Empirical studies demonstrate that individuals can detect regular patterns in continuous streams of stimuli across visual,11 auditory,12 and tactile13 modalities, in the absence of explicit transition cues and instructions.
bioRxiv preprint doi: https://doi.org/10.64898/2026.03.17.712206; this version posted March 18, 2026. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Despite the rich literature on statistical learning, most research has focused on simple, highly reliable associations, such as detecting short sequences of objects or sounds. However, in natural environments, context plays a critical and often unobserved role in shaping how associations are formed. In animal learning research, for example, association-based behavior is known to be highly context-specific: extinguished fear responses return when animals are tested outside the extinction setting.14 Moreover, following extinction or reversal learning, animals reacquire original contingencies more rapidly than during initial learning,15,16 suggesting that prior contingencies are retained as latent knowledge in memory rather than being overwritten by new learning. Cognitive control processes are thought to underpin the behavioral flexibility afforded by suppressing previously useful but no longer relevant responses, allowing learners to pivot between contexts and contingencies as the environment demands.17 Notably, most studies in the animal learning literature involve explicit reinforcement (e.g., reward or punishment), whereas statistical learning occurs incidentally without feedback, instruction, or overt motivation.

Although a few studies have explored the statistical learning of regularities that depend on a latent context or environment (e.g., 18–22), it remains unclear whether individuals can incidentally learn and retrieve context-dependent temporal associations without explicit perceptual context cues, reinforcement, or instruction.
Analogous mechanisms have been proposed in sensorimotor learning, an instance of implicit learning where the brain is thought to infer context shifts and partition experience into distinct memories.23 Here, we test whether people can acquire two distinct sets of temporal associations instantiated with an overlapping pool of visual objects, where most associations are in direct conflict between contexts. For example, in Context A, Object X is followed by Object Y, whereas in Context B, the same Object X is followed by Object Z. Successful learning requires participants to flexibly update their expectations according to the active context inferred from recent sequence history. We examine how well human learners can discover these context-dependent associations without any external context cue (where context is embedded only in the pattern of transitions), using both offline testing and online learning measures.

To explore how these context-dependent representations might emerge from experience, we trained neural network models on the same behavioral task. We then identified the model that best matched human performance across the experimental conditions and analyzed its hidden-layer activations to generate testable hypotheses about analogous representations in the human brain. Deep neural networks have proven effective at capturing lower-level sensory processing,24 and recent perspectives advocate for extending these approaches to the study of higher-order cognition, including the representation of abstract knowledge.25 However, a common limitation of these modeling efforts is that these networks are typically trained on far more data than human learners (see 26 for a review), limiting the validity of direct comparisons.
Additionally, prior modeling work frequently incorporates strong inductive biases that render context artificially explicit, either by feeding an unambiguous context signal into the input22,27 or by augmenting network architecture with designated units or computation modules.28,29 These modifications, while effective, constrain opportunities to observe how context discovery might emerge spontaneously. Inspired by Elman’s finding that simple recurrent neural networks can capture both short- and long-range dependencies30 and echoing recent calls to avoid hard-wiring solutions in cognitive modeling,31 we used minimally structured architectures that omitted context signaling and specialized inference modules. This design allowed us to examine how networks discover and represent latent task structure through sequence exposure alone. Finally, given evidence that weight initialization scale can influence learning trajectories in neural networks,27,32,33 we systematically varied the initial weight magnitudes of the networks to assess how this factor affects their ability to learn and distinguish context-dependent associations.

The goal of the neural network modeling is to generate hypotheses about how the brain might represent context in statistical learning. The hippocampus employs two dominant neural representation strategies to support memory of individual experiences and to extract regularities across experiences34,35: sparse and distributed coding.
Sparse codes, observed in the dentate gyrus and CA2/3 subregions, involve highly selective activation of a small subset of units in response to a given input.36 Distributed representations, observed in the CA1 subregion, encode inputs across overlapping patterns of activity spanning the neural population.37 We specifically seek to find evidence of each of these strategies in the hidden layer activations of neural networks that successfully represent context-dependent associations.

Overall, this study aims to advance our understanding of context-dependent statistical learning by examining whether humans can learn and retrieve multiple conflicting statistical structures within highly overlapping stimulus sets. By manipulating the presence of visual contextual cues, we assess whether explicit signals of context shifts facilitate learning and whether individuals can still learn context-dependent associations in their absence. In parallel, we use neural network models trained on the same task to test whether artificial systems can account for human-like learning dynamics, offering insight into the computational mechanisms that may support flexible, context-sensitive learning in the brain.

Results

Participants performed a context-dependent statistical learning task in which they viewed a continuous stream of 1,600 object images (Fig. 1A). Their only task was to indicate whether an “×” or “+” was embedded on each object (Fig. 1C), a perceptual judgment designed to maintain attention and allow tracking of online learning via reaction times (RTs). Unbeknownst to participants, the image stream was structured into object pairs specific to one of two distinct contexts. Although they were told that parts of the sequence might become familiar over time, they received no information about the underlying structure or the existence of multiple contexts. Each context defined a unique set of temporal associations between a largely overlapping object set, such that the probability of one object following another depended on the active context (Fig. 1B).

Following the learning phase, participants completed a two-alternative forced choice (2AFC) test. Because objects appeared in both contexts, the correct association on a given trial depends on the active context. Accordingly, each test trial began with a six-object sequence composed of three object pairs from a single, consistent context, followed by the first item of a test pair (Fig. 1E). Participants were then tasked with choosing which of two objects should come next (Fig. 1F). Context-independent trials assessed knowledge of the context-independent pair. Context-dependent trials consisted of two types: in direct-conflict trials, the lure was the object paired with the test cue in the other context; in indirect-conflict trials, the lure was an object not paired with the test cue in either context. After each choice, participants rated their confidence on a 1-4 scale (Fig. 1G).

In Experiment 1 (Unsignaled), n = 50 participants completed the task without any explicit perceptual context cue.
In Experiment 2 (Signaled), a separate group of n = 50 participants completed the same task but with a visual context cue: a colored border (white or black) surrounding each object, corresponding to the active context (Fig. 1D); this border was present during both the learning phase and the 2AFC test. Accuracy across the learning phase for the perceptual task was 92.2% for Expt. 1 and 92.7% for Expt. 2, indicating that participants attended to the stimuli during learning.

Figure 1. Experimental overview. (A) Visualization of the learning phase. Participants viewed a uniformly paced sequence of objects separated by brief fixation periods. Each object appeared for 1200 ms with a 450 ms interstimulus interval. The sequence was organized according to the temporal pair structure dictated by one of two contexts (Context A and Context B), which switched every 50 pairs. The orange and green backdrops are shown for illustrative purposes only. Participants performed four blocks of 200 pairs each, separated by short breaks. (B) Sample object assignments to context pair structures comprising 11 unique objects. The context-independent pair is the same for both contexts as shown in the first row, three of the context-dependent pairs consist of the same object set with pair assignment of the second pair position different for each context as shown in rows 2-4, and one context-dependent pair consists of a context-specific object in the second pair position as shown in the last row. (C) Example of object embedded with “+” or “×”.
Participants were tasked with making a button-press response to indicate which symbol each object contained; object-symbol mapping was held constant throughout the experiment. (D) Differentiation of the two experiments: In Expt. 1 (Unsignaled), no context cues were shown and thus context switches were entirely latent (left); in Expt. 2 (Signaled), context was indicated with a white or black border around the object (right). (E) 2AFC test procedure: Example of 6-item (3-pair) sequence leading up to the test cue of a 2AFC trial. (F) Immediately following the test cue, participants chose which of two candidate objects comes next in the sequence. This example is a direct-conflict context-dependent trial, in which the lure corresponds to the object paired with the test cue in the other context. (G) After each choice, participants made a confidence rating.

Behavioral evidence of context-dependent statistical learning

We observed evidence of context-dependent statistical learning with significant 2AFC performance for both contexts (one-sample t-tests, Holm-Bonferroni corrected for three tests, all p < 0.001) (Fig. 2). When considering direct- and indirect-conflict 2AFC trials separately, we found above-chance accuracy for both trial types (all one-sample t-tests p < 0.05; SI Appendix, Fig. S1). A mixed-design ANOVA was conducted to examine the effects of experiment (Unsignaled vs. Signaled context, between-subjects) and context-dependence (context-independent vs. context-dependent trials, within-subjects) on 2AFC accuracy.
There was a significant main effect of context-dependence (F(1, 98) = 17.52, p < 0.001), reflecting higher performance on context-independent than context-dependent trials. The main effect of experiment was not significant (F(1, 98) = 1.93, p = 0.17), nor was the interaction between experiment and context-dependence (F(1, 98) = 3.07, p = 0.08). We used Bayesian estimation to assess equivalence of context-dependent 2AFC accuracy between experiments. The posterior distribution of the mean difference was centered near zero (mean = 0.38%, 95% HDI [-4.1, 4.6]). Approximately 96.5% of the posterior mass fell within the predefined region of practical equivalence (ROPE) of [-5%, 5%], providing evidence that the two experiments yielded equivalent performance. This equivalence suggests that the border cue may have been too subtle to boost context-dependent learning or that explicit contextual cues are unnecessary to foster context-dependent learning beyond the contextual information that can be ascertained from recent sequence history in this paradigm. Additional analyses of confidence ratings and performance on remaining test tasks are reported in SI Appendix, section S1. These analyses reveal that most participants showed no explicit awareness of the temporal pair structure.

Figure 2. 2AFC test performance. Bar height reflects group average 2AFC accuracy (% correct) for Context A questions (left bar, orange), Context B questions (middle bar, green), and context-independent questions (right bar, gray). Note that Context A and Context B correspond to the first and second contexts, respectively, used during the learning phase. Each dot reflects the accuracy for one participant with lines connecting a participant’s performance across the two contexts. Results are plotted separately for Unsignaled and Signaled conditions on the left and right, respectively. Asterisks indicate significant deviation from chance performance (50%; horizontal line). ***p<0.001.

We also found evidence of an online learning effect using participants’ reaction times during the learning phase when they judged whether each object contained an “×” or “+” (Fig. 1C). Because these markers were consistently associated with their corresponding objects across learning, faster responses could reflect memory-based predictions about the identity of upcoming objects, consistent with rapid adaptation to temporal statistics in the sequence. We expected that over the course of learning, knowledge of the temporal pair structure would facilitate faster, anticipatory responses to the second item of each pair than to the first item of each pair, which follows a random transition between pairs. Based on evidence that RTs improve throughout an experiment,38 we measure online learning as the second item RT subtracted from the first item RT, where a positive value indicates an anticipation effect, and a negative value reflects possible interference from context switches. Mean reaction times for each pair position are reported in SI Appendix, Table S1.

Figure 3. Reaction time differences reveal trajectory of online learning. Reaction time (RT) difference between responses to objects in the first (item 1) and second (item 2) pair position. A positive value on the y-axis shows an anticipation effect, plotted for each block during the learning phase (x-axis).
Average RT difference with standard error of the mean (shaded) for context-independent pairs in gray and for context-dependent pairs in blue. Linear trend significance is indicated with the same color scheme: ***p<0.001; **p<0.01; *p<0.05.

For context-independent pairs (Fig. 3; gray), we found a significant linear trend of RT differences across blocks in the Unsignaled experiment (t(49) = 2.70, p = 0.009), suggesting increasing anticipatory learning over time. However, no such trend was observed in the Signaled experiment (t(49) = 0.70, p = 0.49), where RT differences appeared to stabilize after the first block. For context-dependent pairs, both experiments showed a significant linear increase in RT difference across blocks (Unsignaled: t(49) = 4.36, p < 0.001; Signaled: t(49) = 2.59, p = 0.013). However, unlike the context-independent pairs, RTs for the predictable item 2 objects in the Unsignaled experiment were initially slower than for the unpredictable first items (negative RT effect) and approached equivalence by the final block. This slowing earlier in the experiment may reflect interference from frequent context switches: participants had to suppress the prediction under the previously active context, which would be especially demanding during the early blocks of training. This effect is slightly ameliorated in the Signaled experiment, suggesting that participants may have been able to integrate the border contextual cue to facilitate online context-dependent learning. Despite this initial disadvantage for second item responses, the online learning measure increased over time, reaching its highest average in the final block.
A mixed-design ANOVA on the RT difference score with experiment as a between-subjects factor and block as a within-subjects factor showed no main effect of experiment (F(1, 98) = 1.48, p = 0.23).

Neural network weight initialization influences context-dependent learning

Having established that humans can spontaneously learn context-dependent associations from exposure alone, we next turned to a computational account of this behavior using artificial neural network models. Our goals were to test whether these models could similarly discover the task’s latent structure without context cues and, critically, to characterize the nature of the emergent representations that give rise to the context-dependent gating of associative predictions, examining how specific model parameters shape this capacity.

First, we determined that recurrent neural networks with gated recurrent units (GRUs) learned the task more effectively than other network architectures, including feedforward networks and recurrent networks without gated units (see model comparison details in SI Appendix, section S2). Next, we trained GRU models on the same amount of sequence exposure as human participants. Models featured a 150-node hidden layer and were trained to predict the next item in the sequence using one-hot encoded object representation for both inputs and outputs (Fig. 4A). Critically, models received no explicit context information, requiring them to discover the latent structure from sequence statistics to make accurate predictions.
As with humans, learning was evaluated with a 2AFC test. Model weights were frozen after training, and each 2AFC trial presented the model with a series of seven objects (i.e., a three-pair sequence and a test cue). The model then “selected” the next object in the sequence between two options, with its choice determined by the object with the higher predicted probability.

Figure 4. GRU model’s 2AFC performance by weight variance. (A) Visualization of neural network architecture comprising 11 input units, a single GRU layer with 150 units, and 11 output units. (B-D) 2AFC accuracy (y-axis) on context-dependent test trials for GRU models with weights initialized with increasing variance along the x-axis, color-coded by question category. (B) 2AFC performance on Context A (orange), Context B (green), overall context-dependent (blue) and context-independent (red). Chance performance (50%) indicated with gray horizontal line. (C-D) 2AFC performance on individual contexts, visualized for overall as well as direct-conflict (dark coloring) and indirect-conflict (light coloring) trial subsets. Significant one-sample t-tests from chance (Bonferroni-corrected for eight comparisons) indicated with horizontal lines at top of plot color-coded in the same way. Mean human performance on direct-conflict trials of each context indicated by the dashed horizontal black line. (C) Context A. (D) Context B. (E) Absolute difference in direct-conflict 2AFC performance between human group average and model group average for each weight initialization configuration. Bar height reflects the summed absolute difference in direct-conflict 2AFC performance for Context A (dark orange) and Context B (dark green).
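To make the training input and the 2AFC decision rule concrete, the following is a minimal Python sketch of the learning-phase stream (Fig. 1A-B) and of the choice rule described above. It is an illustration under stated assumptions, not the authors' code: the object labels A-K, the function names, the equal number of presentations per pair within each 50-pair run, and the tie-breaking rule are all hypothetical.

```python
import random

# Hypothetical object labels standing in for the 11 object images.
# Row 1: context-independent pair; rows 2-4: same objects, second item
# reassigned across contexts; row 5: context-specific second item.
PAIRS = {
    "A": [("A", "B"), ("C", "F"), ("D", "G"), ("E", "H"), ("I", "J")],
    "B": [("A", "B"), ("C", "G"), ("D", "H"), ("E", "F"), ("I", "K")],
}
OBJECTS = sorted({obj for pairs in PAIRS.values() for pair in pairs for obj in pair})


def make_stream(n_blocks=4, pairs_per_block=200, run_length=50, seed=0):
    """Return a list of (object, context) tuples; context is never shown to the learner."""
    rng = random.Random(seed)
    stream = []
    n_runs = n_blocks * pairs_per_block // run_length
    for run in range(n_runs):
        context = "A" if run % 2 == 0 else "B"  # contexts alternate every run
        # Assumption: each pair occurs equally often within a run, in shuffled order.
        run_pairs = PAIRS[context] * (run_length // len(PAIRS[context]))
        rng.shuffle(run_pairs)
        for first, second in run_pairs:
            stream.append((first, context))
            stream.append((second, context))
    return stream


def one_hot(obj):
    """One-hot vector over the 11 objects, as used for model inputs and targets."""
    return [1.0 if o == obj else 0.0 for o in OBJECTS]


def choose_2afc(predicted_probs, option_a, option_b):
    """Pick the option with the higher predicted probability (ties go to option_a, an assumption)."""
    p_a = predicted_probs[OBJECTS.index(option_a)]
    p_b = predicted_probs[OBJECTS.index(option_b)]
    return option_a if p_a >= p_b else option_b
```

In this labeling, a direct-conflict trial for cue C in Context A would offer F (correct) against G (the Context B partner), so `choose_2afc(probs, "F", "G")` is correct only if the model's output after the three-pair Context A lead-in favors F.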
We systematically varied the bounds of the uniform distribution used to initialize model weights to evaluate whether greater initial weight variance would accelerate convergence, motivated by prior findings that initialization in neural networks can strongly influence learning dynamics.39,40 Low-variance initialization is commonly used as the default in neural networks. However, it remains unclear whether this default choice affects a model’s capacity to learn latent structures in the data. To address this, our manipulation covered a wide range of initialization variances. For each weight variance initialization condition, we trained and tested 50 independent models and report the average performance.

Across weight initialization conditions, models with low to moderate initialized weight variance achieve perfect accuracy on context-independent trials, demonstrating their ability to learn stable, non-contextual associations (Fig. 4B). However, as variance of initial weights increases, performance steadily declines, highlighting how excessive initial weight variance introduces noise, disrupting the model’s ability to extract consistent patterns from the sequence.

The models’ learning of context-dependent associations, where context-specific conflicts must be resolved, reveals more complex dynamics.
The relationship between initialized weight variance and context-dependent accuracy is non-monotonic, with the highest performance demonstrated by models with moderate weight initialization variance in the range 0.4-0.6 (Fig. 4B). Low-variance models (0.08-0.2) demonstrate around 90% accuracy on Context B, the context which the model was most recently processing at the end of training before model weights were frozen, compared to below 60% accuracy for Context A, the context which was previously learned and conflicted with the more recently learned associations. Models in the high-variance range (1.0-1.4) exhibit a steady decline in accuracy for both contexts as initialized weight variance increases, indicating that their representations may be too diffuse or unstable, preventing them from effectively differentiating between contexts. Notably, this non-monotonic relationship between 2AFC accuracy and initialized weight variance was preserved when models were trained using input representations that reflected the perceptual features of the objects (as opposed to one-hot vectors), derived from computer vision models such as AlexNet (SI Appendix, section S3). The same pattern held when the context-dependent pair that included a context-unique second item was excluded, such that all pairs comprised items that appeared in both contexts (SI Appendix, section S4). This result helps rule out the possibility that recent exposure to particular items served as a context cue, strengthening the interpretation that context is inferred from recent sequence history.

Breaking down performance into direct-conflict and indirect-conflict trials reveals notable differences in model learning.
Direct-conflict trials (where the lure object is the correct answer for the other context) are the most diagnostic test of context-dependent learning as they place associations from different contexts in direct competition, making accurate performance dependent on the use of contextual information to disambiguate the correct response. Only models initialized with a weight variance of 0.6 achieved above-chance performance on Context A direct-conflict trials (t(49) = 2.81, p = 0.004), although this came at the expense of reduced, though still significant, accuracy on Context B direct-conflict trials (Fig. 4D). Low-variance models perform near floor on Context A direct-conflict trials, rendering their high overall context-dependent accuracy misleading as it reflects only strong performance on indirect-conflict Context A questions and mastery of Context B. In contrast, high-variance models show no advantage for Context B direct-conflict trials, with performance on direct-conflict questions around chance for both contexts.

To identify which weight initialization variance best approximated human-like behavior, model performance on Context A and Context B direct-conflict trials was compared to human data. Human accuracy was averaged across the Unsignaled and Signaled experiments as no significant difference was observed between them. Fig. 4E visualizes the absolute difference in 2AFC performance between models and humans on direct-conflict trials for both contexts (human performance indicated by the dashed black line in Fig. 4C, 4D). Initialized weight variance of 0.6 produced the clearest match to human performance, forming an elbow in the plot and achieving above-chance accuracy across all question sets.
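The model-human comparison summarized in Fig. 4E can be sketched as a summed absolute difference in direct-conflict accuracy over the two contexts. The Python snippet below is only an illustration of that quantity: the function name, the dictionary layout, the model labels, and every number in it are placeholders chosen for the example and are not values reported in this paper.

```python
def direct_conflict_mismatch(model_acc, human_acc):
    """Summed absolute model-human difference in direct-conflict accuracy across contexts."""
    return sum(abs(model_acc[ctx] - human_acc[ctx]) for ctx in ("A", "B"))


# Placeholder accuracies for illustration only; NOT results from the paper.
human = {"A": 0.60, "B": 0.70}
models = {
    "model_x": {"A": 0.10, "B": 0.90},  # strong recency bias toward Context B
    "model_y": {"A": 0.55, "B": 0.65},  # balanced across contexts
}

scores = {name: direct_conflict_mismatch(acc, human) for name, acc in models.items()}
closest = min(scores, key=scores.get)  # configuration with the smallest mismatch
```

A lower score means the configuration is closer to the human group average on the most diagnostic trials; note that the text above additionally requires above-chance accuracy across question sets, which this score alone does not capture.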
Distributed hidden layer representation strategy facilitates context-dependent learning

Having observed that models with initialized weight variance in the moderate range (such as 0.6) successfully learned context-dependent associations without any explicit context signal as input, we next examined how context information is encoded in the hidden layer activations of the models. Context encoding could manifest as either a sparse representation (carried by a few units) or a distributed representation (spread across many units). Therefore, we quantified two complementary properties of the activations: the extent to which context sensitivity was localized to a small subset of units (akin to individual “context cell” neurons that code for which context is currently active), and the degree to which the currently active context was expressed as a distinctive pattern of activity across many units. These analyses were conceptually motivated by prior work distinguishing sparse and distributed coding strategies in hippocampal and connectionist models.35,41 The analysis was conducted using the hidden-layer activations during the final block of training.

The sparse representation index measures the proportion of hidden layer units that do not show significant activation differences between contexts. A higher sparse representation index therefore indicates that fewer units selectively encode a specific context, while a larger proportion of units are context-insensitive (Fig. 5A).
This measure of context sensitivity for each unit was calculated with a one-way ANOVA comparing activations during exposure to Context A versus Context B in the final quarter of training (block 4). Fewer nodes with significant context sensitivity indicate that a limited number of hidden-layer units support the context-specific representations.

The distributed representation index was derived from a representational similarity analysis of hidden layer activations for the first item of each pair, the item that carries the context-dependent association. The index compares the geometric distance (dissimilarity) of these representations within a context versus across contexts, with normalization based on within-context consistency so that more stable representations are given greater influence (Fig. 5A, right plot). Higher values indicate that distinct context representations are distributed across the hidden layer.

We found that the low-variance models exhibit the sparsest representations, and the moderate-variance models exhibit the most distributed representations (Fig. 5A, middle plot). High-variance models do not show strong evidence of either representation strategy. The 0.6-initialized model, which was the only model to demonstrate significant direct-conflict 2AFC accuracy for both contexts (Fig. 5B-C), exhibited the strongest evidence for the distributed over the sparse representation index.

Figure 5. Neural network hidden layer task representation strategies. (A) Visualization of computation of sparse representation index (left) and distributed representation index (right) plotted for each weight variance configuration (middle; x-axis) in light green and blue, respectively.
(B) Significance of beta coefficients (y-axis) for multivariate regression analyses using sparse (green) and distributed (blue) representation indices to predict each 2AFC question category (x-axis). ***p<0.001; **p<0.01; *p<0.05. (C-D): Lesion analysis results. (C) 2AFC accuracy of Context A (left) and Context B (right) questions (y-axis) as an increasing number of hidden layer nodes are lesioned in descending rank order of context sensitivity (x-axis) for each weight initialization configuration (rainbow coloring). (D) 2AFC accuracy performance difference from no intervention for lesion analysis. (E) Context switch latency (y-axis) visualized across learning phase blocks (x-axis) for each weight initialization configuration (rainbow coloring). Failure to reflect a context switch is noted with a value of 51 (i.e., greater than the duration of a context exposure).

To understand how these representation strategies supported learning, we regressed 2AFC accuracy on the z-scored sparse and distributed representation indices (averaged across the 50 models initialized for each weight configuration; Fig. 5B). Including both predictors in the same model allows us to evaluate their unique contributions to 2AFC task performance. Both indices significantly predicted context-independent accuracy (Sparse: β = 0.080, p = 0.016; Distributed: β = 0.081, p = 0.015).
For context-dependent learning, the sparse representation index predicted performance for Context B (β = 0.17, p < 0.001) but not Context A (β = 0.007, p = 0.71), consistent with its prominence in the low-variance models that disproportionately learned Context B. In contrast, the distributed representation index predicted accuracy for both Context A (β = 0.069, p = 0.012) and Context B (β = 0.11, p < 0.001), reinforcing that this strategy more effectively supports context-dependent learning, where successful learning requires retaining knowledge of both contexts.

Efficient context switching facilitates expression of context-dependent knowledge

We next carried out a lesion simulation analysis to understand how the moderate-variance models provide a better account of human behavior than the low-variance models, whose initialization is often used as the default in neural networks. For each model, all 150 hidden layer units were ranked according to their context sensitivity index (the F-statistic of activity difference between Context A and Context B). We then progressively lesioned the most context-sensitive units by setting their activations to zero and re-evaluated 2AFC accuracy after each lesion step.

The moderate-variance models show a steady decline in both Context A and Context B accuracy as more nodes are lesioned (Fig. 5C). This result further indicates a distributed representation strategy where many units contribute uniquely to the representation of current context. In contrast, the low-variance models show evidence of a redundant coding strategy: accuracy, particularly for Context B, remains largely unchanged until around half of the hidden layer is lesioned (Fig. 5C).
This delayed performance decline complements the earlier finding of a sparse representation strategy, in which very few nodes showed significant context sensitivity, suggesting that most units carried only shallow, overlapping context signals. Then, when Context B performance began to decline, Context A performance actually increased (Fig. 5D), with performance eventually reaching a level comparable to the moderate-weight models (Fig. 5C). This result indicates that the Context A representations are present in the knowledge base of the network. However, the context knowledge is not accessible for the 2AFC testing task, given that the low-variance models show excellent Context B accuracy (the context on which they were more recently trained) but extremely poor Context A accuracy (the previous context) (Fig. 4A).

To explain the discrepancy in 2AFC accuracy in light of this evidence of Context A knowledge preserved in both low- and moderate-variance models, we examined how efficiently models adapted to context switches. We derived a context switch latency metric operationalized as the number of pairs the model processed after a context switch before perfectly predicting the paired associate of all remaining pairs in each 50-pair context exposure. We found that, by the final block of training, the moderate-variance models exhibited faster switch latencies whereas the low-variance models adapted more slowly (Fig. 5E). This inefficiency likely prevented the low-variance models from shifting away from their end-of-training state during the brief six-item exposure provided in each 2AFC trial, leaving them biased toward Context B despite evidence of retaining earlier Context A associations. Overall, this evidence indicates that the moderate-variance models achieve the best context-dependent learning because they retain associative knowledge across contexts and quickly adapt to context switches with support from a distributed representation of context across the hidden layer.
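The context switch latency metric described above can be sketched in a few lines. This is a minimal illustration, assuming that each pair in a context exposure has already been scored as correctly or incorrectly predicted; how that per-pair correctness is derived from model output is not specified here, and the function name `context_switch_latency` is hypothetical. The failure value of 51 follows the Figure 5 caption.

```python
def context_switch_latency(correct, failure_value=51):
    """Pairs processed after a switch before all remaining pairs are predicted correctly.

    `correct` holds one boolean per pair in a context exposure, in presentation order.
    """
    # Walk backwards to find where the final unbroken run of correct predictions starts.
    start = len(correct)
    while start > 0 and correct[start - 1]:
        start -= 1
    # No trailing run of correct predictions: the switch was never reflected.
    if start == len(correct):
        return failure_value
    return start
```

For example, an exposure in which the first three pairs are mispredicted and the remaining 47 are correct yields a latency of 3, while an exposure that ends on a misprediction yields the failure value.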

Discussion

Given that statistical learning enables associative strengths to be incrementally updated over many exposures, it has been unclear whether it affords sufficient flexibility to adapt to changing associative contingencies in different contexts. The present results extend our understanding of when incidental learning of temporal regularities is possible by demonstrating context-dependent statistical learning under circumstances where contexts dynamically alternate, associations directly conflict, and no explicit instructions are provided. We found evidence for this in above-chance performance on the final 2AFC test as well as progressive RT speeding for predictable objects compared to unpredictable objects over the course of learning, which suggests that implicit learning mechanisms facilitated anticipatory behavior.42

Notably, explicitly signaling context with a colored border (Expt. 2) did not enhance context-dependent learning compared to when context was fully latent (Expt. 1). This may reflect a greater influence of local temporal context (e.g., recent sequence history, which was available during both learning and retrieval) than of environmental cues, or a disruption of implicit learning mechanisms when the cue promotes a more deliberate strategy. Indeed, recent work suggests that states of reduced executive control, such as mind wandering, can enhance statistical learning relative to focused on-task states,43 implying that exogenously focusing attention via explicit cues may be counterproductive for this type of incidental learning. However, given that the border cue changed color only every few minutes and its relevance was not explicitly conveyed, it is also possible that this visual cueing of context may have been too subtle to provide a performance advantage; a more salient context signal might have produced different effects.

Prior efforts to demonstrate context-dependent statistical learning with auditory stimuli have been unable to find learning of both contexts unless participants were provided with explicit instructions or salient context cues.18,19 Our success in the visual perceptual domain supports accounts suggesting that statistical learning mechanisms may be modality-specific rather than fully domain-general,44 with visual statistical learning potentially more robust to context-based interference under implicit learning conditions. Siegelman et al.20 reported some evidence of context-dependent learning in the visual domain using associative structures built from an overlapping set of stimuli. However, their paradigm involved a single consecutive exposure to each of the two contexts rather than repeated interleaved context switching, and it also used self-paced stimulus sequence exposure and explicit instructions to look for patterns in which shapes tended to follow each other.
These design choices differ from those of the present study, which used shorter, fixed-duration stimulus presentations to minimize opportunities for strategic encoding and to support the passive, implicit learning that is thought to characterize statistical learning.45

The present findings build on prior work on second-order conditional (SOC) sequence learning, which has demonstrated that learners can extract higher-order temporal dependencies in which predictability depends on combinations of preceding elements rather than simple pairwise transitions.42,46 Recent work further suggests that exposure to SOC structure can shape subsequent performance and subjective sensitivity to sequence regularities even when explicit knowledge is limited.47 Although the surface structure of these tasks differs from the present paradigm, both lines of work underscore how context-sensitive behavior can emerge from the integration of temporal regularities over experience, without requiring explicit contextual signals.

We used a neural network modeling approach to inform hypotheses of how the human brain might support such learning. These models were optimized to predict the next object in the sequence. While our human participants engaged in a cover task requiring simple ×/+ perceptual judgments, we assume that they were implicitly forming predictions about upcoming stimuli. Therefore, the models’ predictive framework captures a core computational goal that the human learners pursue implicitly: anticipating future input based on recent experiences.1

The GRU’s gating architecture may support its successful context-dependent learning by enabling the model to manage conflicting associations by retaining relevant information while filtering out noise. This computational function parallels how the human brain manages interference between new and old memories.48 Although GRU models are not intended as models of biological mechanisms, the update and reset gates bear resemblance to the dynamic interplay between the hippocampus and neocortex that supports stability for long-term memory storage34 as well as to neuromodulatory systems where prediction error signals (i.e., dopamine release) prompt a reassessment of context and a switch in behavioral strategy.49 Such mechanisms have been hypothesized to facilitate the segmentation of continuous experience and the recalibration of predictions,50 which may be relevant for context-dependent temporal associative learning and may motivate hypotheses for future work examining parallels between biological systems and computational models.

Prior neural network modeling of context-dependent learning has either imbued neural networks with specialized architecture to facilitate latent cause inference28,29 or explicitly provided unambiguous context information in model input.22,27 For example, Smith and colleagues51 effectively demonstrated that recurrent networks can track temporal structure across multiple timescales within explicitly signaled contexts in a statistical learning paradigm instantiated as games that share response choices. However, these studies bypass the question of how a sense of context might emerge organically from exposure alone to disambiguate overlapping task structure.
Additionally, they introduce assumptions that are arguably biologically implausible, such as constant context monitoring and perfectly reliable context cues.5 Here, we more directly focus on latent context discovery by exploring how weight initialization affects learning dynamics. Since network weights are adjusted throughout training to minimize loss, their initial configuration acts as a key driver of convergence.52 Prior work suggests that higher initial weight magnitudes bias models toward “lazy” solutions, involving rapid solution convergence with unstructured representations, while smaller magnitudes support “rich” solutions that exhibit more structured learning albeit at a slower pace.27,33,40

Indeed, increasing the variance of the uniform distribution used to initialize model weights to a moderate range facilitated successful context-dependent 2AFC performance. This improvement was associated with a high-dimensional, distributed code in the hidden layer that significantly predicted 2AFC accuracy in both contexts. This is consistent with studies suggesting that high-dimensional codes afforded by mixed selectivity in prefrontal cortex neurons allow for more flexibility and rapid adaptation to new tasks.41,53 The successful distributed context coding strategy, in which identical model input is represented differently when processed in different contexts, is consistent with reports of the hippocampus integrating contextual information into stimulus representations.54,55 Furthermore, the hippocampus supports the rapid learning of temporal associations.37,56 Taken together, these parallels suggest that the moderate-variance GRU models are capturing both higher-level contextual encoding and lower-level temporal associations, consistent with core functions of the hippocampus.

The variance of weight initialization may be interpreted as shaping the GRU’s inductive bias: the assumptions the model makes about the structure of the environment, particularly regarding the presence and separability of underlying contexts. Low initial weight variance appeared to bias the model towards rigid representations that emphasized recently experienced associations and failed to recover earlier learned patterns following context shifts. On the other end, high initial weight variance produced overly flexible representations that failed to consolidate stable structure. Our analyses suggest an optimal intermediate range of initial weight magnitudes, where models were sufficiently flexible to distinguish between contexts yet structured enough to preserve associations within each context and avoid catastrophic interference. Accordingly, these effects are best understood as emergent inductive biases shaped by properties of training dynamics and initialization, which may provide insight into how learning systems come to represent and segregate latent contexts, whether biological or artificial. Future work could assess the extent to which similar learning dynamics arise across architectures and task demands and whether manipulating network hyperparameters, such as learning rate and number of hidden layers, consistently shapes the balance between plasticity and stability in context-dependent learning.

To better understand why the moderate variance models succeeded, it is informative to examine the limitations of the low variance models.
These models performed poorly on Context A test trials, a pattern that might initially suggest catastrophic interference – that previously learned associations of Context A were overwritten by more recent Context B experience. However, the lesion analysis revealed that Context A knowledge remained in the networks but was not accessible until over half the hidden layer was removed. One likely explanation for this inaccessibility is the slower context switching in low variance models: compared to the moderate variance models, which successfully expressed knowledge of both contexts, the low variance models were slower to accommodate context switches. As a result, the brief context exposure sequences preceding each 2AFC decision may not have provided sufficient evidence to pull them out of their Context B-oriented state at test, which persisted simply because Context B was the last context encountered during training. This phenomenon parallels findings from the fear extinction literature, where extinguished fear responses can re-emerge in a different context, indicating that underlying knowledge is retained but not manifested in behavior when irrelevant to the current setting.14

Another key limitation of the low variance models that emerged from the lesion results was a constraint on how knowledge was represented in the hidden layer units. Before Context A performance recovered, these models showed little to no change in 2AFC accuracy for either context until roughly half the hidden layer was lesioned, in contrast to the steady performance decline observed in moderate and high variance models. This suggests highly redundant coding within the hidden layer.
Redundant neural coding is theorized to enhance robustness in noisy environments by duplicating information across neural populations53,57,58 – a potentially advantageous feature for the present task, where many associations directly conflict and half of the training samples are unreliable (e.g., between-pair transitions). Such redundancy could plausibly account for why the low variance models achieved the strongest accuracy on Context B. However, although this redundant coding strategy may help stabilize performance within a single context amidst overall environmental instability, it ultimately proved ineffective because it limited the rapid adaptability needed to operate in a dynamic environment with multiple context-dependent structures, resulting in a failure to express knowledge of both contexts.

Mirroring the diversity of these computational profiles, humans also exhibited considerable variability. Although performance on context-dependent trials was significantly above chance at the group level, some participants exhibited little or no learning (akin to the high-weight models) while others showed stronger learning of one of the contexts (similar to the low-weight models). Just as some GRU models required more exposure to learn both sets of associations, certain individuals may also need more input to reach stable learning. A promising future direction is to identify model parameters that reflect these individual differences and predict how quickly a learner converges on context-dependent associations, potentially linking such parameters to developmental changes in learning efficiency.59

Taken together, our findings demonstrate that humans can spontaneously resolve conflicting, context-dependent associations from passive exposure alone – even in the absence of explicit instructions, self-pacing, feedback, or contextual cues.
The finding that explicit signaling offered no advantage over entirely latent context exposure further highlights the robustness of this incidental learning mechanism, suggesting that temporal statistics alone are sufficient to drive contextual inference. Our neural network modeling provides a mechanistic account for this capacity, showing that successful adaptation relies on the emergence of distributed representations that are influenced by weight initialization parameters, which we believe are a reasonable proxy for humans’ inductive biases. This representational strategy not only maintains information from multiple contexts, even when the associative structure directly conflicts across contexts, but also quickly accommodates context changes. These findings suggest that the human brain may rely on similar mechanisms to flexibly manage latent contextual shifts and support adaptive prediction in dynamic environments.

Materials and methods

Participants

Participants were recruited via the UCLA Psychology Department subject pool and completed the experiment in-person for course credit. All participants provided informed consent in accordance with protocols approved by the UCLA Institutional Review Board (IRB#22-001719). Inclusion criteria (age 18-40 years, native English speaker, and normal or corrected-to-normal vision with contacts but not glasses) were confirmed before commencing data collection. Our goal was to obtain useable data from 50 participants for each of the two experiments (Expt. 1: Unsignaled context; Expt. 2: Signaled context), so recruitment continued until 50 participants per experiment met our data quality thresholds of 90% of trials responded to and 85% accuracy on trials during the learning phase. These inclusion criteria were enforced to ensure that data analyses focused on participants who were engaged during the learning phase of the experiment. Our final sample included 50 participants for Expt. 1 (33 F / 17 M; mean age = 20.5 years) and 50 different participants for Expt. 2 (40 F / 8 M / 2 Non-Binary; mean age = 20.0 years).

Materials

The experiment was coded and run with PsychoPy version 2024.2.4 60 on a Mac Mini. Stimuli were displayed on a DELL P2422HE monitor with 1920 by 1080 pixel resolution and screen size of 23.8 inches, which participants viewed from a fixed distance with their head stabilized with a forehead and chin rest. An EyeLink 1000 eye tracker (SR Research) captured gaze location while participants completed the experiment, but eye tracking data are not reported here. Experiment stimuli were drawn from a set of objects created using Blender 2.48.61,62 The stimuli were visually distinct in terms of shape and color and were novel to participants. Images were resized to be 350 pixels wide. A small “×” or “+” symbol was subtly embedded onto each object using slight color contrast such that the mark was visible but did not obstruct recognition of object shape.

Learning phase

In the first phase of the experiment, participants were exposed to a sequence of objects presented individually. The objects were presented in four different locations on the screen with a width of 350 pixels and centered 300 pixels above, below, right, and left of the center of a gray screen. At each object presentation, the three positions not occupied by the current object were filled with phase-scrambled versions of other objects cropped into circles with diameter of 300 pixels (visualized in SI Appendix, Figure S2). The experimental manipulation of object location was included to enable potential analyses of spatial location-based learning as indexed by anticipatory eye movements. However, because the eye tracking data did not yield clear or interpretable effects, we focus all analyses on object identity and omit spatial position from further consideration, as well as from the task depiction in Fig. 1.
Before beginning the learning phase, participants were instructed that parts of the sequence might become familiar over time and that they would later be asked questions about the objects they had seen.

Unbeknownst to the participants, the objects were organized into two sets (or contexts) of 5 pairs of objects. The same object set was used for all participants, but objects were randomly assigned to pair positions, and each object maintained either first-of-pair (item 1) or second-of-pair (item 2) position in the pair across contexts. One pair was context-independent, meaning the same two objects were paired in both contexts. The other four pairs were context-dependent. Three of these pairs consisted of the same set of six objects across both contexts, but the second item associated with each first item was dependent on context. For example, Object X is paired with Object Y in Context A but with Object Z in Context B. The last context-dependent pair shared the same first object across both contexts but the paired second object was specific to each context (i.e., it appeared only in that context and not the other). In this way, these four context-dependent pairs shared a first item across contexts but the second item of each pair was dependent on context. In total, a set of 11 unique items was used to instantiate the five pairs in each context.

Throughout this learning phase, participants were tasked with responding to whether the object onscreen was marked with an “×” or “+”.
Therefore, reaction time to the perceptual question could be evaluated as an online measure of pair structure learning. Objects were presented for 1200ms with a 450ms interstimulus interval.

The two experiments differed with respect to whether context was Unsignaled (Expt. 1) or Signaled (Expt. 2) with a border around the objects that was white or black depending on the context.

Two-alternative forced choice (2AFC) task

In the first of three test tasks immediately following this learning phase, participants completed a two-alternative forced choice (2AFC) task. Because the object associations were dependent on active context for all but the one context-independent pair, on each 2AFC trial participants were presented with a sequence of seven objects (consisting of three pairs from one of the contexts and the first item of the test pair) before being presented with two side-by-side alternatives as to which object they think should come next (one was the correct paired associate of the test pair and the other was a lure). Objects in the sequence were presented with the same timing as during the learning phase, and participants were given unlimited time to make a choice between target and lure. Participants completed a total of 54 questions: 6 of these questions evaluated the context-independent pair, while 48 questions evaluated context-dependent associations. The 48 questions probing context-dependent associations could either feature a lure object that was the correct paired associate in the other context (direct-conflict; 16 questions), or a lure that was any other item (indirect-conflict; 32 questions). The “×” and “+” markings were removed from the objects to make clear that participants no longer were required to respond to the perceptual question.
For the Unsignaled experiment, no explicit context cues 688 were provided; for the Signaled experiment, the border around the objects was colored white or 689 black on each trial to cue contexts. After making each 2AFC judgment, participants were 690 prompted to rate their confidence in their decision from 1-4. 691 692 Structure knowledge probe 693 After completing all 2AFC trials, participants were prompted to answer some questions about 694 what they learned during the experiment. First, they were asked to respond yes or no to whether 695 they observed any predictable patterns in the experiment. Second, they were asked to describe 696 any patterns they observed in the sequence. Third, they were asked to describe any rules that 697 governed which object would come next in the sequence. The idea was to progressively prompt 698 participants to indicate any knowledge of the pair structure underlying the sequence they 699 observed that were increasingly straightforward to get an idea of how much knowledge was 700 explicit. 701 702 Pair reconstruction task 703 The last task allowed participants to demonstrate explicit knowledge of pairs. Participants were 704 presented with a bank of all 11 objects at the top of the screen and provided with 20 sets of two 705 empty squares side-by-side presented in 4 rows of 5 columns. Participants were instructed to 706 organize the objects into related pairs by placing one object in each square of a pair, with the left 707 and right positions corresponding to the first and second items in the pair. Participants were told 708 that each item could be used more than once and that they did not have to fill out all of the pairs. 709 710 Data cleaning 711 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted March 18, 2026. 
Data exclusion criteria were enforced to ensure that participants were engaging with the experiment during the learning phase. Two criteria were enforced: a response rate of more than 90% and accuracy of more than 85% on all responses throughout the learning phase. Data collection continued until 50 usable participants were collected for each experiment.

Neural network architecture
Recurrent neural network models with gated recurrent units were implemented with PyTorch v2.0.1 (ref. 63). Such models have previously been used to explore context-dependent associative learning from sequences (refs. 22, 27). Each model had the same architecture: an 11-dimensional input layer (equal to the number of objects included in the study), a hidden layer of 150 nodes (GRU performance with different hidden layer sizes is presented in SI Appendix, section S5), and an 11-dimensional output layer, again matching the dimensionality of the one-hot object vectors. The learning rate was held constant at 0.001, and model weights were updated after each training sample using the Adam optimizer with cross-entropy loss. Default parameters were used unless otherwise noted.

Training
A unique 1600-object sequence was generated for each model in the same way as for human participants. Each neural network received one object at a time and was trained to predict the identity of the next object in the sequence. Although the sequence was constructed using embedded object pairs, models received no information about this underlying structure. That is, the model made predictions at every time step (1599 samples for the 1600-object sequence) and had no awareness of pair boundaries.
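As an illustration of the sample construction just described, the following minimal Python sketch builds next-item training samples from a sequence of object indices. It is not the authors' code: the pair list, the random seed, and the helper names `one_hot` and `next_item_samples` are assumptions introduced here, and the toy sequence ignores the context structure of the real sequences.

```python
import random

N_OBJECTS = 11  # number of distinct objects, as stated in the Methods


def one_hot(index, size=N_OBJECTS):
    """Return a one-hot list with a 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def next_item_samples(sequence):
    """Pair each object with its successor: (one-hot input, target index)."""
    return [(one_hot(cur), nxt) for cur, nxt in zip(sequence, sequence[1:])]


# Hypothetical pair list, for illustration only: object indices 0-9 form five
# pairs. The study's context-specific pairings are not reproduced here.
toy_pairs = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]

rng = random.Random(0)  # arbitrary seed, an assumption of this sketch
sequence = []
for _ in range(800):  # 800 pairs give a 1600-object sequence
    sequence.extend(rng.choice(toy_pairs))

samples = next_item_samples(sequence)
print(len(sequence), len(samples))  # 1600 objects yield 1599 samples
```

In such a sequence, only the samples whose input is the first item of a pair have a predictable target; the remaining samples cross a pair boundary.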
The same sequence was used for all epochs of training, with the hidden state reset at the start of each epoch and between blocks (every 400 samples) in recurrent models to emulate the breaks taken by human participants.

2AFC task
After each epoch of learning, model weights were frozen, and the models were evaluated using a 2AFC test designed to mirror the testing procedure of the human participants. A unique set of 2AFC test questions was generated for each model in the same way as for human participants. Before each trial, hidden layer activity was reset to zero. Then, a sequence of three pairs from one of the contexts was presented as the hidden state evolved, allowing the model to infer the active context based on the sequence. Finally, the first object of the test pair was inputted, and the model’s prediction of the ensuing item was evaluated. Accuracy was determined by whether the probability assigned to the correct paired associate was higher than that assigned to the lure. In most analyses, accuracy is evaluated separately for the context-independent, indirect-conflict context-dependent, and direct-conflict context-dependent question sets to capture how well the models handle conflicting information across contexts and maintain knowledge of stable, context-independent relationships.

Single epoch analyses
We tested the GRU’s ability to learn the task as the variance of the uniform distribution used to initialize the hidden layer’s weights was increased. The uniform distribution was centered at zero with positive and negative bounds of 0.08 (default for PyTorch with 150 nodes), 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, and 1.4. Fifty independent GRU models with different random weight initializations were trained and tested, and learning measures were averaged across these models to ensure robust performance estimates for each weight initialization category.
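The relation between the initialization bounds above and the variance of the initial weights can be made concrete with a short worked example. The sketch below is illustrative rather than the authors' code; it relies on PyTorch's documented default GRU initialization, U(-1/sqrt(hidden_size), 1/sqrt(hidden_size)), and on the variance of a uniform distribution on [-b, b], which is b²/3. The function name `uniform_variance` is an assumption of this sketch.

```python
import math

HIDDEN_SIZE = 150  # hidden layer width stated in the Methods

# PyTorch's documented default for GRU parameters is U(-sqrt(k), sqrt(k))
# with k = 1 / hidden_size, which gives the ~0.08 bound quoted above.
default_bound = 1.0 / math.sqrt(HIDDEN_SIZE)

# Bounds listed in the Methods; 0.08 is the rounded default.
bounds = [0.08, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4]


def uniform_variance(bound):
    """Variance of a uniform distribution on [-bound, bound]: bound**2 / 3."""
    return bound ** 2 / 3.0


print(f"default bound for {HIDDEN_SIZE} units: {default_bound:.4f}")
for b in bounds:
    print(f"bound {b:<4} -> variance {uniform_variance(b):.4f}")
```

Because the variance scales with the square of the bound, moving from the default bound to 1.4 increases the initial weight variance by roughly a factor of 300.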
Learning trajectory analysis: Context switch latency
To assess how quickly the neural network models adapted to a context change, we developed a switch latency measure. We devised a stringent operationalization of switch latency as the number of first-of-pair (item 1) items (i.e., the items whose model output captures the within-pair transition prediction) a model processed after a context switch before achieving perfect accuracy on all remaining item 1 samples in that 50-pair context exposure. Because only the model outputs of item 1 training samples are predictable and thus learnable, they served as a measure of adaptation to a new context. Switch latency was calculated for all 16 context exposures (8 per context) and averaged within each block (4 context exposures), yielding a single switch latency value per block. We averaged this measure across all 50 trained models for each weight initialization condition.

Hidden layer analyses: Context representation strategies
To understand how context information was represented across the hidden layer, we quantified two complementary properties of the activations: the extent to which context sensitivity was localized to a small subset of units (akin to individual “context cell” neurons that code for which context is currently active), and the degree to which the currently active context was expressed as a distinctive pattern of activity across many units.
These analyses were conceptually motivated by prior work distinguishing sparse and distributed coding strategies in hippocampal and connectionist models.35,41 Our goal was to determine whether these representational properties could explain 2AFC task performance across all individual GRU model instances of the weight variance configurations.

A sparse representation is one in which context sensitivity is confined to a relatively small subset of hidden layer units, while the vast majority remain inactive or insensitive. To investigate sparse context representations in the GRU’s hidden layer, we first used a one-way ANOVA to estimate the difference in activation when processing inputs from Context A and Context B during the final quarter of training (block 4) for each of the 150 hidden layer nodes. We then counted the number of nodes that showed a significant activation difference. We applied a Bonferroni correction within the analysis of each model to control for Type I errors across the 150 comparisons performed. The corrected significance threshold was computed by dividing the original alpha level (0.05) by the number of comparisons (150), yielding an adjusted significance level of p < 0.00033. Based on this threshold, we determined that Fcrit(1,398) = 12.75 and calculated the sparse representation index as the proportion of hidden layer nodes that did not show a significant difference in activation between contexts, such that a larger value reflects a sparser context representation.

A distributed representation was computed using a representational similarity analysis (RSA; 64) focused on the hidden layer activations after processing the first item of each pair (capturing the context-dependent prediction) during the final quarter of training (block 4). This included a total of 200 hidden state samples (100 per context).
These activations were divided into two split-halves, each containing 10 samples for each of the five pairs per context. The samples in each split-half were evenly divided into those drawn from the first half of a context exposure and those from the second half, controlling for any strengthening of context representation over time. We averaged the hidden state activation within each node for each object within each context. We then computed the Pearson correlation coefficient for all pairwise comparisons of objects within and across contexts, producing an RSA matrix. This matrix compared the split-half object representations, with one half plotted along the x-axis and the other along the y-axis, and there were 10 cells along each axis, one for each of the five pairs viewed in each context. The upper-left and lower-right quadrants contained correlations between the five pairs from the same context. The upper-right and lower-left quadrants contained correlations between the same objects when viewed in opposing contexts. To quantify the distributed representation, we calculated the difference between the average within-context and between-context correlations for each object, normalized by a denominator of one minus the average within-context correlation. This normalization penalized models with lower within-context stability, because the larger denominator resulting from lower within-context similarity would decrease the overall distributed representation index, ensuring that observed differences in between-context representation were not artifacts of noisy or unstable object representations.

Hidden layer analyses: Lesion analysis
To further understand how the GRU model’s hidden layer representations supported successful context-dependent learning, we conducted an intervention analysis. For all weight configurations, we trained a new set of 50 models with the same single-epoch training procedure and then systematically tested their performance on the 2AFC task while “lesioning” (zeroing out) subsets of hidden layer nodes. Importantly, these nodes were active during training on the 1600-object sequence, and the lesioning intervention was applied only immediately prior to the 2AFC testing phase.

We first calculated the absolute context sensitivity of each hidden layer node using a one-way ANOVA of activations between Context A and Context B during the final quarter of training (block 4), in the same way as for the sparse representation index. Nodes were then ranked from the most context-sensitive (largest F-statistic) to the least sensitive. During the 2AFC task, subsets of hidden layer nodes were progressively lesioned, beginning with the most context-sensitive nodes. We evaluated models under the following lesioning conditions: 0 (no nodes lesioned, to obtain a baseline performance estimate), 1, 5, 10, 25, 50, 75, 100, 125, 130, 135, 140, 145, and 150 (all nodes lesioned, with the expectation of chance performance). We report the average 2AFC performance on Context A, Context B, and context-independent question sets, expressed as both obtained accuracy and the change in performance relative to the no-lesion baseline (i.e., when no intervention is applied).

Statistical Analysis

2AFC task
2AFC task performance was assessed by evaluating accuracy and average confidence rating on subsets of 2AFC questions.
Group-level accuracy was tested against 50% chance using a one-sample Student’s t-test with Holm-Bonferroni correction applied for three comparisons (Context A, Context B, and context-independent trials) within each experiment.

We conducted a mixed-design ANOVA to examine the effects of experiment (between-subjects factor: Expt. 1 versus Expt. 2) and context-dependence (within-subject factor: context-dependent versus context-independent trials) on 2AFC accuracy using the Python pingouin package. To assess whether context-dependent 2AFC accuracy was statistically equivalent between experiments, we used Bayesian estimation with a region of practical equivalence (ROPE) approach. We computed the posterior distribution of the mean difference in accuracy between Expt. 1 and Expt. 2 and quantified the proportion of the posterior mass falling within a predefined ROPE of [-5%, 5%]. This ROPE was selected to reflect the smallest effect size of interest, consistent with typical variability in task accuracy in the statistical learning literature. Posterior distributions were estimated using the PyMC package.

Online learning assessment
We quantified online learning using participants’ RTs during the learning phase, in which they indicated whether each object contained an “×” or “+”. For each block, we computed an anticipation score as the average RT to the second item of each pair subtracted from the average RT to the first item of each pair. This metric captured facilitation for predictable second items while controlling for overall RT drift throughout the session, as first items follow unpredictable transitions. Positive values indicate faster responses to second items relative to first items.
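The anticipation score can be illustrated with the group-mean RTs reported in Table S1 for context-dependent pair objects. The sketch below is a simplified illustration, not the authors' analysis: the study computed the score per participant, whereas this example uses group means only, and the function names are assumptions. It also applies the linear trend weights [-3, -1, 1, 3] that the Methods use to summarize change across blocks.

```python
# Group-mean RTs (ms) for context-dependent pair objects, copied from
# Table S1. Each list runs over blocks 1-4.
rt_means = {
    "Expt 1 (Unsignaled)": {"item1": [712, 668, 650, 639],
                            "item2": [727, 679, 655, 639]},
    "Expt 2 (Signaled)": {"item1": [710, 662, 648, 631],
                          "item2": [713, 664, 643, 626]},
}

# Linear trend weights that the Methods apply to the blockwise scores.
CONTRAST_WEIGHTS = [-3, -1, 1, 3]


def anticipation_scores(item1, item2):
    """Blockwise item 1 RT minus item 2 RT; positive = faster second items."""
    return [first - second for first, second in zip(item1, item2)]


def linear_contrast(scores, weights=CONTRAST_WEIGHTS):
    """Weighted sum of blockwise scores; positive = score rises over blocks."""
    return sum(w * s for w, s in zip(weights, scores))


for label, rts in rt_means.items():
    scores = anticipation_scores(rts["item1"], rts["item2"])
    print(label, scores, linear_contrast(scores))
```

With these group means, the blockwise scores are [-15, -11, -5, 0] ms for Expt. 1 and [-3, -2, 5, 5] ms for Expt. 2, giving contrast values of 51 and 31. These figures describe group means only and are no substitute for the participant-level tests reported in the paper.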
To assess changes in online learning across blocks, we applied a linear contrast with weights [-3, -1, 1, 3] to the blockwise anticipation scores for each participant and tested the group-level difference from zero using a two-tailed one-sample t-test for each experiment.

Multivariate regression analysis of representation strategies
The two measures of hidden layer activity, the sparse representation index (proportion of hidden layer nodes that do not show a significant activation difference by context) and the distributed representation index (within- versus between-context correlation differences), were used as predictors in multivariate regression models that aimed to explain variance in 2AFC context-dependent accuracy, context-independent accuracy, and the accuracy difference between contexts. These indices were first computed for each of the 50 models instantiated with each of the eight weight initialization configurations and then averaged within each configuration. The resulting eight values for each predictor were z-scored to enable comparison of beta coefficients across predictors. By including both metrics in the same regression models, we assessed their unique contributions to task performance. This allowed us to evaluate whether sparse or distributed representations were more predictive of learning outcomes, providing insight into the mechanisms underlying the GRU model’s ability to process and adapt to context-dependent associations.
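The regression setup described above, with two z-scored predictors entered jointly, can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the paper does not say which software fit these models, the z-scoring here uses the population standard deviation, the function names are hypothetical, and the input values are synthetic numbers constructed so that the true coefficients are known.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize values to zero mean and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def two_predictor_ols(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 for z-scored x1, x2.

    Because both predictors have zero mean, the intercept is mean(y) and
    the slopes solve a 2x2 system of normal equations.
    """
    n = len(y)
    y_mean = mean(y)
    yc = [v - y_mean for v in y]
    s11 = sum(a * a for a in x1) / n
    s22 = sum(b * b for b in x2) / n
    s12 = sum(a * b for a, b in zip(x1, x2)) / n
    s1y = sum(a * c for a, c in zip(x1, yc)) / n
    s2y = sum(b * c for b, c in zip(x2, yc)) / n
    det = s11 * s22 - s12 * s12  # zero if the predictors are collinear
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return y_mean, b1, b2


# Synthetic stand-ins for the eight configuration-level averages. These are
# NOT values from the study; the outcome is built from known coefficients.
sparse_z = zscore([1, 2, 3, 4, 5, 6, 7, 8])
distributed_z = zscore([2, 1, 4, 3, 6, 5, 8, 7])
outcome = [0.7 + 0.02 * s + 0.05 * d for s, d in zip(sparse_z, distributed_z)]

b0, beta_sparse, beta_distributed = two_predictor_ols(
    sparse_z, distributed_z, outcome
)
print(f"intercept={b0:.3f}, sparse beta={beta_sparse:.3f}, "
      f"distributed beta={beta_distributed:.3f}")
```

Fitting both predictors in the same model is what allows each coefficient to be read as a unique contribution: a predictor that only tracks the outcome through its correlation with the other predictor receives a coefficient near zero.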

Acknowledgements

F.P. was supported by the National Science Foundation Graduate Research Fellowship Program under Grant Nos. DGE-2034835 and DGE-2444110.

Author Contributions

Conceptualization: FCP, JR, HL; Methodology: FCP, JR, HL; Software: FCP; Formal analysis: FCP; Visualization: FCP; Supervision: JR, HL; Writing – original draft: FCP; Writing – review & editing: FCP, HL, JR.

Declaration of interests

The authors declare no competing interests.

References

1. Friston, K. (2010). The free-energy principle: a unified brain theory? Nat. Rev. Neurosci. 11, 127–138. https://doi.org/10.1038/nrn2787.
2. Bar, M. (2009). The proactive brain: memory for predictions. Philos. Trans. R. Soc. Lond. B. Biol. Sci. 364, 1235–1243. https://doi.org/10.1098/rstb.2008.0310.
3. Clark, A. (2013). Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behav. Brain Sci. 36, 181–204. https://doi.org/10.1017/S0140525X12000477.
4. Summerfield, C., and de Lange, F.P. (2014). Expectation in perceptual decision making: neural and computational mechanisms. Nat. Rev. Neurosci. 15, 745–756. https://doi.org/10.1038/nrn3838.
5. Heald, J.B., Lengyel, M., and Wolpert, D.M. (2023). Contextual inference in learning and memory. Trends Cogn. Sci. 27, 43–64. https://doi.org/10.1016/j.tics.2022.10.004.
6. Heald, J.B., Wolpert, D.M., and Lengyel, M. (2023). The Computational and Neural Bases of Context-Dependent Learning. Annu. Rev. Neurosci. 46, 233–258. https://doi.org/10.1146/annurev-neuro-092322-100402.
7. Statistical Learning (2015). 501–506. https://doi.org/10.1016/B978-0-12-397025-1.00276-1.
8. Statistical Learning (2015). In Brain Mapping (Elsevier), pp. 501–506. https://doi.org/10.1016/b978-0-12-397025-1.00276-1.
9. Sherman, B.E., Graves, K.N., and Turk-Browne, N.B. (2020). The prevalence and importance of statistical learning in human cognition and behavior. Curr. Opin. Behav. Sci. 32, 15–20. https://doi.org/10.1016/j.cobeha.2020.01.015.
10. Saffran, J.R., and Kirkham, N.Z. (2018). Infant Statistical Learning. Annu. Rev. Psychol. 69, 181–203. https://doi.org/10.1146/annurev-psych-122216-011805.
11. Fiser, J., and Aslin, R.N. (2001). Unsupervised Statistical Learning of Higher-Order Spatial Structures from Visual Scenes. Psychol. Sci. 12, 499–504. https://doi.org/10.1111/1467-9280.00392.
12. Saffran, J.R., Aslin, R.N., and Newport, E.L. (1996). Statistical Learning by 8-Month-Old Infants. Science 274, 1926–1928.
13. Conway, C.M., and Christiansen, M.H. (2005). Modality-Constrained Statistical Learning of Tactile, Visual, and Auditory Sequences. J. Exp. Psychol. Learn. Mem. Cogn. 31, 24–39. https://doi.org/10.1037/0278-7393.31.1.24.
14. Bouton, M.E. (1993). Context, time, and memory retrieval in the interference paradigms of Pavlovian learning. Psychol. Bull. 114, 80–99. https://doi.org/10.1037/0033-2909.114.1.80.
15. McAllister, D.E., and McAllister, W.R. (1994). Extinction and Reconditioning of Classically Conditioned Fear before and after Instrumental Learning: Effects of Depth of Fear Extinction. Learn. Motiv. 25, 339–367. https://doi.org/10.1006/lmot.1994.1018.
16. Bouton, M.E. (2004). Context and Behavioral Processes in Extinction. Learn. Mem. 11, 485–494. https://doi.org/10.1101/lm.78804.
17. Izquierdo, A., and Jentsch, J.D. (2012). Reversal learning as a measure of impulsive and compulsive behavior in addictions. Psychopharmacology (Berl.) 219, 607–620. https://doi.org/10.1007/s00213-011-2579-7.
18. Weiss, D.J., Gerfen, C., and Mitchel, A.D. (2009). Speech Segmentation in a Simulated Bilingual Environment: A Challenge for Statistical Learning? Lang. Learn. Dev. 5, 30–49. https://doi.org/10.1080/15475440802340101.
19. Gebhart, A.L., Aslin, R.N., and Newport, E.L. (2009). Changing Structures in Midstream: Learning Along the Statistical Garden Path. Cogn. Sci. 33, 1087–1116. https://doi.org/10.1111/j.1551-6709.2009.01041.x.
20. Siegelman, N., Bogaerts, L., Kronenfeld, O., and Frost, R. (2018). Redefining “Learning” in Statistical Learning: What Does an Online Measure Reveal About the Assimilation of Visual Regularities? Cogn. Sci. 42, 692–727. https://doi.org/10.1111/cogs.12556.
21. Qian, T., Jaeger, T.F., and Aslin, R.N. (2016). Incremental implicit learning of bundles of statistical patterns. Cognition 157, 156–173. https://doi.org/10.1016/j.cognition.2016.09.002.
22. Smith, C.M., Thompson-Schill, S.L., and Schapiro, A.C. (2024). Rapid Learning of Temporal Dependencies at Multiple Timescales. J. Cogn. Neurosci. 36, 2343–2356. https://doi.org/10.1162/jocn_a_02232.
23. Heald, J.B., Lengyel, M., and Wolpert, D.M. (2021). Contextual inference underlies the learning of sensorimotor repertoires. Nature 600, 489–493. https://doi.org/10.1038/s41586-021-04129-3.
24. Yamins, D.L.K., and DiCarlo, J.J. (2016). Using goal-driven deep learning models to understand sensory cortex. Nat. Neurosci. 19, 356–365. https://doi.org/10.1038/nn.4244.
25. Saxe, A., Nelli, S., and Summerfield, C. (2021). If deep learning is the answer, what is the question? Nat. Rev. Neurosci. 22, 55–67. https://doi.org/10.1038/s41583-020-00395-8.
26. Alamia, A., Gauducheau, V., Paisios, D., and VanRullen, R. (2020). Comparing feedforward and recurrent neural network architectures with human behavior in artificial grammar learning. Sci. Rep. 10, 22172. https://doi.org/10.1038/s41598-020-79127-y.
27. Flesch, T., Juechems, K., Dumbalska, T., Saxe, A., and Summerfield, C. (2022). Orthogonal representations for robust context-dependent task performance in brains and neural networks. Neuron 110, 1258-1270.e11. https://doi.org/10.1016/j.neuron.2022.01.005.
28. Lu, Q., Nguyen, T.T., Zhang, Q., Hasson, U., Griffiths, T.L., Zacks, J.M., Gershman, S.J., and Norman, K.A. (2024). Reconciling shared versus context-specific information in a neural network model of latent causes. Sci. Rep. 14, 16782. https://doi.org/10.1038/s41598-024-64272-5.
29. Franklin, N.T., Norman, K.A., Ranganath, C., Zacks, J.M., and Gershman, S.J. (2020). Structured Event Memory: A neuro-symbolic model of event cognition. Psychol. Rev. 127, 327–361. https://doi.org/10.1037/rev0000177.
30. Elman, J.L. (1990). Finding Structure in Time. Cogn. Sci. 14, 179–211. https://doi.org/10.1207/s15516709cog1402_1.
31. Hasson, U., Nastase, S.A., and Goldstein, A. (2020). Direct Fit to Nature: An Evolutionary Perspective on Biological and Artificial Neural Networks. Neuron 105, 416–434. https://doi.org/10.1016/j.neuron.2019.12.002.
32. Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks. In Advances in Neural Information Processing Systems (Curran Associates, Inc.).
33. Chizat, L., Oyallon, E., and Bach, F. (2019). On Lazy Training in Differentiable Programming. In Advances in Neural Information Processing Systems (Curran Associates, Inc.).
34. McClelland, J.L., McNaughton, B.L., and O’Reilly, R.C. (1995). Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychol. Rev. 102, 419–457. https://doi.org/10.1037/0033-295X.102.3.419.
35. Schapiro, A.C., Turk-Browne, N.B., Botvinick, M.M., and Norman, K.A. (2017). Complementary learning systems within the hippocampus: a neural network modelling approach to reconciling episodic memory with statistical learning. Philos. Trans. R. Soc. B Biol. Sci. 372, 20160049. https://doi.org/10.1098/rstb.2016.0049.
36. Leutgeb, J.K., Leutgeb, S., Moser, M.-B., and Moser, E.I. (2007). Pattern Separation in the Dentate Gyrus and CA3 of the Hippocampus. Science 315, 961–966. https://doi.org/10.1126/science.1135801.
37. Schapiro, A.C., Rogers, T.T., Cordova, N.I., Turk-Browne, N.B., and Botvinick, M.M. (2013). Neural representations of events arise from temporal community structure. Nat. Neurosci. 16, 486–492. https://doi.org/10.1038/nn.3331.
38. Welford, W.T., Brebner, J.M.T., and Kirby, N. (1980). Reaction Times (Stanford University).
39. Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (PMLR), pp. 1139–1147.
40. Dominé, C.C.J., Anguita, N., Proca, A.M., Braun, L., Kunin, D., Mediano, P.A.M., and Saxe, A.M. (2025). From Lazy to Rich: Exact Learning Dynamics in Deep Linear Networks. Preprint at arXiv, https://doi.org/10.48550/arXiv.2409.14623.
41. Hinton, G.E. (1986). Learning Distributed Representations of Concepts. Proc. Annu. Meet. Cogn. Sci. Soc. 8.
42. Destrebecqz, A., and Cleeremans, A. (2001). Can sequence learning be implicit? New evidence with the process dissociation procedure. Psychon. Bull. Rev. 8, 343–350. https://doi.org/10.3758/BF03196171.
43. Vékony, T., Farkas, B.C., Brezóczki, B., Mittner, M., Csifcsák, G., Simor, P., and Németh, D. (2025). Mind wandering enhances statistical learning. iScience 28. https://doi.org/10.1016/j.isci.2024.111703.
44. Conway, C.M. (2020). How does the brain learn environmental structure? Ten core principles for understanding the neurocognitive mechanisms of statistical learning. Neurosci. Biobehav. Rev. 112, 279–299. https://doi.org/10.1016/j.neubiorev.2020.01.032.
45. Perruchet, P., and Pacton, S. (2006). Implicit learning and statistical learning: one phenomenon, two approaches. Trends Cogn. Sci. 10, 233–238. https://doi.org/10.1016/j.tics.2006.03.006.
46. Cleeremans, A., and McClelland, J.L. (1991). Learning the structure of event sequences. J. Exp. Psychol. Gen. 120, 235–253. https://doi.org/10.1037/0096-3445.120.3.235.
47. Chiarella, S.G., Simione, L., D’Angiò, M., Saracini, C., Raffone, A., and Di Pace, E. (2026). Implicit observational learning of second-order conditional repeated sequences presented in rapid serial visual presentation. Conscious. Cogn. 137, 103967. https://doi.org/10.1016/j.concog.2025.103967.
48. O’Reilly, R.C., and Rudy, J.W. (2001). Conjunctive representations in learning and memory: Principles of cortical and hippocampal function. Psychol. Rev. 108, 311–345. https://doi.org/10.1037/0033-295X.108.2.311.
49. Glimcher, P.W. (2011). Understanding dopamine and reinforcement learning: The dopamine reward prediction error hypothesis. Proc. Natl. Acad. Sci. 108, 15647–15654. https://doi.org/10.1073/pnas.1014269108.
50. Zacks, J.M., Kurby, C.A., Eisenberg, M.L., and Haroutunian, N. (2011). Prediction Error Associated with the Perceptual Segmentation of Naturalistic Events. J. Cogn. Neurosci. 23, 4057–4066. https://doi.org/10.1162/jocn_a_00078.
51. Smith, C.M., Thompson-Schill, S.L., and Schapiro, A.C. (2024). Rapid Learning of Temporal Dependencies at Multiple Timescales. J. Cogn. Neurosci. 36, 2343–2356. https://doi.org/10.1162/jocn_a_02232.
52. Narkhede, M.V., Bartakke, P.P., and Sutaone, M.S. (2022). A review on weight initialization strategies for neural networks. Artif. Intell. Rev. 55, 291–322. https://doi.org/10.1007/s10462-021-10033-z.
53. Rigotti, M., Barak, O., Warden, M.R., Wang, X.-J., Daw, N.D., Miller, E.K., and Fusi, S. (2013). The importance of mixed selectivity in complex cognitive tasks. Nature 497, 585–590. https://doi.org/10.1038/nature12160.
54. Mızrak, E., Bouffard, N.R., Libby, L.A., Boorman, E.D., and Ranganath, C. (2021). The hippocampus and orbitofrontal cortex jointly represent task structure during memory-guided decision making. Cell Rep. 37, 110065. https://doi.org/10.1016/j.celrep.2021.110065.
55. Chanales, A.J.H., Oza, A., Favila, S.E., and Kuhl, B.A. (2017). Overlap among Spatial Memories Triggers Repulsion of Hippocampal Representations. Curr. Biol. 27, 2307-2317.e5. https://doi.org/10.1016/j.cub.2017.06.057.
56. Schapiro, A.C., Kustner, L.V., and Turk-Browne, N.B. (2012). Shaping of Object Representations in the Human Medial Temporal Lobe Based on Temporal Regularities. Curr. Biol. 22, 1622–1627. https://doi.org/10.1016/j.cub.2012.06.056.
57. Barlow, H. (2001). Redundancy reduction revisited. Netw. Bristol Engl. 12, 241–253.
58. Fusi, S., and Abbott, L.F. (2007). Limits on the memory storage capacity of bounded synapses. Nat. Neurosci. 10, 485–493. https://doi.org/10.1038/nn1859.
59. Forest, T.A., Schlichting, M.L., Duncan, K.D., and Finn, A.S. (2023). Changes in statistical learning across development. Nat. Rev. Psychol. 2, 205–219. https://doi.org/10.1038/s44159-023-00157-0.
60. Peirce, J., Gray, J.R., Simpson, S., MacAskill, M., Höchenberger, R., Sogo, H., Kastman, E., and Lindeløv, J.K. (2019). PsychoPy2: Experiments in behavior made easy. Behav. Res. Methods 51, 195–203. https://doi.org/10.3758/s13428-018-01193-y.
61. Hsu, N.S., Schlichting, M.L., and Thompson-Schill, S.L. (2014). Feature Diagnosticity Affects Representations of Novel and Familiar Objects. J. Cogn. Neurosci. 26, 2735–2749. https://doi.org/10.1162/jocn_a_00661.
62. Schlichting, M.L., Mumford, J.A., and Preston, A.R. (2015). Learning-related representational changes reveal dissociable integration and separation signatures in the hippocampus and prefrontal cortex. Nat. Commun. 6, 8151. https://doi.org/10.1038/ncomms9151.
63. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. Preprint at arXiv, https://doi.org/10.48550/arXiv.1912.01703.
64. Kriegeskorte, N. (2011). Pattern-information analysis: From stimulus decoding to computational-model testing. NeuroImage 56, 411–421. https://doi.org/10.1016/j.neuroimage.2011.01.061.

Supplementary Information Appendix

Fig. S1. 2AFC test performance by direct and indirect conflict question subsets.
Bar height reflects group average on 2AFC context-dependent questions subsetted by indirect (light coloring) and direct (dark coloring) conflict for Context A (left, orange bars) and Context B (right, green bars). ***p<0.001; **p<0.01; *p<0.05.

Fig. S2. Visualization of stimulus presentation during learning phase.
Each trial of the learning phase featured four stimuli arranged as depicted, with one object of interest (on which participants needed to make an ×/+ judgment) and three circular phase-scrambled objects presented in the remaining positions. A black or white border was present during Expt. 2. Visualization of objects and border is to scale.

Table S1. Reaction times (mean ± standard deviation) in milliseconds by block for context-dependent pair objects. Item 1 is the first, unpredictable element of each pair; Item 2 is the second, predictable element informed by the associative expectation.

| Block | Expt 1: Unsignaled, Item 1 | Expt 1: Unsignaled, Item 2 | Expt 2: Signaled, Item 1 | Expt 2: Signaled, Item 2 |
| --- | --- | --- | --- | --- |
| Block 1 | 712 ± 72 | 727 ± 75 | 710 ± 67 | 713 ± 57 |
| Block 2 | 668 ± 77 | 679 ± 84 | 662 ± 76 | 664 ± 64 |
| Block 3 | 650 ± 76 | 655 ± 85 | 648 ± 77 | 643 ± 62 |
| Block 4 | 639 ± 74 | 639 ± 83 | 631 ± 71 | 626 ± 62 |

S1 Confidence judgments and explicit knowledge assessments

We examined whether participants’ confidence ratings on the 2AFC task were related to their accuracy. Confidence was significantly higher for accurate than inaccurate 2AFC responses for context-dependent trials in both experiments (Unsignaled: t(49) = 4.1, p < 0.001; Signaled: t(49) = 3.96, p < 0.001) and on context-independent trials in Expt. 1 (Unsignaled; t(37) = 2.6, p = 0.013) but not Expt. 2 (t(39) = 0.38, p = 0.7) (Fig. S3A). Participants who responded entirely correctly or incorrectly were excluded from analysis; all participants had mixed accuracy on context-dependent trials.
Overall, mean confidence ratings were around or below the midpoint of the scale, indicating generally low subjective certainty during the 2AFC task.

Following the 2AFC task, participants completed two additional assessments designed to measure explicit knowledge of the temporal structure: a Structure Knowledge Probe and a Pair Reconstruction Task.

For the Structure Knowledge Probe, binary performance was assessed by manually evaluating whether participants articulated explicit awareness of temporal pair structure in their written responses. This measure did not evaluate knowledge of the dual context structure (e.g., participants did not need to articulate awareness that there were two distinct contexts where the associative pairings changed). Two independent raters coded all responses with 91% agreement; discrepancies were resolved by deferring to the more senior grader. Explicit knowledge of the pair structure was identified in 34.0% of participants in Expt. 1 and 40.0% in Expt. 2.

Performance on the Pair Reconstruction Task varied because participants could report between 1 and 20 pairs. To estimate chance performance, we ran Monte Carlo simulations, with 1,000 simulations for each possible number of reported pairs (k = 1–20). In each simulation, objects were sampled from the 11 unique objects with replacement between pairs but without replacement within a pair (i.e., no pair comprised the same object twice). This produced, for each k, a null distribution of the proportion of correct entries expected by chance. This empirical approach matches the analytical solution: there were 9 correct pairs (because one pair was context-independent and thus correct in both contexts), and the probability of guessing any one specific pair by chance was 1/110 (choosing 2 of the 11 objects in order without replacement, giving 11 × 10 = 110 possible pairs).
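The chance-level simulation described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the set `CORRECT_PAIRS` is a hypothetical placeholder (any 9 distinct ordered pairs of distinct objects yield the same chance rate), treating pairs as ordered is an assumption consistent with the 1/110 figure, and the function names and seeds are invented for the example.

```python
import random

N_OBJECTS = 11      # unique objects, as stated in the text
N_SIMULATIONS = 1000  # simulations per number of reported pairs k

# Hypothetical placeholder: 9 distinct ordered (first, second) pairs.
# The real pair assignments are not given here and were randomized per participant.
CORRECT_PAIRS = {
    (0, 1), (2, 3), (4, 5), (6, 7), (8, 9),
    (0, 3), (2, 1), (4, 7), (6, 10),
}


def simulate_proportion_correct(k, rng):
    """Guess k random pairs and return the proportion that are correct."""
    hits = 0
    for _ in range(k):
        # Two different objects within a pair (no replacement within a pair);
        # objects may recur across pairs (replacement between pairs).
        first, second = rng.sample(range(N_OBJECTS), 2)
        hits += (first, second) in CORRECT_PAIRS
    return hits / k


def null_distribution(k, n_sims=N_SIMULATIONS, seed=0):
    """Null distribution of proportion correct for k reported pairs."""
    rng = random.Random(seed)
    return [simulate_proportion_correct(k, rng) for _ in range(n_sims)]


# Analytical chance rate: 9 correct pairs out of 11 * 10 ordered pairs.
analytic_rate = len(CORRECT_PAIRS) / (N_OBJECTS * (N_OBJECTS - 1))

# Mean of the simulated null distribution for each k = 1..20.
null_means = {
    k: sum(null_distribution(k, seed=k)) / N_SIMULATIONS for k in range(1, 21)
}
```

Under these assumptions, each entry of `null_means` should lie close to `analytic_rate` (9/110), with more sampling noise at small k.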
The probability of guessing one of the 9 correct pairs was therefore 9/110, or 8.2%.

Participants reported an average of 7.4 ± 4 pairs in Expt. 1 and 7.9 ± 4 pairs in Expt. 2 (Fig. S3B). Group-level significance was assessed by testing whether the average number of correct context-independent pairs was greater than expected by chance across the simulations of all possible pair entry counts. Context-dependent pair entry performance was non-significant for both experiments (Unsignaled: mean = 2.06 pairs; p = 0.11; Signaled: mean = 2.08 pairs; p = 0.11; Fig. S3B). Moreover, only 36% of Expt. 1 participants and 34.7% of Expt. 2 participants (i.e., 17 out of 49 participants; one participant did not complete this portion of the experiment) reported the context-independent pair.

Taken together, these results indicate that most participants had little to no explicit knowledge of the temporal pair structure: they were generally unable to recall the context-independent pair, articulate the underlying pair structure, or reconstruct the context-dependent associations. Thus, the significant 2AFC performance reflecting context-dependent learning is unlikely to have been driven by explicit awareness.

Fig. S3. Metacognitive awareness assessment results.
(A) Average 2AFC confidence rating for context-dependent (left) and context-independent (right) questions for accurate (pink) and inaccurate (blue) question responses. Horizontal dashed line indicates midpoint of the confidence scale, and error bars reflect SEM.
(B) Pair Reconstruction Task performance: bar height reflects group average number of total pairs reported (left) and correct context-independent pairs reported (right). Horizontal line reflects chance performance based on Monte Carlo simulations; error bars reflect SEM.

S2 Model architecture comparison

In the main paper, we analyze a GRU model whose training was constrained to a single epoch, equivalent to the total sequence exposure of each human participant. Here, we justify that decision by comparison with two simpler models: a feedforward neural network (FFNN) and a vanilla recurrent neural network (RNN) that lacked gated recurrent units. One learning phase sequence and one set of 2AFC questions were generated for each model in the same way as for human participants (1,600 objects), and each epoch of training consisted of updating model weights to predict the next item in this sequence, followed by an assessment of 2AFC accuracy with frozen weights. For each model, we continued this process for a total of 50 epochs (i.e., 50 times the sequence exposure given to human participants). All models here used the default PyTorch weight initialization, in which weights are drawn from a uniform distribution bounded by plus or minus the inverse of the square root of the layer size, which was 0.08 for the hidden layer.

The simplest architecture, the FFNN, achieved an overall context-dependent accuracy of almost 75% (Fig. S4A). However, this performance was entirely driven by near-perfect accuracy on Context B, the most recently trained context, while accuracy on Context A remained near chance. This indicates that the FFNN retained knowledge only about the most recent associations, completely overwriting previously learned, conflicting ones, which is a hallmark of catastrophic interference.
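Two figures quoted in this section can be checked with simple arithmetic, sketched here in Python. The hidden layer size of 150 is taken from S5 and assumed to apply to these models; the accuracy values are idealized stand-ins for "near chance" and "near-perfect" (not reported data), equal numbers of Context A and Context B questions are assumed, and the function names are invented for the example.

```python
import math


def default_uniform_bound(layer_size: int) -> float:
    """Bound b of a uniform(-b, b) initialization with b = 1 / sqrt(layer size)."""
    return 1.0 / math.sqrt(layer_size)


def pooled_accuracy(acc_a: float, acc_b: float, n_a: int = 1, n_b: int = 1) -> float:
    """Accuracy over both contexts, weighted by the number of questions in each."""
    return (acc_a * n_a + acc_b * n_b) / (n_a + n_b)


HIDDEN_UNITS = 150  # from S5; assumed to be the layer size used here
bound = default_uniform_bound(HIDDEN_UNITS)  # about 0.082, i.e. 0.08 to two decimals

# Idealized values, not reported data: chance on a two-alternative question
# is 0.5 (Context A), and near-perfect accuracy is taken as 1.0 (Context B).
ffnn_overall = pooled_accuracy(acc_a=0.5, acc_b=1.0)
```

With these idealized inputs the pooled accuracy is 0.75, which is why an overall score near 75% is compatible with one context being completely overwritten.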
RNNs improve on feedforward model capabilities by incorporating information from past hidden states with the current state, enabling them to process sequential input. However, the RNN showed no improvement in overall context-dependent accuracy compared to the FFNN (Fig. S4B). While Context A performance did increase over learning, this improvement came at the expense of Context B performance, suggesting the RNN is also prone to interference.

The RNN with GRUs, an advanced RNN variant, overcomes these limitations by using update and reset gates to manage long-term dependencies more effectively. Initially, GRU performance was comparable to the FFNN (Fig. S4C). However, with extended training (approximately 20 epochs; i.e., 20 times the exposure of human participants), the GRU achieved comparably high accuracy on both Context A and Context B. This performance likely stems from the GRU’s architectural advantages. The update gate controls how much new input influences retained memory, allowing the model to ignore unreliable input (such as noisy between-pair transitions). The reset gate allows the selective clearing of irrelevant information in response to context changes, thereby avoiding interference from outdated associations. Given that the GRU model provides the best account for context-dependent learning in humans, we used the GRU model for all of the single-epoch modeling analyses reported in the main text.

Fig. S4. 2AFC task performance for three neural network model classes.
(A-C) 2AFC performance accuracy on context-dependent questions averaged for 50 individual models for each architecture after each of 50 epochs of training; accuracy is plotted separately for Context A (orange), Context B (green), and both contexts combined (blue) for each model class: (A) simple feedforward neural network (FFNN), (B) vanilla recurrent neural network (RNN), and (C) recurrent neural network with gated recurrent units (GRU). All models achieved perfect accuracy on context-independent questions after the first epoch (not pictured).

S3 GRU performance with perceptual object representations

The main paper used one-hot vector representations for each object in the modeling analysis. This choice ensured that all objects were represented equally and orthogonally, such that any structure emerging in the hidden layer reflected purely learned associations rather than preexisting similarities among the inputs. Here, we present the same analysis using perceptual object representations that more closely approximate the visual experience of human participants in the task. Perceptual object representations were generated by inputting each object image (without the overlaid plus or minus symbol) into AlexNet (1) and then applying PCA to reduce the dimensionality to 11 dimensions, matching the number of input and output dimensions of the original model. GRU networks were trained on the same context-dependent sequential prediction task as in the main text, using cosine similarity as the loss function, and assignment of objects to specific pairs was randomized for each model in the same way as for human participants.

As shown in Fig. S5, the relationship between initialized weight variance and 2AFC accuracy retained the same non-monotonic profile observed in the models that used one-hot input coding (Fig. 4A).
In addition, 2AFC performance was higher for all question subsets. This improvement is unsurprising: the use of AlexNet embeddings introduces a strong visual prior that allows the model to exploit shared perceptual features when making predictions, thereby obscuring the interpretability of how the hidden layer activity represents the task’s temporal associative structure. For example, the model could leverage arbitrary similarities in dimensions such as shape and color to bias its 2AFC responses. In contrast, one-hot encodings constrain all non-active dimensions to zero, ensuring that any hidden layer structure arises exclusively from learning the task’s associative regularities. Taken together, these results confirm that the core finding of optimal task performance emerging at moderate weight initialization variance holds regardless of input representation. We therefore focus analysis on models trained with one-hot object encodings because they provide a controlled representational space in which hidden layer structure reflects learning of the task’s associative structure rather than preexisting perceptual relationships among stimuli.

Fig. S5. 2AFC performance with perceptual object embeddings.
2AFC accuracy (y-axis) on context-dependent test trials for GRU models with weights initialized with increasing variance (x-axis), color-coded by question category.
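The contrast drawn in S3 between orthogonal one-hot inputs and perceptually similar embeddings, and the use of cosine similarity for training, can be illustrated with a small Python sketch. The text states only that cosine similarity served as the loss function; the specific form used below (one minus the cosine similarity) is an assumption, as are the function names, and zero-length vectors are not handled.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_loss(prediction, target):
    """Assumed loss form: 0 when prediction and target point the same way."""
    return 1.0 - cosine_similarity(prediction, target)


def one_hot(index, n_objects=11):
    """One-hot code for an object: a single 1.0, zeros elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(n_objects)]


# Distinct one-hot codes are orthogonal, so they share no input similarity.
similarity_between_objects = cosine_similarity(one_hot(0), one_hot(1))

# A perfect prediction of the target object gives zero loss.
loss_for_perfect_prediction = cosine_loss(one_hot(3), one_hot(3))
```

Dense embeddings such as PCA-reduced network features generally have non-zero pairwise cosine similarity, which is the preexisting structure that one-hot coding removes.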
S4 GRU performance with a fully overlapping stimulus set (no context-specific objects)

The pair assignment across contexts for both models and human participants included one possible caveat to our claim of latent context-dependent learning: for one of the context-dependent pair types, the second (paired) item appeared in only one of the contexts while the first item occurred in both (visualized in the bottom row of Fig. 1B). This structure still required humans and the models to update their prediction of the second item based on the inferred context (consistent with all other context-dependent pairs), but also meant that the context-specific second item could have been used as a context cue independent of recent sequence history (Expt. 1) and/or border color (Expt. 2). In other words, an encounter with a context-specific object could be an indicator that the state of the world has changed and thus that one’s associative predictions should be updated. We note that our decision to include this pair type was motivated by our intention to collect fMRI data with this paradigm, which will allow us to assess changes in neural representational geometry when an object’s associative identity remains constant across contexts, providing a baseline for evaluating relative changes in other pair conditions.
To evaluate whether the presence of context-specific objects influenced model learning, we ran neural network simulations in which such pairs were removed and replaced with context-dependent pairs for which both objects could occur in either context. These models were trained on the same task, but the context-dependent pairs were reconfigured to maintain the overall object set, with second-item assignments shuffled across contexts. Model input and output dimensions were therefore reduced to 10, corresponding to the 10 unique object encodings needed to instantiate this modified pair set. All other training parameters and analysis of 2AFC performance were identical to those in the main text.

As shown in Fig. S6, model performance across weight initializations closely mirrored the results of the main analysis (Fig. 4A), indicating that learning dynamics and context-dependent accuracy were unaffected by the presence or absence of the context-specific object. This indicates that such objects did not serve as reliable context cues for our models.

Fig. S6. 2AFC performance with no context-specific objects measured by weight variance.
2AFC accuracy (y-axis) on context-dependent test trials for GRU models with weights initialized with increasing variance (x-axis), color-coded by question category.

S5 Determination of hidden layer size

The main paper analyzes a GRU model with 150 hidden layer units. To assess whether model capacity influenced learning performance, we trained GRU models with reduced hidden layer sizes of 50 and 100 units. As shown in Fig. S7, all models ultimately achieved comparable performance, converging to approximately 90% accuracy. However, models with fewer hidden layer units exhibited slower learning trajectories, requiring more training to reach the same level of performance as the model with 150 units. These results suggest that while increasing the number of hidden units accelerates learning, overall task performance is largely independent of model size.

Fig. S7. 2AFC task performance for GRU models with varying hidden layer sizes.
(A-C) 2AFC performance accuracy on context-dependent questions averaged for 50 individual models for each architecture after each epoch of training for Context A (orange), Context B (green), and both contexts combined (blue) for GRU models with (A) 50 hidden layer units, (B) 100 hidden layer units, and (C) 150 hidden layer units.
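To give a sense of how much model capacity differs across the three hidden layer sizes compared in S5, the following Python sketch counts the trainable parameters of the recurrent layer. It assumes a single-layer, unidirectional GRU in the parameter layout used by PyTorch's `nn.GRU` (three weight blocks for the reset gate, update gate, and candidate state, each with input weights, recurrent weights, and two bias vectors) and an input size of 11 as in the one-hot models; any readout layer is excluded, and the function name is invented for the example.

```python
def gru_parameter_count(input_size: int, hidden_size: int) -> int:
    """Trainable parameters of one unidirectional GRU layer (PyTorch layout)."""
    per_block = (
        hidden_size * input_size     # input-to-hidden weights
        + hidden_size * hidden_size  # hidden-to-hidden weights
        + 2 * hidden_size            # input and recurrent bias vectors
    )
    # Three blocks: reset gate, update gate, candidate hidden state.
    return 3 * per_block


INPUT_SIZE = 11  # one-hot object codes, as in the main-text models

# Parameter counts for the hidden layer sizes compared in S5.
counts = {h: gru_parameter_count(INPUT_SIZE, h) for h in (50, 100, 150)}
```

Under these assumptions the recurrent layer grows from 9,450 parameters at 50 units to 73,350 at 150 units, so the models reaching similar final accuracy differ in size by nearly a factor of eight.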

References

1. A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks in Advances in Neural Information Processing Systems, (Curran Associates, Inc., 2012).

License: CC-BY-4.0